
Re: [Condor-users] Help with Condor-PVM



Hi,

I sent an e-mail last week explaining my Condor-PVM problems. Since then, I have
recompiled a sample program against PVM version 3.4.2, which is supposed to be
the version Condor-PVM supports, but I get exactly the same problems.

Basically I would like to know if there is anyone out there with Condor 6.6.7 on
Linux running Condor-PVM successfully. The instructions in the Condor manual are
not very clear on PVM, so I'm not sure what (if anything) I'm doing wrong.

Any help would be appreciated. (The ShadowLog from the submitting machine is
included below.)

Thanks a lot,
Angel de Vicente


-------------------------------------------------------------


12/7 11:03:53 (?.?) (30843):********** Multi_Shadow starting up **********
12/7 11:03:53 (?.?) (30843):uid=0, euid=2120, gid=0, egid=20
12/7 11:03:53 (?.?) (30843):My_Filesystem_Domain = "iac.es"
12/7 11:03:53 (?.?) (30843):My_UID_Domain = "iac.es"
12/7 11:03:53 (?.?) (30843):Shadow reading via ASCII
12/7 11:03:53 (?.?) (30843):First Line: 457 0 1
12/7 11:03:53 (457.0) (30843):Created class:
12/7 11:03:53 (457.0) (30843):#0: 0 (1, 2) has 0
12/7 11:03:53 (457.0) (30843):New process for proc 0
12/7 11:03:53 (457.0) (30843):AllocProc() returning 0
12/7 11:03:53 (457.0) (30843):Machine from schedd: <161.72.80.28:32792> <161.72.80.28:32792>#4144361396 0
12/7 11:03:53 (457.0) (30843):Machine Line: canistel.iac.es 0
12/7 11:03:53 (457.0) (30843):Machines now cur = 1 desire = 2
12/7 11:03:53 (457.0) (30843):Updated class:
12/7 11:03:53 (457.0) (30843):#0: 0 (1, 2) has 1
12/7 11:03:53 (457.0) (30843):Starting pvmd: /usr/pkg/condor/condor/sbin/condor_pvmd -d0x11c
12/7 11:03:54 (457.0) (30843):PVM is pid 30852
12/7 11:03:54 (457.0) (30843):pvmd response: /tmp/fileSdOmBH
12/7 11:03:54 (457.0) (30843):PVMSOCK=/tmp/fileSdOmBH
12/7 11:03:54 (457.0) (30843):pvm_fd = 4, mytid = t40001
12/7 11:03:54 (457.0) (30843):Entered StartWaitingHosts()
12/7 11:03:54 (457.0) (30843):Ok to start waiting hosts
12/7 11:03:55 (457.0) (30843):Shadow reading via ASCII
12/7 11:03:55 (457.0) (30843):First Line: 457 0 1
12/7 11:03:55 (457.0) (30843):Machine from schedd: <161.72.81.36:32792> <161.72.81.36:32792>#2835371380 0
12/7 11:03:55 (457.0) (30843):Machine Line: jilguero.iac.es 0
12/7 11:03:55 (457.0) (30843):Machines now cur = 2 desire = 2
12/7 11:03:55 (457.0) (30843):Updated class:
12/7 11:03:55 (457.0) (30843):#0: 0 (1, 2) has 2
12/7 11:03:55 (457.0) (30843):Entered StartWaitingHosts()
12/7 11:03:55 (457.0) (30843):Can't start new machines now{ canistel.iac.es}
12/7 11:03:55 (457.0) (30843):PVMd message is SM_STHOST from t80000000
12/7 11:03:55 (457.0) (30843):SM_STHOST: 80000 "" "161.72.80.28" "$PVM_ROOT/lib/pvmd -s -d0x11c -ncanistel.iac.es 1 a14851bb:88fc 4080 2 a148501c:0000"
12/7 11:03:55 (457.0) (30843):New process for proc 0
12/7 11:03:55 (457.0) (30843):AllocProc() returning 0
12/7 11:03:55 (457.0) (30843):Shadow: Entering multi_send_job(canistel.iac.es)
12/7 11:03:56 (457.0) (30843):Requesting Alternate Starter 1
12/7 11:03:56 (457.0) (30843):Shadow: Request to run a job was REFUSED
12/7 11:03:56 (457.0) (30843):RemoveHost: Sending HostDelete notify on t80080000
12/7 11:03:56 (457.0) (30843):SendNotification(kind = 2, tid = t80080000)
12/7 11:03:56 (457.0) (30843):signal_startd( canistel.iac.es, 443 )
12/7 11:03:56 (457.0) (30843):Adding host canistel.iac.es to STARTACK msg.
12/7 11:03:56 (457.0) (30843):Num Hosts to pack = 1
12/7 11:03:56 (457.0) (30843):Packing tid t80000 with reply PvmNoHost
12/7 11:03:56 (457.0) (30843):Sending SM_STHOSTACK to PVMd
12/7 11:03:56 (457.0) (30843):Entered StartWaitingHosts()
12/7 11:03:56 (457.0) (30843):Can't start new machines now{ canistel.iac.es}
12/7 11:03:57 (457.0) (30843):Updated class:
12/7 11:03:57 (457.0) (30843):#0: 0 (1, 2) has 1
12/7 11:03:57 (457.0) (30843):PVMd message is SM_ADDACK from t80000000
12/7 11:03:57 (457.0) (30843):pvmd reports error -6 on SM_ADDACK: PvmNoHost
12/7 11:03:57 (457.0) (30843):pvm_machines_starting = 0(should be 0)
12/7 11:03:57 (457.0) (30843):StartLocalProcess: = /home/angelv/SCRIPTS/CONDOR/PVM/master1 < /dev/null > out.dat >& err.dat
12/7 11:03:57 (457.0) (30843):Not enough machines in class 0 to start local proc.
12/7 11:03:57 (457.0) (30843):Entered StartWaitingHosts()
12/7 11:03:57 (457.0) (30843):Ok to start waiting hosts
12/7 11:03:57 (457.0) (30843):PVMd message is SM_STHOST from t80000000
12/7 11:03:57 (457.0) (30843):SM_STHOST: c0000 "" "161.72.81.36" "$PVM_ROOT/lib/pvmd -s -d0x11c -njilguero.iac.es 1 a14851bb:88fc 4080 3 a1485124:0000"
12/7 11:03:57 (457.0) (30843):New process for proc 0
12/7 11:03:57 (457.0) (30843):AllocProc() returning 0
12/7 11:03:57 (457.0) (30843):Shadow: Entering multi_send_job(jilguero.iac.es)
12/7 11:03:57 (457.0) (30843):Requesting Alternate Starter 1
12/7 11:03:57 (457.0) (30843):Shadow: Request to run a job was REFUSED
12/7 11:03:57 (457.0) (30843):RemoveHost: Sending HostDelete notify on t800c0000
12/7 11:03:57 (457.0) (30843):SendNotification(kind = 2, tid = t800c0000)
12/7 11:03:57 (457.0) (30843):signal_startd( jilguero.iac.es, 443 )
12/7 11:03:57 (457.0) (30843):Adding host jilguero.iac.es to STARTACK msg.
12/7 11:03:57 (457.0) (30843):Num Hosts to pack = 1
12/7 11:03:57 (457.0) (30843):Packing tid tc0000 with reply PvmNoHost
12/7 11:03:57 (457.0) (30843):Sending SM_STHOSTACK to PVMd
12/7 11:03:57 (457.0) (30843):Entered StartWaitingHosts()
12/7 11:03:57 (457.0) (30843):Can't start new machines now{ jilguero.iac.es}
12/7 11:03:57 (457.0) (30843):Updated class:
12/7 11:03:57 (457.0) (30843):#0: 0 (1, 2) has 0
12/7 11:03:57 (457.0) (30843):PVMd message is SM_ADDACK from t80000000
12/7 11:03:57 (457.0) (30843):pvmd reports error -6 on SM_ADDACK: PvmNoHost
12/7 11:03:57 (457.0) (30843):pvm_machines_starting = 0(should be 0)
12/7 11:03:57 (457.0) (30843):StartLocalProcess: = /home/angelv/SCRIPTS/CONDOR/PVM/master1 < /dev/null > out.dat >& err.dat
12/7 11:03:57 (457.0) (30843):Not enough machines in class 0 to start local proc.
12/7 11:03:57 (457.0) (30843):Entered StartWaitingHosts()
12/7 11:03:57 (457.0) (30843):Ok to start waiting hosts
12/7 11:07:52 (457.0) (30843):Got SIGUSR1
12/7 11:07:52 (457.0) (30843):Multi_Shadow: Shutting down...
12/7 11:07:53 (457.0) (30843):Updated class:
12/7 11:07:53 (457.0) (30843):#0: 0 (1, 2) has 0
12/7 11:07:53 (457.0) (30843):in new_timer()
12/7 11:07:53 (457.0) (30843):Timer List
12/7 11:07:53 (457.0) (30843):^^^^^ ^^^^
12/7 11:07:53 (457.0) (30843):id = 0, when = 300
12/7 11:07:53 (457.0) (30843):deadpid = 30852
12/7 11:07:53 (457.0) (30843):Lost local pvmd termsig = 9, retcode = 0
12/7 11:07:53 (457.0) (30843):deadpid = -1
12/7 11:07:53 (457.0) (30843):No more dead processes(errno = 10)
12/7 11:07:53 (457.0) (30843):MultiShadow Exiting!!!
12/7 11:07:53 (457.0) (30843):********** Shadow Parent Exiting(4) **********
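
As a reading aid for a log like the one above, here is a minimal Python sketch
that pulls out only the refused job requests and the pvmd error reports. The
function name summarize_shadow_log and the two regular expressions are
illustrative assumptions based solely on the line layout visible in this log;
they are not part of Condor or PVM. It also assumes that a REFUSED line belongs
to the host named in the most recent multi_send_job line.

```python
import re

# Assumed line layout, taken from the log above:
#   <date> <time> (<cluster.proc>) (<pid>):<message>
LINE_RE = re.compile(r"^(\d+/\d+ \d+:\d+:\d+) \((\S+)\) \((\d+)\):(.*)$")
# The shadow names the target host when it tries to send the job.
SEND_RE = re.compile(r"Entering multi_send_job\(([^)]+)\)")


def summarize_shadow_log(lines):
    """Return (timestamp, host, event) tuples for refusals and pvmd errors.

    The host is the one from the most recent multi_send_job line
    (an assumption about how the log lines relate to each other).
    """
    events = []
    current_host = None
    for raw in lines:
        match = LINE_RE.match(raw.rstrip("\n"))
        if not match:
            continue  # skip blank lines and anything in another format
        stamp, _job, _pid, msg = match.groups()
        send = SEND_RE.search(msg)
        if send:
            current_host = send.group(1)
        elif "Request to run a job was REFUSED" in msg:
            events.append((stamp, current_host, "REFUSED"))
        elif msg.startswith("pvmd reports error"):
            events.append((stamp, current_host, msg))
    return events
```

Calling summarize_shadow_log(open(path)) on a saved copy of the log (path being
wherever the ShadowLog was saved) returns a short list of tuples. For the log
above, that would show each host being refused and then reported to pvmd as
PvmNoHost.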

-- 
----------------------------------
http://www.iac.es/galeria/angelv/

PostDoc Software Support
Instituto de Astrofisica de Canarias