
Re: [Condor-users] Help with Condor-PVM



Hi Angel,

I have used Condor-PVM with Condor 6.6.6 and also with 6.7.2,
successfully for the most part.  I usually use Condor-PVM through the
MW tool.

What does condor_status -long tell you about the StarterAbilityList on
the machines in your pool (for example, jilguero.iac.es)?
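
In case it saves some scrolling, here is a rough sketch of how one
might pick that attribute out of saved condor_status -long output (for
instance, the output of "condor_status -long jilguero.iac.es" redirected
to a file). I am assuming the usual layout of one "Attribute = value"
per line, a blank line between machines, and StarterAbilityList stored
as a quoted, comma-separated string. The function name and the sample
text are made up for illustration; they are not output from your pool.

```python
import re


def has_pvm_by_machine(status_long_text):
    """Map each machine Name to True/False: is HasPVM in StarterAbilityList?"""
    result = {}
    # Assumption: machine ads are separated by one or more blank lines.
    for ad in re.split(r"\n\s*\n", status_long_text.strip()):
        attrs = {}
        for line in ad.splitlines():
            # Split on the first '=' only, so expressions containing '=='
            # in the value are left intact.
            name, sep, value = line.partition("=")
            if sep:
                attrs[name.strip()] = value.strip().strip('"')
        if "Name" not in attrs:
            continue
        # Assumption: the ability list is a comma-separated string.
        abilities = [a.strip()
                     for a in attrs.get("StarterAbilityList", "").split(",")]
        result[attrs["Name"]] = "HasPVM" in abilities
    return result


# Hypothetical sample in the shape described above (not real pool output).
sample = '''Name = "jilguero.iac.es"
StarterAbilityList = "HasFileTransfer,HasPVM,HasRemoteSyscalls"

Name = "canistel.iac.es"
StarterAbilityList = "HasFileTransfer,HasRemoteSyscalls"
'''

for machine, ok in has_pvm_by_machine(sample).items():
    print(machine, "HasPVM" if ok else "no HasPVM")
```

A machine that has no StarterAbilityList line at all is reported as
lacking HasPVM, which seemed the safer default to me.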

If StarterAbilityList doesn't contain HasPVM, you will likely have
problems...
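
In the ShadowLog you sent, each "Entering multi_send_job(host)" line is
followed by "Requesting Alternate Starter 1" and then "Request to run a
job was REFUSED". My guess, and it is only a guess, is that this is
what a missing HasPVM looks like from the shadow's side. If the log
gets long, something like the sketch below would list the hosts that
refused. The function name is made up; the two message strings are
copied from your log.

```python
import re

# Matches the host name in lines like
# "Shadow: Entering multi_send_job(canistel.iac.es)".
SEND_RE = re.compile(r"Entering multi_send_job\(([^)]+)\)")


def refused_hosts(shadow_log_text):
    """Return hosts whose multi_send_job was followed by a REFUSED line."""
    refused = []
    current_host = None
    for line in shadow_log_text.splitlines():
        match = SEND_RE.search(line)
        if match:
            # Remember which host the shadow is currently trying to start.
            current_host = match.group(1)
        elif "Request to run a job was REFUSED" in line and current_host:
            if current_host not in refused:
                refused.append(current_host)
            current_host = None
    return refused


# Excerpt taken from the ShadowLog quoted in this thread.
log_excerpt = """\
12/7 11:03:55 (457.0) (30843):Shadow: Entering multi_send_job(canistel.iac.es)
12/7 11:03:56 (457.0) (30843):Requesting Alternate Starter 1
12/7 11:03:56 (457.0) (30843):Shadow: Request to run a job was REFUSED
12/7 11:03:57 (457.0) (30843):Shadow: Entering multi_send_job(jilguero.iac.es)
12/7 11:03:57 (457.0) (30843):Requesting Alternate Starter 1
12/7 11:03:57 (457.0) (30843):Shadow: Request to run a job was REFUSED
"""

print(refused_hosts(log_excerpt))
```

The StartLog and StarterLog on the refusing machines may say more about
why the request was turned down.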

If you have installed the condor-pvm contrib module (condor_starter.pvm,
etc.), you may need to restart the daemons (or at least reconfig them)
to get the HasPVM attribute into the StarterAbilityList...
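
If there are more than a couple of machines, a small wrapper around
condor_reconfig and condor_restart can save some typing. The sketch
below is only that: the function names are made up, it assumes both
tools are on the PATH and that you run it as a user with administrator
access to those hosts, and by default it only prints what it would run.

```python
import subprocess


def daemon_command(host, restart=False):
    """Build the condor_reconfig (or condor_restart) command for one host."""
    tool = "condor_restart" if restart else "condor_reconfig"
    return [tool, host]


def apply_to_hosts(hosts, restart=False, dry_run=True):
    """Reconfig (or restart) the Condor daemons on each host.

    With dry_run=True nothing is executed; the commands are only printed.
    """
    commands = [daemon_command(host, restart) for host in hosts]
    for cmd in commands:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # Assumption: the tools are installed and we are allowed to
            # send these commands to the given hosts.
            subprocess.run(cmd, check=True)
    return commands


# Host names taken from the ShadowLog in this thread.
apply_to_hosts(["canistel.iac.es", "jilguero.iac.es"])
```

I would try a reconfig first and only pass restart=True if HasPVM still
does not show up afterwards.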

I hope this helps.  Maybe someone on the Condor Team can shed more
light...

Best of luck.

Cheers,
-Jeff




On Thu, 2004-12-09 at 14:37 +0000, Angel de Vicente wrote:
> Hi,
> 
> I sent an e-mail last week explaining my Condor-PVM problems. Since then, I have
> tried to recompile a sample code with PVM version 3.4.2, which is supposed to be
> the one Condor-PVM supports, but I get exactly the same problems.
> 
> Basically I would like to know if there is anyone out there with Condor 6.6.7 on
> Linux running Condor-PVM successfully. The instructions in the Condor manual are
> not very clear on PVM, so I'm not sure what (if anything) I'm doing wrong.
> 
> Any help? (I include below the ShadowLog for the submitting machine)
> 
> Thanks a lot,
> Angel de Vicente
> 
> 
> -------------------------------------------------------------
> 
> 
> 12/7 11:03:53 (?.?) (30843):********** Multi_Shadow starting up **********
> 12/7 11:03:53 (?.?) (30843):uid=0, euid=2120, gid=0, egid=20
> 12/7 11:03:53 (?.?) (30843):My_Filesystem_Domain = "iac.es"
> 12/7 11:03:53 (?.?) (30843):My_UID_Domain = "iac.es"
> 12/7 11:03:53 (?.?) (30843):Shadow reading via ASCII
> 12/7 11:03:53 (?.?) (30843):First Line: 457 0 1
> 12/7 11:03:53 (457.0) (30843):Created class:
> 12/7 11:03:53 (457.0) (30843):#0: 0 (1, 2) has 0
> 12/7 11:03:53 (457.0) (30843):New process for proc 0
> 12/7 11:03:53 (457.0) (30843):AllocProc() returning 0
> 12/7 11:03:53 (457.0) (30843):Machine from schedd: <161.72.80.28:32792> <161.72.80.28:32792>#4144361396 0
> 12/7 11:03:53 (457.0) (30843):Machine Line: canistel.iac.es 0
> 12/7 11:03:53 (457.0) (30843):Machines now cur = 1 desire = 2
> 12/7 11:03:53 (457.0) (30843):Updated class:
> 12/7 11:03:53 (457.0) (30843):#0: 0 (1, 2) has 1
> 12/7 11:03:53 (457.0) (30843):Starting pvmd: /usr/pkg/condor/condor/sbin/condor_pvmd -d0x11c
> 12/7 11:03:54 (457.0) (30843):PVM is pid 30852
> 12/7 11:03:54 (457.0) (30843):pvmd response: /tmp/fileSdOmBH
> 12/7 11:03:54 (457.0) (30843):PVMSOCK=/tmp/fileSdOmBH
> 12/7 11:03:54 (457.0) (30843):pvm_fd = 4, mytid = t40001
> 12/7 11:03:54 (457.0) (30843):Entered StartWaitingHosts()
> 12/7 11:03:54 (457.0) (30843):Ok to start waiting hosts
> 12/7 11:03:55 (457.0) (30843):Shadow reading via ASCII
> 12/7 11:03:55 (457.0) (30843):First Line: 457 0 1
> 12/7 11:03:55 (457.0) (30843):Machine from schedd: <161.72.81.36:32792> <161.72.81.36:32792>#2835371380 0
> 12/7 11:03:55 (457.0) (30843):Machine Line: jilguero.iac.es 0
> 12/7 11:03:55 (457.0) (30843):Machines now cur = 2 desire = 2
> 12/7 11:03:55 (457.0) (30843):Updated class:
> 12/7 11:03:55 (457.0) (30843):#0: 0 (1, 2) has 2
> 12/7 11:03:55 (457.0) (30843):Entered StartWaitingHosts()
> 12/7 11:03:55 (457.0) (30843):Can't start new machines now{ canistel.iac.es}
> 12/7 11:03:55 (457.0) (30843):PVMd message is SM_STHOST from t80000000
> 12/7 11:03:55 (457.0) (30843):SM_STHOST: 80000 "" "161.72.80.28" "$PVM_ROOT/lib/pvmd -s -d0x11c -ncanistel.iac.es 1 a14851bb:88fc 4080 2 a148501c:0000"
> 12/7 11:03:55 (457.0) (30843):New process for proc 0
> 12/7 11:03:55 (457.0) (30843):AllocProc() returning 0
> 12/7 11:03:55 (457.0) (30843):Shadow: Entering multi_send_job(canistel.iac.es)
> 12/7 11:03:56 (457.0) (30843):Requesting Alternate Starter 1
> 12/7 11:03:56 (457.0) (30843):Shadow: Request to run a job was REFUSED
> 12/7 11:03:56 (457.0) (30843):RemoveHost: Sending HostDelete notify on t80080000
> 12/7 11:03:56 (457.0) (30843):SendNotification(kind = 2, tid = t80080000)
> 12/7 11:03:56 (457.0) (30843):signal_startd( canistel.iac.es, 443 )
> 12/7 11:03:56 (457.0) (30843):Adding host canistel.iac.es to STARTACK msg.
> 12/7 11:03:56 (457.0) (30843):Num Hosts to pack = 1
> 12/7 11:03:56 (457.0) (30843):Packing tid t80000 with reply PvmNoHost
> 12/7 11:03:56 (457.0) (30843):Sending SM_STHOSTACK to PVMd
> 12/7 11:03:56 (457.0) (30843):Entered StartWaitingHosts()
> 12/7 11:03:56 (457.0) (30843):Can't start new machines now{ canistel.iac.es}
> 12/7 11:03:57 (457.0) (30843):Updated class:
> 12/7 11:03:57 (457.0) (30843):#0: 0 (1, 2) has 1
> 12/7 11:03:57 (457.0) (30843):PVMd message is SM_ADDACK from t80000000
> 12/7 11:03:57 (457.0) (30843):pvmd reports error -6 on SM_ADDACK: PvmNoHost
> 12/7 11:03:57 (457.0) (30843):pvm_machines_starting = 0(should be 0)
> 12/7 11:03:57 (457.0) (30843):StartLocalProcess: = /home/angelv/SCRIPTS/CONDOR/PVM/master1 < /dev/null > out.dat >& err.dat
> 12/7 11:03:57 (457.0) (30843):Not enough machines in class 0 to start local proc.
> 12/7 11:03:57 (457.0) (30843):Entered StartWaitingHosts()
> 12/7 11:03:57 (457.0) (30843):Ok to start waiting hosts
> 12/7 11:03:57 (457.0) (30843):PVMd message is SM_STHOST from t80000000
> 12/7 11:03:57 (457.0) (30843):SM_STHOST: c0000 "" "161.72.81.36" "$PVM_ROOT/lib/pvmd -s -d0x11c -njilguero.iac.es 1 a14851bb:88fc 4080 3 a1485124:0000"
> 12/7 11:03:57 (457.0) (30843):New process for proc 0
> 12/7 11:03:57 (457.0) (30843):AllocProc() returning 0
> 12/7 11:03:57 (457.0) (30843):Shadow: Entering multi_send_job(jilguero.iac.es)
> 12/7 11:03:57 (457.0) (30843):Requesting Alternate Starter 1
> 12/7 11:03:57 (457.0) (30843):Shadow: Request to run a job was REFUSED
> 12/7 11:03:57 (457.0) (30843):RemoveHost: Sending HostDelete notify on t800c0000
> 12/7 11:03:57 (457.0) (30843):SendNotification(kind = 2, tid = t800c0000)
> 12/7 11:03:57 (457.0) (30843):signal_startd( jilguero.iac.es, 443 )
> 12/7 11:03:57 (457.0) (30843):Adding host jilguero.iac.es to STARTACK msg.
> 12/7 11:03:57 (457.0) (30843):Num Hosts to pack = 1
> 12/7 11:03:57 (457.0) (30843):Packing tid tc0000 with reply PvmNoHost
> 12/7 11:03:57 (457.0) (30843):Sending SM_STHOSTACK to PVMd
> 12/7 11:03:57 (457.0) (30843):Entered StartWaitingHosts()
> 12/7 11:03:57 (457.0) (30843):Can't start new machines now{ jilguero.iac.es}
> 12/7 11:03:57 (457.0) (30843):Updated class:
> 12/7 11:03:57 (457.0) (30843):#0: 0 (1, 2) has 0
> 12/7 11:03:57 (457.0) (30843):PVMd message is SM_ADDACK from t80000000
> 12/7 11:03:57 (457.0) (30843):pvmd reports error -6 on SM_ADDACK: PvmNoHost
> 12/7 11:03:57 (457.0) (30843):pvm_machines_starting = 0(should be 0)
> 12/7 11:03:57 (457.0) (30843):StartLocalProcess: = /home/angelv/SCRIPTS/CONDOR/PVM/master1 < /dev/null > out.dat >& err.dat
> 12/7 11:03:57 (457.0) (30843):Not enough machines in class 0 to start local proc.
> 12/7 11:03:57 (457.0) (30843):Entered StartWaitingHosts()
> 12/7 11:03:57 (457.0) (30843):Ok to start waiting hosts
> 12/7 11:07:52 (457.0) (30843):Got SIGUSR1
> 12/7 11:07:52 (457.0) (30843):Multi_Shadow: Shutting down...
> 12/7 11:07:53 (457.0) (30843):Updated class:
> 12/7 11:07:53 (457.0) (30843):#0: 0 (1, 2) has 0
> 12/7 11:07:53 (457.0) (30843):in new_timer()
> 12/7 11:07:53 (457.0) (30843):Timer List
> 12/7 11:07:53 (457.0) (30843):^^^^^ ^^^^
> 12/7 11:07:53 (457.0) (30843):id = 0, when = 300
> 12/7 11:07:53 (457.0) (30843):deadpid = 30852
> 12/7 11:07:53 (457.0) (30843):Lost local pvmd termsig = 9, retcode = 0
> 12/7 11:07:53 (457.0) (30843):deadpid = -1
> 12/7 11:07:53 (457.0) (30843):No more dead processes(errno = 10)
> 12/7 11:07:53 (457.0) (30843):MultiShadow Exiting!!!
> 12/7 11:07:53 (457.0) (30843):********** Shadow Parent Exiting(4) **********
> 
-- 
------------------------------------------------------------
Jeff Linderoth                               O: 610-758-4879
Asst. Professor                              
Industrial and Systems Engineering           jtl3@xxxxxxxxxx
Lehigh University                       www.lehigh.edu/~jtl3