
Re: [Condor-users] Help with Condor-PVM



Thanks Jeff,

jilguero (and all the other machines should be the same) seems to have PVM
installed, as you can see below. The funny thing is that I was planning to install
the condor-pvm contrib module, but I didn't, since it seems to be there already,
despite the instructions in the manual, which say that it shouldn't be.

What is not clear to me is whether PVM itself has to be installed, i.e. does
Condor-PVM use my installed version of PVM, or does Condor-PVM replace PVM
entirely, so that I don't need to have it installed? In any case, PVM 3.4.2 seems
to be installed correctly in my home directory.

I just don't know what messages like
> > 12/7 11:03:57 (457.0) (30843):Can't start new machines now{ jilguero.iac.es}

really mean. Does it mean that in principle it should be able to start a new
machine but for some reason cannot do so right NOW, or does it simply mean that
it will never be able to start a machine there?
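
In case it helps anyone looking at the same kind of log, here is a minimal
Python sketch that tallies, per host, how often the ShadowLog quoted further
down reports a refused job request or a "Can't start new machines now" message.
It does not explain what the messages mean. The function name and the regular
expressions are my own, written only against the lines in this log, and
attributing each REFUSED line to the host of the most recent multi_send_job()
line is an assumption, not documented Condor behaviour.

```python
import re
from collections import Counter

# Matches lines such as:
#   12/7 11:03:56 (457.0) (30843):Shadow: Request to run a job was REFUSED
# search() is used so that a leading mail quote prefix (" > > ") is tolerated.
LINE_RE = re.compile(r"(\d+/\d+ \d+:\d+:\d+) \(([^)]*)\) \((\d+)\):(.*)$")
SEND_RE = re.compile(r"Entering multi_send_job\(([^)]+)\)")
WAIT_RE = re.compile(r"Can't start new machines now\{\s*([^}\s]+)\s*\}")


def summarize(lines):
    """Return (refused, deferred) Counters keyed by host name."""
    refused = Counter()
    deferred = Counter()
    current_host = None  # host named by the latest multi_send_job() line
    for raw in lines:
        match = LINE_RE.search(raw)
        if not match:
            continue  # not a log line (blank line, separator, ...)
        message = match.group(4)

        sent = SEND_RE.search(message)
        if sent:
            current_host = sent.group(1)
            continue

        # Assumption: a REFUSED line belongs to the most recent send attempt.
        if "Request to run a job was REFUSED" in message and current_host:
            refused[current_host] += 1
            continue

        waiting = WAIT_RE.search(message)
        if waiting:
            deferred[waiting.group(1)] += 1
    return refused, deferred
```

Usage would be something like `refused, deferred = summarize(open(path))`,
where `path` points at the ShadowLog on the submitting machine.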

Thanks,
Angel de Vicente

[angelv@guinda ~]$ condor_status -long jilguero.iac.es
MyType = "Machine"
TargetType = "Job"
Name = "jilguero.iac.es"
Machine = "jilguero.iac.es"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "afrodita"
CondorVersion = "$CondorVersion: 6.6.7 Oct 11 2004 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
[...]
HasJava = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasJava,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
[...]
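
To run the same check on every machine without reading the output by eye, a
small Python sketch along these lines might do. It parses the
"Attribute = value" lines printed by condor_status -long (as shown above) and
looks for HasPVM in StarterAbilityList. The helper names are hypothetical, and
the parsing is a plain text split that assumes one attribute per line, not a
real ClassAd parser.

```python
import subprocess


def parse_classad_text(text):
    """Turn 'Name = value' lines into a dict of raw (unparsed) strings."""
    attrs = {}
    for line in text.splitlines():
        if " = " not in line:
            continue  # skips blank lines and elisions such as "[...]"
        name, _, value = line.partition(" = ")
        attrs[name.strip()] = value.strip()
    return attrs


def starter_abilities(attrs):
    """Split the quoted, comma-separated StarterAbilityList value."""
    raw = attrs.get("StarterAbilityList", "").strip('"')
    return [ability for ability in raw.split(",") if ability]


def has_pvm(attrs):
    """True if StarterAbilityList mentions HasPVM."""
    return "HasPVM" in starter_abilities(attrs)


def query_machine(host):
    """Run 'condor_status -long <host>' and parse its output.

    Assumes condor_status is on the PATH of the machine running this.
    """
    result = subprocess.run(
        ["condor_status", "-long", host],
        capture_output=True, text=True, check=True,
    )
    return parse_classad_text(result.stdout)
```

For example, `has_pvm(query_machine("jilguero.iac.es"))` should be true for the
output pasted above, since HasPVM appears in its StarterAbilityList.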




Jeff Linderoth writes:
 > Hi Angel,
 > 
 > I have used Condor-PVM with 6.6.6 and also with 6.7.2 (for the most
 > part)  successfully.  I usually use Condor-PVM through the MW tool.
 > 
 > What does
 > condor_status -long tell you about the StarterAbilityList on the
 > machines in your pool?
 > (Like jilguero.iac.es)
 > 
 > If StarterAbilityList doesn't contain HasPVM, you will likely have
 > problems...
 > 
 > If you have installed the condor-pvm contrib module (the
 > condor_starter.pvm), etc, you may need to restart the daemons (or at
 > least reconfig the daemons) to get the HasPVM attribute in the
 > StarterAbilityList...
 > 
 > I hope this helps.  Maybe someone on the Condor Team can shed more
 > light...
 > 
 > Best of luck.
 > 
 > Cheers,
 > -Jeff
 > 
 > 
 > 
 > 
 > On Thu, 2004-12-09 at 14:37 +0000, Angel de Vicente wrote:
 > > Hi,
 > > 
 > > I sent an e-mail last week explaining my Condor-PVM problems. Since then, I have
 > > tried to recompile a sample code with PVM version 3.4.2, which is supposed to be
 > > the one Condor-PVM supports, but I get exactly the same problems.
 > > 
 > > Basically I would like to know if there is anyone out there with Condor 6.6.7 on
 > > Linux running Condor-PVM successfully. The instructions in the Condor manual are
 > > not very clear on PVM, so I'm not sure what (if anything) I'm doing wrong.
 > > 
 > > Any help? (I include below the ShadowLog for the submitting machine)
 > > 
 > > Thanks a lot,
 > > Angel de Vicente
 > > 
 > > 
 > > -------------------------------------------------------------
 > > 
 > > 
 > > 12/7 11:03:53 (?.?) (30843):********** Multi_Shadow starting up **********
 > > 12/7 11:03:53 (?.?) (30843):uid=0, euid=2120, gid=0, egid=20
 > > 12/7 11:03:53 (?.?) (30843):My_Filesystem_Domain = "iac.es"
 > > 12/7 11:03:53 (?.?) (30843):My_UID_Domain = "iac.es"
 > > 12/7 11:03:53 (?.?) (30843):Shadow reading via ASCII
 > > 12/7 11:03:53 (?.?) (30843):First Line: 457 0 1
 > > 12/7 11:03:53 (457.0) (30843):Created class:
 > > 12/7 11:03:53 (457.0) (30843):#0: 0 (1, 2) has 0
 > > 12/7 11:03:53 (457.0) (30843):New process for proc 0
 > > 12/7 11:03:53 (457.0) (30843):AllocProc() returning 0
 > > 12/7 11:03:53 (457.0) (30843):Machine from schedd: <161.72.80.28:32792> <161.72.80.28:32792>#4144361396 0
 > > 12/7 11:03:53 (457.0) (30843):Machine Line: canistel.iac.es 0
 > > 12/7 11:03:53 (457.0) (30843):Machines now cur = 1 desire = 2
 > > 12/7 11:03:53 (457.0) (30843):Updated class:
 > > 12/7 11:03:53 (457.0) (30843):#0: 0 (1, 2) has 1
 > > 12/7 11:03:53 (457.0) (30843):Starting pvmd: /usr/pkg/condor/condor/sbin/condor_pvmd -d0x11c
 > > 12/7 11:03:54 (457.0) (30843):PVM is pid 30852
 > > 12/7 11:03:54 (457.0) (30843):pvmd response: /tmp/fileSdOmBH
 > > 12/7 11:03:54 (457.0) (30843):PVMSOCK=/tmp/fileSdOmBH
 > > 12/7 11:03:54 (457.0) (30843):pvm_fd = 4, mytid = t40001
 > > 12/7 11:03:54 (457.0) (30843):Entered StartWaitingHosts()
 > > 12/7 11:03:54 (457.0) (30843):Ok to start waiting hosts
 > > 12/7 11:03:55 (457.0) (30843):Shadow reading via ASCII
 > > 12/7 11:03:55 (457.0) (30843):First Line: 457 0 1
 > > 12/7 11:03:55 (457.0) (30843):Machine from schedd: <161.72.81.36:32792> <161.72.81.36:32792>#2835371380 0
 > > 12/7 11:03:55 (457.0) (30843):Machine Line: jilguero.iac.es 0
 > > 12/7 11:03:55 (457.0) (30843):Machines now cur = 2 desire = 2
 > > 12/7 11:03:55 (457.0) (30843):Updated class:
 > > 12/7 11:03:55 (457.0) (30843):#0: 0 (1, 2) has 2
 > > 12/7 11:03:55 (457.0) (30843):Entered StartWaitingHosts()
 > > 12/7 11:03:55 (457.0) (30843):Can't start new machines now{ canistel.iac.es}
 > > 12/7 11:03:55 (457.0) (30843):PVMd message is SM_STHOST from t80000000
 > > 12/7 11:03:55 (457.0) (30843):SM_STHOST: 80000 "" "161.72.80.28" "$PVM_ROOT/lib/pvmd -s -d0x11c -ncanistel.iac.es 1 a14851bb:88fc 4080 2 a148501c:0000"
 > > 12/7 11:03:55 (457.0) (30843):New process for proc 0
 > > 12/7 11:03:55 (457.0) (30843):AllocProc() returning 0
 > > 12/7 11:03:55 (457.0) (30843):Shadow: Entering multi_send_job(canistel.iac.es)
 > > 12/7 11:03:56 (457.0) (30843):Requesting Alternate Starter 1
 > > 12/7 11:03:56 (457.0) (30843):Shadow: Request to run a job was REFUSED
 > > 12/7 11:03:56 (457.0) (30843):RemoveHost: Sending HostDelete notify on t80080000
 > > 12/7 11:03:56 (457.0) (30843):SendNotification(kind = 2, tid = t80080000)
 > > 12/7 11:03:56 (457.0) (30843):signal_startd( canistel.iac.es, 443 )
 > > 12/7 11:03:56 (457.0) (30843):Adding host canistel.iac.es to STARTACK msg.
 > > 12/7 11:03:56 (457.0) (30843):Num Hosts to pack = 1
 > > 12/7 11:03:56 (457.0) (30843):Packing tid t80000 with reply PvmNoHost
 > > 12/7 11:03:56 (457.0) (30843):Sending SM_STHOSTACK to PVMd
 > > 12/7 11:03:56 (457.0) (30843):Entered StartWaitingHosts()
 > > 12/7 11:03:56 (457.0) (30843):Can't start new machines now{ canistel.iac.es}
 > > 12/7 11:03:57 (457.0) (30843):Updated class:
 > > 12/7 11:03:57 (457.0) (30843):#0: 0 (1, 2) has 1
 > > 12/7 11:03:57 (457.0) (30843):PVMd message is SM_ADDACK from t80000000
 > > 12/7 11:03:57 (457.0) (30843):pvmd reports error -6 on SM_ADDACK: PvmNoHost
 > > 12/7 11:03:57 (457.0) (30843):pvm_machines_starting = 0(should be 0)
 > > 12/7 11:03:57 (457.0) (30843):StartLocalProcess: = /home/angelv/SCRIPTS/CONDOR/PVM/master1 < /dev/null > out.dat >& err.dat
 > > 12/7 11:03:57 (457.0) (30843):Not enough machines in class 0 to start local proc.
 > > 12/7 11:03:57 (457.0) (30843):Entered StartWaitingHosts()
 > > 12/7 11:03:57 (457.0) (30843):Ok to start waiting hosts
 > > 12/7 11:03:57 (457.0) (30843):PVMd message is SM_STHOST from t80000000
 > > 12/7 11:03:57 (457.0) (30843):SM_STHOST: c0000 "" "161.72.81.36" "$PVM_ROOT/lib/pvmd -s -d0x11c -njilguero.iac.es 1 a14851bb:88fc 4080 3 a1485124:0000"
 > > 12/7 11:03:57 (457.0) (30843):New process for proc 0
 > > 12/7 11:03:57 (457.0) (30843):AllocProc() returning 0
 > > 12/7 11:03:57 (457.0) (30843):Shadow: Entering multi_send_job(jilguero.iac.es)
 > > 12/7 11:03:57 (457.0) (30843):Requesting Alternate Starter 1
 > > 12/7 11:03:57 (457.0) (30843):Shadow: Request to run a job was REFUSED
 > > 12/7 11:03:57 (457.0) (30843):RemoveHost: Sending HostDelete notify on t800c0000
 > > 12/7 11:03:57 (457.0) (30843):SendNotification(kind = 2, tid = t800c0000)
 > > 12/7 11:03:57 (457.0) (30843):signal_startd( jilguero.iac.es, 443 )
 > > 12/7 11:03:57 (457.0) (30843):Adding host jilguero.iac.es to STARTACK msg.
 > > 12/7 11:03:57 (457.0) (30843):Num Hosts to pack = 1
 > > 12/7 11:03:57 (457.0) (30843):Packing tid tc0000 with reply PvmNoHost
 > > 12/7 11:03:57 (457.0) (30843):Sending SM_STHOSTACK to PVMd
 > > 12/7 11:03:57 (457.0) (30843):Entered StartWaitingHosts()
 > > 12/7 11:03:57 (457.0) (30843):Can't start new machines now{ jilguero.iac.es}
 > > 12/7 11:03:57 (457.0) (30843):Updated class:
 > > 12/7 11:03:57 (457.0) (30843):#0: 0 (1, 2) has 0
 > > 12/7 11:03:57 (457.0) (30843):PVMd message is SM_ADDACK from t80000000
 > > 12/7 11:03:57 (457.0) (30843):pvmd reports error -6 on SM_ADDACK: PvmNoHost
 > > 12/7 11:03:57 (457.0) (30843):pvm_machines_starting = 0(should be 0)
 > > 12/7 11:03:57 (457.0) (30843):StartLocalProcess: = /home/angelv/SCRIPTS/CONDOR/PVM/master1 < /dev/null > out.dat >& err.dat
 > > 12/7 11:03:57 (457.0) (30843):Not enough machines in class 0 to start local proc.
 > > 12/7 11:03:57 (457.0) (30843):Entered StartWaitingHosts()
 > > 12/7 11:03:57 (457.0) (30843):Ok to start waiting hosts
 > > 12/7 11:07:52 (457.0) (30843):Got SIGUSR1
 > > 12/7 11:07:52 (457.0) (30843):Multi_Shadow: Shutting down...
 > > 12/7 11:07:53 (457.0) (30843):Updated class:
 > > 12/7 11:07:53 (457.0) (30843):#0: 0 (1, 2) has 0
 > > 12/7 11:07:53 (457.0) (30843):in new_timer()
 > > 12/7 11:07:53 (457.0) (30843):Timer List
 > > 12/7 11:07:53 (457.0) (30843):^^^^^ ^^^^
 > > 12/7 11:07:53 (457.0) (30843):id = 0, when = 300
 > > 12/7 11:07:53 (457.0) (30843):deadpid = 30852
 > > 12/7 11:07:53 (457.0) (30843):Lost local pvmd termsig = 9, retcode = 0
 > > 12/7 11:07:53 (457.0) (30843):deadpid = -1
 > > 12/7 11:07:53 (457.0) (30843):No more dead processes(errno = 10)
 > > 12/7 11:07:53 (457.0) (30843):MultiShadow Exiting!!!
 > > 12/7 11:07:53 (457.0) (30843):********** Shadow Parent Exiting(4) **********
 > > 
 > -- 
 > ------------------------------------------------------------
 > Jeff Linderoth                               O: 610-758-4879
 > Asst. Professor                              
 > Industrial and Systems Engineering           jtl3@xxxxxxxxxx
 > Lehigh University                       www.lehigh.edu/~jtl3          
 > 
 > _______________________________________________
 > Condor-users mailing list
 > Condor-users@xxxxxxxxxxx
 > http://lists.cs.wisc.edu/mailman/listinfo/condor-users

-- 
----------------------------------
http://www.iac.es/galeria/angelv/

PostDoc Software Support
Instituto de Astrofisica de Canarias