
[Condor-users] Problems running a test PVM job with Condor



Hi,

I'm starting to play with PVM and Condor-PVM, but no success yet.

I am running Condor 6.6.7 on about 200 machines with no problems (around 100 of
them run Linux, and I am submitting the job from one of those).

I downloaded the latest version of PVM (3.4.4) today, compiled it, and
successfully compiled and ran one of the example programs that come with the
PVM distribution: master1/slave1.

But when I try the same program with Condor-PVM, it does not work. Does anybody
with Condor-PVM experience know what could be happening? I include all the
details below.

Thanks a lot,
Angel de Vicente

P.S. By the way, is there an MW mailing list? The information at
http://www.cs.wisc.edu/condor/mw/ does not seem to be up to date.

--------------------------------------

With the pvm console, the master1 program works OK:

[angelv@guinda PVM]$ pvm
pvm> add filomena
add filomena
1 successful
                    HOST     DTID
                filomena    80000
pvm> conf
conf
2 hosts, 1 data format
                    HOST     DTID     ARCH   SPEED       DSIG
                  guinda    40000    LINUX    1000 0x00408841
                filomena    80000    LINUX    1000 0x00408841
pvm> spawn -> master1
spawn -> master1
[1]
1 successful
t80001
pvm> [1:t40003] EOF
[1:t40004] EOF
[1:t40002] EOF
[1:t80003] EOF
[1:t80004] EOF
[1:t80002] EOF
[1:t80001] Spawning 6 worker tasks ... SUCCESSFUL
[1:t80001] I got 700.000000 from 4; (expecting 700.000000)
[1:t80001] I got 900.000000 from 5; (expecting 900.000000)
[1:t80001] I got 500.000000 from 3; (expecting 500.000000)
[1:t80001] I got 100.000000 from 1; (expecting 100.000000)
[1:t80001] I got 300.000000 from 2; (expecting 300.000000)
[1:t80001] I got 500.000000 from 0; (expecting 500.000000)
[1:t80001] EOF
[1] finished


pvm>


Since the Condor documentation (section 2.9.2) says that PVM and Condor-PVM are
binary compatible, I tried to run the same master1/slave1 program under Condor.

My submit file is:
-----------------

universe = PVM

executable = master1

output = out.dat
error = err.dat
log = pvm.log

Requirements = (Arch == "INTEL") && (OpSys == "LINUX")

machine_count = 1..2
queue


I submit it to the queue and condor_q reports it as running, but it sits there
for a long time and nothing is written to the output or error files.

[angelv@guinda PVM]$ condor_q


-- Submitter: guinda.iac.es : <161.72.81.187:30045> : guinda.iac.es
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 457.0   angelv         12/3  17:05   0+00:01:43 R  0   0.1  master1

1 jobs; 0 idle, 1 running, 0 held
[angelv@guinda PVM]$


It seems that something started OK:


[angelv@guinda PVM]$ ps -aux | grep pvm
angelv   23873  0.0  0.5  7964 2720 ?        S    17:06   0:00 condor_shadow.pvm <161.72.81.187:30064>
angelv   23874  0.0  0.1  1704  728 ?        S    17:06   0:00 /usr/pkg/condor/condor/sbin/condor_pvmd -d0x11c


The ShadowLog on guinda shows the shadow trying to start pvmds on the other
machines, but I do not know why they fail. Below are the last lines of the
ShadowLog on guinda. Everything looks fine at first, but then there is a line
like

12/3 17:05:16 (457.0) (23809):Can't start new machines now{ filomena.iac.es}

Any ideas what could be wrong?

12/3 17:05:16 (?.?) (23809):********** Multi_Shadow starting up **********
12/3 17:05:16 (?.?) (23809):uid=0, euid=2120, gid=0, egid=20
12/3 17:05:16 (?.?) (23809):My_Filesystem_Domain = "iac.es"
12/3 17:05:16 (?.?) (23809):My_UID_Domain = "iac.es"
12/3 17:05:16 (?.?) (23809):Shadow reading via ASCII
12/3 17:05:16 (?.?) (23809):First Line: 457 0 1
12/3 17:05:16 (457.0) (23809):Created class:
12/3 17:05:16 (457.0) (23809):#0: 0 (1, 2) has 0
12/3 17:05:16 (457.0) (23809):New process for proc 0
12/3 17:05:16 (457.0) (23809):AllocProc() returning 0
12/3 17:05:16 (457.0) (23809):Machine from schedd: <161.72.80.41:46440> <161.72.80.41:46440>#1825676418 0
12/3 17:05:16 (457.0) (23809):Machine Line: filomena.iac.es 0
12/3 17:05:16 (457.0) (23809):Machines now cur = 1 desire = 2
12/3 17:05:16 (457.0) (23809):Updated class:
12/3 17:05:16 (457.0) (23809):#0: 0 (1, 2) has 1
12/3 17:05:16 (457.0) (23809):Starting pvmd: /usr/pkg/condor/condor/sbin/condor_pvmd -d0x11c
12/3 17:05:16 (457.0) (23809):PVM is pid 23810
12/3 17:05:16 (457.0) (23809):pvmd response: /tmp/fileMQop00
12/3 17:05:16 (457.0) (23809):PVMSOCK=/tmp/fileMQop00
12/3 17:05:16 (457.0) (23809):pvm_fd = 4, mytid = t40001
12/3 17:05:16 (457.0) (23809):Entered StartWaitingHosts()
12/3 17:05:16 (457.0) (23809):Ok to start waiting hosts
12/3 17:05:16 (457.0) (23809):PVMd message is SM_STHOST from t80000000
12/3 17:05:16 (457.0) (23809):SM_STHOST: 80000 "" "161.72.80.41" "$PVM_ROOT/lib/pvmd -s -d0x11c -nfilomena.iac.es 1 a14851bb:8043 4080 2 a1485029:0000"
12/3 17:05:16 (457.0) (23809):New process for proc 0
12/3 17:05:16 (457.0) (23809):AllocProc() returning 0
12/3 17:05:16 (457.0) (23809):Shadow: Entering multi_send_job(filomena.iac.es)
12/3 17:05:16 (457.0) (23809):Requesting Alternate Starter 1
12/3 17:05:16 (457.0) (23809):Shadow: Request to run a job was REFUSED
12/3 17:05:16 (457.0) (23809):RemoveHost: Sending HostDelete notify on t80080000
12/3 17:05:16 (457.0) (23809):SendNotification(kind = 2, tid = t80080000)
12/3 17:05:16 (457.0) (23809):signal_startd( filomena.iac.es, 443 )
12/3 17:05:16 (457.0) (23809):Adding host filomena.iac.es to STARTACK msg.
12/3 17:05:16 (457.0) (23809):Num Hosts to pack = 1
12/3 17:05:16 (457.0) (23809):Packing tid t80000 with reply PvmNoHost
12/3 17:05:16 (457.0) (23809):Sending SM_STHOSTACK to PVMd
12/3 17:05:16 (457.0) (23809):Entered StartWaitingHosts()
12/3 17:05:16 (457.0) (23809):Can't start new machines now{ filomena.iac.es}
12/3 17:05:16 (457.0) (23809):Updated class:
12/3 17:05:16 (457.0) (23809):#0: 0 (1, 2) has 0
12/3 17:05:16 (457.0) (23809):PVMd message is SM_ADDACK from t80000000
12/3 17:05:16 (457.0) (23809):pvmd reports error -6 on SM_ADDACK: PvmNoHost
12/3 17:05:16 (457.0) (23809):pvm_machines_starting = 0(should be 0)
12/3 17:05:16 (457.0) (23809):StartLocalProcess: = /home/angelv/SCRIPTS/CONDOR/PVM/master1 < /dev/null > out.dat >& err.dat
12/3 17:05:16 (457.0) (23809):Not enough machines in class 0 to start local proc.
12/3 17:05:16 (457.0) (23809):Entered StartWaitingHosts()
12/3 17:05:16 (457.0) (23809):Ok to start waiting hosts
12/3 17:06:13 (457.0) (23809):Shadow reading via ASCII
12/3 17:06:13 (457.0) (23809):First Line: 457 0 1
12/3 17:06:13 (457.0) (23809):Machine from schedd: <161.72.81.147:32804> <161.72.81.147:32804>#1709993205 0
12/3 17:06:13 (457.0) (23809):Machine Line: calendula.iac.es 0
12/3 17:06:13 (457.0) (23809):Machines now cur = 1 desire = 2
12/3 17:06:13 (457.0) (23809):Updated class:
12/3 17:06:13 (457.0) (23809):#0: 0 (1, 2) has 1
12/3 17:06:13 (457.0) (23809):Entered StartWaitingHosts()
12/3 17:06:13 (457.0) (23809):Ok to start waiting hosts
12/3 17:06:13 (457.0) (23809):PVMd message is SM_STHOST from t80000000
12/3 17:06:13 (457.0) (23809):SM_STHOST: c0000 "" "161.72.81.147" "$PVM_ROOT/lib/pvmd -s -d0x11c -ncalendula.iac.es 1 a14851bb:8043 4080 3 a1485193:0000"
12/3 17:06:13 (457.0) (23809):New process for proc 0
12/3 17:06:13 (457.0) (23809):AllocProc() returning 0
12/3 17:06:13 (457.0) (23809):Shadow: Entering multi_send_job(calendula.iac.es)
12/3 17:06:13 (457.0) (23809):Requesting Alternate Starter 1
12/3 17:06:13 (457.0) (23809):Shadow: Request to run a job was ACCEPTED
12/3 17:06:13 (457.0) (23809):Shadow: RSC_SOCK connected, fd = 6
12/3 17:06:13 (457.0) (23809):Multi_Shadow: CLIENT_LOG connected, fd = 7
12/3 17:06:13 (457.0) (23809):in new_timer()
12/3 17:06:13 (457.0) (23809):Timer List
12/3 17:06:13 (457.0) (23809):^^^^^ ^^^^
12/3 17:06:13 (457.0) (23809):id = 0, when = 180
12/3 17:06:14 (457.0) (23809):Shadow: send_pvm_job_info
12/3 17:06:14 (457.0) (23809):send_pvm_job_info: arg =  -s -d0x11c -ncalendula.iac.es 1 a14851bb:8043 4080 3 a1485193:0000 -f
12/3 17:06:14 (457.0) (23809):On LogSock for host calendula.iac.es:
-> [pvmd pid20947] 
12/3 17:06:14 (457.0) (23809):On LogSock for host calendula.iac.es:
-> 12/03 17:06:38 version 3.4.2
-> [pvmd pid20947] 12/03 17:06:38 ddpro 2316 tdpro 1318
-> [pvmd pid20947] 12/03 17:06:38 main() debug mask is 0x11c (tsk,slv,hst,sch)
12/3 17:06:14 (457.0) (23809):In cancel_timer()
12/3 17:06:14 (457.0) (23809):Timer List
12/3 17:06:14 (457.0) (23809):^^^^^ ^^^^
12/3 17:06:14 (457.0) (23809):Received PVM info from calendula.iac.es
12/3 17:06:14 (457.0) (23809):Adding host calendula.iac.es to STARTACK msg.
12/3 17:06:14 (457.0) (23809):Num Hosts to pack = 1
12/3 17:06:14 (457.0) (23809):Packing tid tc0000 with reply ddpro<2316> arch<LINUX> ip<a1485193:80d5> mtu<4080> dsig<4229185>
12/3 17:06:14 (457.0) (23809):Sending SM_STHOSTACK to PVMd
12/3 17:06:14 (457.0) (23809):PVMd message is SM_ADDACK from t80000000
12/3 17:06:14 (457.0) (23809):Host #1(calendula.iac.es) has been added to PVM, pvmd_tid = 800c0000
12/3 17:06:14 (457.0) (23809):SendNotification(kind = 3, tid = t800c0000)
12/3 17:06:14 (457.0) (23809):pvm_machines_starting = 0(should be 0)
12/3 17:06:14 (457.0) (23809):StartLocalProcess: = /home/angelv/SCRIPTS/CONDOR/PVM/master1 < /dev/null > out.dat >& err.dat
12/3 17:06:14 (457.0) (23809):open_max = 1024
12/3 17:06:14 (457.0) (23809):Local PVM process pid = 23872
12/3 17:06:14 (457.0) (23809):Entered StartWaitingHosts()
12/3 17:06:14 (457.0) (23809):Ok to start waiting hosts
12/3 17:06:14 (457.0) (23809):PVMd message is SM_EXECACK from t80000000
12/3 17:06:14 (457.0) (23809):Setting local tid to t40002
12/3 17:06:14 (457.0) (23809):PVMd message is SM_CONFIG from t40002
12/3 17:06:14 (457.0) (23809):PVMd message is SM_SPAWN from t40002
12/3 17:06:14 (457.0) (23809):ERROR "Assertion ERROR on (count == 1)" at line 591 in file pvm_emulation.C
12/3 17:06:14 (457.0) (23809):Multi_Shadow: Shutting down...
12/3 17:06:14 (457.0) (23809):Updated class:
12/3 17:06:14 (457.0) (23809):#0: 0 (1, 2) has 1
12/3 17:06:14 (457.0) (23809):signal_startd( calendula.iac.es, 443 )
12/3 17:06:14 (457.0) (23809):in new_timer()
12/3 17:06:14 (457.0) (23809):Timer List
12/3 17:06:14 (457.0) (23809):^^^^^ ^^^^
12/3 17:06:14 (457.0) (23809):id = 1, when = 300
12/3 17:06:15 (?.?) (23873):********** Multi_Shadow starting up **********
12/3 17:06:15 (?.?) (23873):uid=0, euid=2120, gid=0, egid=20
12/3 17:06:15 (?.?) (23873):My_Filesystem_Domain = "iac.es"
12/3 17:06:15 (?.?) (23873):My_UID_Domain = "iac.es"
12/3 17:06:15 (?.?) (23873):Shadow reading via ASCII
12/3 17:06:15 (?.?) (23873):First Line: 457 0 1
12/3 17:06:15 (457.0) (23873):Created class:
12/3 17:06:15 (457.0) (23873):#0: 0 (1, 2) has 0
12/3 17:06:15 (457.0) (23873):New process for proc 0
12/3 17:06:15 (457.0) (23873):AllocProc() returning 0
12/3 17:06:15 (457.0) (23873):Machine from schedd: <161.72.81.147:32804> <161.72.81.147:32804>#1709993205 0
12/3 17:06:15 (457.0) (23873):Machine Line: calendula.iac.es 0
12/3 17:06:15 (457.0) (23873):Machines now cur = 1 desire = 2
12/3 17:06:15 (457.0) (23873):Updated class:
12/3 17:06:15 (457.0) (23873):#0: 0 (1, 2) has 1
12/3 17:06:15 (457.0) (23873):Starting pvmd: /usr/pkg/condor/condor/sbin/condor_pvmd -d0x11c
12/3 17:06:15 (457.0) (23873):PVM is pid 23874
12/3 17:06:15 (457.0) (23873):pvmd response: /tmp/fileFJ8XV4
12/3 17:06:15 (457.0) (23873):PVMSOCK=/tmp/fileFJ8XV4
12/3 17:06:15 (457.0) (23873):pvm_fd = 4, mytid = t40001
12/3 17:06:15 (457.0) (23873):Entered StartWaitingHosts()
12/3 17:06:15 (457.0) (23873):Ok to start waiting hosts
12/3 17:06:15 (457.0) (23873):PVMd message is SM_STHOST from t80000000
12/3 17:06:15 (457.0) (23873):SM_STHOST: 80000 "" "161.72.81.147" "$PVM_ROOT/lib/pvmd -s -d0x11c -ncalendula.iac.es 1 a14851bb:8045 4080 2 a1485193:0000"
12/3 17:06:15 (457.0) (23873):New process for proc 0
12/3 17:06:15 (457.0) (23873):AllocProc() returning 0
12/3 17:06:15 (457.0) (23873):Shadow: Entering multi_send_job(calendula.iac.es)
12/3 17:06:15 (457.0) (23873):Requesting Alternate Starter 1
12/3 17:06:15 (457.0) (23873):Shadow: Request to run a job was REFUSED
12/3 17:06:15 (457.0) (23873):RemoveHost: Sending HostDelete notify on t80080000
12/3 17:06:15 (457.0) (23873):SendNotification(kind = 2, tid = t80080000)
12/3 17:06:15 (457.0) (23873):signal_startd( calendula.iac.es, 443 )
12/3 17:06:15 (457.0) (23873):Adding host calendula.iac.es to STARTACK msg.
12/3 17:06:15 (457.0) (23873):Num Hosts to pack = 1
12/3 17:06:15 (457.0) (23873):Packing tid t80000 with reply PvmNoHost
12/3 17:06:15 (457.0) (23873):Sending SM_STHOSTACK to PVMd
12/3 17:06:15 (457.0) (23873):Entered StartWaitingHosts()
12/3 17:06:15 (457.0) (23873):Can't start new machines now{ calendula.iac.es}
12/3 17:06:15 (457.0) (23873):Updated class:
12/3 17:06:15 (457.0) (23873):#0: 0 (1, 2) has 0
12/3 17:06:15 (457.0) (23873):PVMd message is SM_ADDACK from t80000000
12/3 17:06:15 (457.0) (23873):pvmd reports error -6 on SM_ADDACK: PvmNoHost
12/3 17:06:15 (457.0) (23873):pvm_machines_starting = 0(should be 0)
12/3 17:06:15 (457.0) (23873):StartLocalProcess: = /home/angelv/SCRIPTS/CONDOR/PVM/master1 < /dev/null > out.dat >& err.dat
12/3 17:06:15 (457.0) (23873):Not enough machines in class 0 to start local proc.
12/3 17:06:15 (457.0) (23873):Entered StartWaitingHosts()
12/3 17:06:15 (457.0) (23873):Ok to start waiting hosts
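
Since the log is long, here is a rough sketch of a filter that pulls out only
the lines that look relevant (REFUSED, ACCEPTED, ERROR, PvmNoHost, "Can't start
new machines") and groups them by shadow pid. The line layout in the regular
expression is assumed from the excerpt above, the keyword list is my own guess
at what matters, and the function name key_events is just illustrative; none of
this is a Condor tool.

```python
import re
import sys
from collections import defaultdict

# Assumed line layout, taken from the excerpt above, e.g.:
# 12/3 17:05:16 (457.0) (23809):Shadow: Request to run a job was REFUSED
LINE_RE = re.compile(r"^(\d+/\d+ \d+:\d+:\d+) \((\S+)\) \((\d+)\):(.*)$")

# Hypothetical list of substrings worth flagging; adjust as needed.
KEYWORDS = ("REFUSED", "ACCEPTED", "ERROR", "PvmNoHost",
            "Can't start new machines")


def key_events(lines):
    """Return {shadow_pid: [(timestamp, message), ...]} for flagged lines."""
    events = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            # Continuation lines such as "-> [pvmd pid20947] ..." are skipped.
            continue
        stamp, _job, pid, message = match.groups()
        if any(keyword in message for keyword in KEYWORDS):
            events[int(pid)].append((stamp, message))
    return dict(events)


# Usage: python shadow_events.py /path/to/ShadowLog
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as log_file:
        for pid, items in key_events(log_file).items():
            print(f"shadow pid {pid}")
            for stamp, message in items:
                print(f"  {stamp}  {message}")
```

Run against the excerpt above, this should reduce each shadow's output to the
REFUSED/ACCEPTED decisions, the PvmNoHost replies, and the assertion failure,
which makes the sequence of events per shadow easier to compare.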

-- 
----------------------------------
http://www.iac.es/galeria/angelv/

PostDoc Software Support
Instituto de Astrofisica de Canarias