[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] Problems with PVM example



Has anyone managed to get the PVM example to work in condor-6.7.17?
Other examples I ran with no problems but as with the PVM example, I
could only get the job to be submitted and executed but with empty
output.

I have spent some time analyzing the problems but I am now running out
of ideas. I would very much appreciate any help. Please refer to the
below log files for analysis. Thanks.

regards,
Amril


Here's the log file generated:

000 (021.000.000) 03/10 15:15:39 Job submitted from host:
<128.16.3.12:37923>
...
005 (021.000.000) 03/10 15:15:41 Job terminated.
	(1) Normal termination (return value 110)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	0  -  Run Bytes Sent By Job
	0  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	0  -  Total Bytes Received By Job
...

And here's the StartLog and ShadowLog:

StartLog

3/9 21:23:07 DaemonCore: Command received via TCP from host
<128.16.3.12:53264>
3/9 21:23:07 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
handler (command_activate_claim)
3/9 21:23:07 vm1: Got activate_claim request from shadow
(<128.16.3.12:53264>)
3/9 21:23:07 vm1: Remote job ID is 19.0
3/9 21:23:07 vm1: exec_starter( supaman.cs.ucl.ac.uk, 10, 11 ) : pid
21210
3/9 21:23:07 vm1:
execl(/home/condor/condor-6.7.17/sbin/condor_starter.pvm,
"condor_starter", supaman.cs.ucl.ac.uk, 0)
3/9 21:23:07 vm1: Got universe "PVM" (4) from request classad
3/9 21:23:07 vm1: State change: claim-activation protocol successful
3/9 21:23:07 vm1: Changing activity: Idle -> Busy
3/9 21:23:07 DaemonCore: Command received via TCP from host
<128.16.3.12:55689>
3/9 21:23:07 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
3/9 21:23:07 vm1: State change: received RELEASE_CLAIM command
3/9 21:23:07 vm1: Changing state and activity: Claimed/Busy ->
Preempting/Vacating
3/9 21:23:07 Starter pid 21210 exited with status 0
3/9 21:23:07 vm1: State change: starter exited
3/9 21:23:07 vm1: State change: No preempting claim, returning to owner
3/9 21:23:07 vm1: Changing state and activity: Preempting/Vacating ->
Owner/Idle
3/9 21:23:07 vm1: State change: IS_OWNER is false
3/9 21:23:07 vm1: Changing state: Owner -> Unclaimed
3/9 21:23:07 DaemonCore: Command received via TCP from host
<128.16.3.12:56957>
3/9 21:23:07 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
3/9 21:23:07 Warning: can't find resource with ClaimId
(<128.16.3.12:36420>#1141852229#28)
3/9 21:23:07 condor_write(): Socket closed when trying to write buffer,
fd is 5
3/9 21:23:07 Buf::write(): condor_write() failed
3/9 21:23:07 DaemonCore: Command received via UDP from host
<128.16.3.12:33045>
3/9 21:23:07 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
3/9 21:23:07 Warning: can't find resource with ClaimId
(<128.16.3.12:36420>#1141852229#28)
3/9 22:10:28 DaemonCore: Command received via TCP from host
<128.16.3.12:33462>
3/9 22:10:28 DaemonCore: received command 448 (GIVE_STATE), calling
handler (command_give_state)
3/10 01:11:12 State change: RunBenchmarks is TRUE
3/10 01:11:12 vm2: Changing activity: Idle -> Benchmarking
3/10 01:11:17 State change: benchmarks completed
3/10 01:11:17 vm2: Changing activity: Benchmarking -> Idle
3/10 05:11:17 State change: RunBenchmarks is TRUE
3/10 05:11:17 vm2: Changing activity: Idle -> Benchmarking
3/10 05:11:22 State change: benchmarks completed
3/10 05:11:22 vm2: Changing activity: Benchmarking -> Idle
3/10 09:11:22 State change: RunBenchmarks is TRUE
3/10 09:11:22 vm2: Changing activity: Idle -> Benchmarking
3/10 09:11:26 State change: benchmarks completed
3/10 09:11:26 vm2: Changing activity: Benchmarking -> Idle
3/10 12:48:30 DaemonCore: Command received via UDP from host
<128.16.3.12:33077>
3/10 12:48:30 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
3/10 12:48:30 vm1: match_info called
3/10 12:48:30 vm1: Received match <128.16.3.12:36420>#1141852229#30
3/10 12:48:30 vm1: State change: match notification protocol successful
3/10 12:48:30 vm1: Changing state: Unclaimed -> Matched
3/10 12:48:30 DaemonCore: Command received via TCP from host
<128.16.3.12:36327>
3/10 12:48:30 DaemonCore: received command 442 (REQUEST_CLAIM), calling
handler (command_request_claim)
3/10 12:48:30 vm1: Request accepted.
3/10 12:48:30 vm1: Remote owner is condor@xxxxxxxxxxx
3/10 12:48:30 vm1: State change: claiming protocol successful
3/10 12:48:30 vm1: Changing state: Matched -> Claimed
3/10 12:48:32 DaemonCore: Command received via TCP from host
<128.16.3.12:47860>
3/10 12:48:32 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
handler (command_activate_claim)
3/10 12:48:32 vm1: Got activate_claim request from shadow
(<128.16.3.12:47860>)
3/10 12:48:32 vm1: Remote job ID is 20.0
3/10 12:48:32 vm1: exec_starter( supaman.cs.ucl.ac.uk, 10, 11 ) : pid
29989
3/10 12:48:32 vm1:
execl(/home/condor/condor-6.7.17/sbin/condor_starter.pvm,
"condor_starter", supaman.cs.ucl.ac.uk, 0)
3/10 12:48:32 vm1: Got universe "PVM" (4) from request classad
3/10 12:48:32 vm1: State change: claim-activation protocol successful
3/10 12:48:32 vm1: Changing activity: Idle -> Busy
3/10 12:48:32 DaemonCore: Command received via TCP from host
<128.16.3.12:53932>
3/10 12:48:32 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
3/10 12:48:32 vm1: State change: received RELEASE_CLAIM command
3/10 12:48:32 vm1: Changing state and activity: Claimed/Busy ->
Preempting/Vacating
3/10 12:48:32 Starter pid 29989 exited with status 0
3/10 12:48:32 vm1: State change: starter exited
3/10 12:48:32 vm1: State change: No preempting claim, returning to owner
3/10 12:48:32 vm1: Changing state and activity: Preempting/Vacating ->
Owner/Idle
3/10 12:48:32 vm1: State change: IS_OWNER is false
3/10 12:48:32 vm1: Changing state: Owner -> Unclaimed
3/10 12:48:32 DaemonCore: Command received via TCP from host
<128.16.3.12:60303>
3/10 12:48:32 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
3/10 12:48:32 Warning: can't find resource with ClaimId
(<128.16.3.12:36420>#1141852229#30)
3/10 12:48:32 condor_write(): Socket closed when trying to write buffer,
fd is 5
3/10 12:48:32 Buf::write(): condor_write() failed
3/10 12:48:32 DaemonCore: Command received via UDP from host
<128.16.3.12:33083>
3/10 12:48:32 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
3/10 12:48:32 Warning: can't find resource with ClaimId
(<128.16.3.12:36420>#1141852229#30)
3/10 13:11:27 State change: RunBenchmarks is TRUE
3/10 13:11:27 vm2: Changing activity: Idle -> Benchmarking
3/10 13:11:32 State change: benchmarks completed
3/10 13:11:32 vm2: Changing activity: Benchmarking -> Idle
3/10 15:15:39 DaemonCore: Command received via UDP from host
<128.16.3.12:33109>
3/10 15:15:39 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
3/10 15:15:39 vm1: match_info called
3/10 15:15:39 vm1: Received match <128.16.3.12:36420>#1141852229#32
3/10 15:15:39 vm1: State change: match notification protocol successful
3/10 15:15:39 vm1: Changing state: Unclaimed -> Matched
3/10 15:15:39 DaemonCore: Command received via TCP from host
<128.16.3.12:60434>
3/10 15:15:39 DaemonCore: received command 442 (REQUEST_CLAIM), calling
handler (command_request_claim)
3/10 15:15:39 vm1: Request accepted.
3/10 15:15:39 vm1: Remote owner is condor@xxxxxxxxxxx
3/10 15:15:39 vm1: State change: claiming protocol successful
3/10 15:15:39 vm1: Changing state: Matched -> Claimed
3/10 15:15:41 DaemonCore: Command received via TCP from host
<128.16.3.12:37745>
3/10 15:15:41 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
handler (command_activate_claim)
3/10 15:15:41 vm1: Got activate_claim request from shadow
(<128.16.3.12:37745>)
3/10 15:15:41 vm1: Remote job ID is 21.0
3/10 15:15:41 vm1: exec_starter( supaman.cs.ucl.ac.uk, 10, 11 ) : pid
31572
3/10 15:15:41 vm1:
execl(/home/condor/condor-6.7.17/sbin/condor_starter.pvm,
"condor_starter", supaman.cs.ucl.ac.uk, 0)
3/10 15:15:41 vm1: Got universe "PVM" (4) from request classad
3/10 15:15:41 vm1: State change: claim-activation protocol successful
3/10 15:15:41 vm1: Changing activity: Idle -> Busy
3/10 15:15:41 DaemonCore: Command received via TCP from host
<128.16.3.12:36707>
3/10 15:15:41 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
3/10 15:15:41 vm1: State change: received RELEASE_CLAIM command
3/10 15:15:41 vm1: Changing state and activity: Claimed/Busy ->
Preempting/Vacating
3/10 15:15:41 Starter pid 31572 exited with status 0
3/10 15:15:41 vm1: State change: starter exited
3/10 15:15:41 vm1: State change: No preempting claim, returning to owner
3/10 15:15:41 vm1: Changing state and activity: Preempting/Vacating ->
Owner/Idle
3/10 15:15:41 vm1: State change: IS_OWNER is false
3/10 15:15:41 vm1: Changing state: Owner -> Unclaimed
3/10 15:15:41 DaemonCore: Command received via TCP from host
<128.16.3.12:59944>
3/10 15:15:41 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
3/10 15:15:41 Warning: can't find resource with ClaimId
(<128.16.3.12:36420>#1141852229#32)
3/10 15:15:41 condor_write(): Socket closed when trying to write buffer,
fd is 5
3/10 15:15:41 Buf::write(): condor_write() failed
3/10 15:15:41 DaemonCore: Command received via UDP from host
<128.16.3.12:33113>
3/10 15:15:41 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
3/10 15:15:41 Warning: can't find resource with ClaimId
(<128.16.3.12:36420>#1141852229#32)

3/10 15:15:41 (21.0) (31570):Starting
pvmd: /home/condor/condor-6.7.17/sbin/condor_pvmd -d0x11c
3/10 15:15:41 (21.0) (31570):PVM is pid 31571
3/10 15:15:41 (21.0) (31570):pvmd response: /tmp/fileL3SMMJ
3/10 15:15:41 (21.0) (31570):PVMSOCK=/tmp/fileL3SMMJ
3/10 15:15:41 (21.0) (31570):pvm_fd = 2, mytid = t40001
3/10 15:15:41 (21.0) (31570):Entered StartWaitingHosts()
3/10 15:15:41 (21.0) (31570):Ok to start waiting hosts
3/10 15:15:41 (21.0) (31570):PVMd message is SM_STHOST from t80000000
3/10 15:15:41 (21.0) (31570):SM_STHOST: 80000 "" "128.16.3.12"
"$PVM_ROOT/lib/pvmd -s -d0x11c -nvm1@xxxxxxxxxxxxxxxxxxxx 1
8010030c:8155 4080 2 8010030c:0000"
3/10 15:15:41 (21.0) (31570):New process for proc 0
3/10 15:15:41 (21.0) (31570):AllocProc() returning 0
3/10 15:15:41 (21.0) (31570):Shadow: Entering
multi_send_job(vm1@xxxxxxxxxxxxxxxxxxxx)
3/10 15:15:41 (21.0) (31570):Requesting Alternate Starter 1
3/10 15:15:41 (21.0) (31570):Shadow: Request to run a job was ACCEPTED
3/10 15:15:41 (21.0) (31570):Shadow: RSC_SOCK connected, fd = 4
3/10 15:15:41 (21.0) (31570):Multi_Shadow: CLIENT_LOG connected, fd = 5
3/10 15:15:41 (21.0) (31570):in new_timer()
3/10 15:15:41 (21.0) (31570):Timer List
3/10 15:15:41 (21.0) (31570):^^^^^ ^^^^
3/10 15:15:41 (21.0) (31570):id = 0, when = 180
3/10 15:15:41 (21.0) (31570):Shadow: send_pvm_job_info
3/10 15:15:41 (21.0) (31570):send_pvm_job_info: arg = -s -d0x11c
-nvm1@xxxxxxxxxxxxxxxxxxxx 1 8010030c:8155 4080 2 8010030c:0000 -f
3/10 15:15:41 (21.0) (31570):On LogSock for host
vm1@xxxxxxxxxxxxxxxxxxxx:
-> [pvmd pid31573] 
3/10 15:15:41 (21.0) (31570):On LogSock for host
vm1@xxxxxxxxxxxxxxxxxxxx:
-> 03/10 15:15:41 version 3.4.2
-> [pvmd pid31573] 
3/10 15:15:41 (21.0) (31570):On LogSock for host
vm1@xxxxxxxxxxxxxxxxxxxx:
-> 03/10 15:15:41 ddpro 2316 tdpro 1318
-> [pvmd pid31573] 03/10 15:15:41 main() debug mask is 0x11c
(tsk,slv,hst,sch)
3/10 15:15:41 (21.0) (31570):In cancel_timer()
3/10 15:15:41 (21.0) (31570):Timer List
3/10 15:15:41 (21.0) (31570):^^^^^ ^^^^
3/10 15:15:41 (21.0) (31570):Received PVM info from
vm1@xxxxxxxxxxxxxxxxxxxx
3/10 15:15:41 (21.0) (31570):Adding host vm1@xxxxxxxxxxxxxxxxxxxx to
STARTACK msg.
3/10 15:15:41 (21.0) (31570):Num Hosts to pack = 1
3/10 15:15:41 (21.0) (31570):Packing tid t80000 with reply ddpro<2316>
arch<LINUX> ip<8010030c:8157> mtu<4080> dsig<4229185>
3/10 15:15:41 (21.0) (31570):Sending SM_STHOSTACK to PVMd
3/10 15:15:41 (21.0) (31570):PVMd message is SM_ADDACK from t80000000
3/10 15:15:41 (21.0) (31570):Host #0(vm1@xxxxxxxxxxxxxxxxxxxx) has been
added to PVM, pvmd_tid = 80080000
3/10 15:15:41 (21.0) (31570):SendNotification(kind = 3, tid = t80080000)
3/10 15:15:41 (21.0) (31570):pvm_machines_starting = 0(should be 0)
3/10 15:15:41 (21.0) (31570):StartLocalProcess:
= /home/condor/condor-6.7.17/examples/PVM/master_sum < in_sum > out_sum
>& err_sum
3/10 15:15:41 (21.0) (31570):open_max = 1024
3/10 15:15:41 (21.0) (31570):Local PVM process pid = 31574
3/10 15:15:41 (21.0) (31570):Entered StartWaitingHosts()
3/10 15:15:41 (21.0) (31570):Ok to start waiting hosts
3/10 15:15:41 (21.0) (31570):deadpid = 31574
3/10 15:15:41 (21.0) (31570):Local process for job 21.0 died with status
0x6e00
3/10 15:15:41 (21.0) (31570):SendNotification(kind = 1, tid = t0)
3/10 15:15:41 (21.0) (31570):Multi_Shadow: Shutting down...
3/10 15:15:41 (21.0) (31570):Updated class:
3/10 15:15:41 (21.0) (31570):#0: 0 (1, 1) has 1
3/10 15:15:41 (21.0) (31570):signal_startd( vm1@xxxxxxxxxxxxxxxxxxxx,
443 )
3/10 15:15:41 (21.0) (31570):in new_timer()
3/10 15:15:41 (21.0) (31570):Timer List
3/10 15:15:41 (21.0) (31570):^^^^^ ^^^^
3/10 15:15:41 (21.0) (31570):id = 1, when = 300
3/10 15:15:41 (21.0) (31570):deadpid = 0
3/10 15:15:41 (21.0) (31570):No more dead processes(errno = 115)
3/10 15:15:41 (21.0) (31570):deadpid = 31571
3/10 15:15:41 (21.0) (31570):Lost local pvmd termsig = 9, retcode = 0
3/10 15:15:41 (21.0) (31570):deadpid = -1
3/10 15:15:41 (21.0) (31570):No more dead processes(errno = 10)
3/10 15:15:41 (21.0) (31570):Subproc 32767 exited, termsig = 0, coredump
= 0, retcode = 15
3/10 15:15:41 (21.0) (31570):ru_utime = 0.000000
3/10 15:15:41 (21.0) (31570):ru_stime = 0.000000
3/10 15:15:41 (21.0) (31570):IO: Failed to read packet header
3/10 15:15:41 (21.0) (31570):Failed to get syscall_code for proc 0
removing..
3/10 15:15:41 (21.0) (31570):Removing Proc 0(t0) from Job
3/10 15:15:41 (21.0) (31570):remove starter for host
vm1@xxxxxxxxxxxxxxxxxxxx, removing the host too
3/10 15:15:41 (21.0) (31570):RemoveHost: Sending HostDelete notify on
t80080000
3/10 15:15:41 (21.0) (31570):SendNotification(kind = 2, tid = t80080000)
3/10 15:15:41 (21.0) (31570):signal_startd( vm1@xxxxxxxxxxxxxxxxxxxx,
443 )
3/10 15:15:41 (21.0) (31570):Updated class:
3/10 15:15:41 (21.0) (31570):#0: 0 (1, 1) has 0
3/10 15:15:41 (21.0) (31570):Trying to
unlink /home/condor/condor-6.7.17/local.supaman/spool/cluster21.proc0.subproc0
3/10 15:15:41 (21.0) (31570):All processes completed, job should be
deleted
3/10 15:15:41 (21.0) (31570):MultiShadow Exiting!!!
3/10 15:15:41 (21.0) (31570):user_time = 1 ticks
3/10 15:15:41 (21.0) (31570):sys_time = 2 ticks
3/10 15:15:41 (21.0) (31570):Entering multi_update_job_status()
3/10 15:15:41 (21.0) (31570):Shadow: marked job status "COMPLETED"
3/10 15:15:41 (21.0) (31570):multi_update_job_status() returns 0
3/10 15:15:41 (21.0) (31570):Shadow: Job exited normally with status 110
3/10 15:15:41 (21.0) (31570):Notification = "
exited with status 110, and touched 1 machines.
Start-up was unsuccessful on 0 machines."
3/10 15:15:41 (21.0) (31570):********** Shadow Parent Exiting(100)
**********