
[Condor-users] Local PVM process dies with status 0x6e00



Hi everyone,
I am having a problem that I can't figure out how to solve. When I use the PVM universe, the shadow process on the submitting machine always dies with status 0x6e00 before the task ever starts executing on the matched machine. As a result, the matched machine never runs the task. I have searched through the archives but couldn't find any answers. Attached are my log files.
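For what it's worth, the raw wait status 0x6e00 decodes to a normal exit with code 110 (0x6e), which matches the "Job exited normally with status 110" line further down in the ShadowLog. A quick sketch of the decoding (the status value is taken from the log; nothing else is assumed):

```python
import os

status = 0x6e00  # raw wait() status reported in the ShadowLog

# In a wait() status, the low byte holds the terminating signal
# (0 means the process exited normally) and the next byte holds
# the exit code.
assert os.WIFEXITED(status)      # low byte is 0: normal exit, no signal
print(os.WEXITSTATUS(status))    # (0x6e00 >> 8) & 0xff == 0x6e == 110
```

So the question is really why the local PVM process (master_sum) exits with code 110 almost immediately after being started.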

1). ShadowLog of the submitting machine

6/14 08:06:45 (?.?) (19812):********** Multi_Shadow starting up **********
6/14 08:06:45 (?.?) (19812):uid=0, euid=500, gid=0, egid=500
6/14 08:06:45 (?.?) (19812):My_Filesystem_Domain = "your.domain"
6/14 08:06:45 (?.?) (19812):My_UID_Domain = "your.domain"
6/14 08:06:45 (?.?) (19812):Shadow reading via ASCII
6/14 08:06:45 (?.?) (19812):First Line: 28 0 1
6/14 08:06:45 (28.0) (19812):Created class:
6/14 08:06:45 (28.0) (19812):#0: 0 (1, 1) has 0
6/14 08:06:45 (28.0) (19812):New process for proc 0
6/14 08:06:45 (28.0) (19812):AllocProc() returning 0
6/14 08:06:45 (28.0) (19812):Machine from schedd: <192.168.1.103:32973> <192.168.1.103:32973>#1150182634#38 0
6/14 08:06:45 (28.0) (19812):Machine Line: wolf3 0
6/14 08:06:45 (28.0) (19812):Machines now cur = 1 desire = 1
6/14 08:06:45 (28.0) (19812):Updated class:
6/14 08:06:45 (28.0) (19812):#0: 0 (1, 1) has 1
6/14 08:06:45 (28.0) (19812):Starting pvmd: /home/condor/condor-install/sbin/condor_pvmd -d0x11c
6/14 08:06:45 (28.0) (19812):PVM is pid 19813
6/14 08:06:45 (28.0) (19812):pvmd response: /tmp/fileGDivVH
6/14 08:06:45 (28.0) (19812):PVMSOCK=/tmp/fileGDivVH
6/14 08:06:45 (28.0) (19812):pvm_fd = 4, mytid = t40001
6/14 08:06:45 (28.0) (19812):Entered StartWaitingHosts()
6/14 08:06:45 (28.0) (19812):Ok to start waiting hosts
6/14 08:06:45 (28.0) (19812):PVMd message is SM_STHOST from t80000000
6/14 08:06:45 (28.0) (19812):SM_STHOST: 80000 "" "192.168.1.103" "$PVM_ROOT/lib/pvmd -s -d0x11c -nwolf3 1 c0a80166:8eb4 4080 2 c0a80167:0000"
6/14 08:06:45 (28.0) (19812):New process for proc 0
6/14 08:06:45 (28.0) (19812):AllocProc() returning 0
6/14 08:06:45 (28.0) (19812):Shadow: Entering multi_send_job(wolf3)
6/14 08:06:45 (28.0) (19812):Requesting Alternate Starter 1
6/14 08:06:45 (28.0) (19812):Shadow: Request to run a job was ACCEPTED
6/14 08:06:45 (28.0) (19812):Shadow: RSC_SOCK connected, fd = 6
6/14 08:06:45 (28.0) (19812):Multi_Shadow: CLIENT_LOG connected, fd = 7
6/14 08:06:45 (28.0) (19812):in new_timer()
6/14 08:06:45 (28.0) (19812):Timer List
6/14 08:06:45 (28.0) (19812):^^^^^ ^^^^
6/14 08:06:45 (28.0) (19812):id = 0, when = 180
6/14 08:06:45 (28.0) (19812):Shadow: send_pvm_job_info
6/14 08:06:45 (28.0) (19812):send_pvm_job_info: arg = -s -d0x11c -nwolf3 1 c0a80166:8eb4 4080 2 c0a80167:0000 -f
6/14 08:06:45 (28.0) (19812):On LogSock for host wolf3:
-> [pvmd pid17803] 06/14 08:07:36 version 3.4.2
6/14 08:06:45 (28.0) (19812):On LogSock for host wolf3:
-> [pvmd pid17803] 06/14 08:07:36 ddpro 2316 tdpro 1318
-> [pvmd pid17803] 06/14 08:07:36 main() debug mask is 0x11c (tsk,slv,hst,sch)
6/14 08:06:45 (28.0) (19812):In cancel_timer()
6/14 08:06:45 (28.0) (19812):Timer List
6/14 08:06:45 (28.0) (19812):^^^^^ ^^^^
6/14 08:06:45 (28.0) (19812):Received PVM info from wolf3
6/14 08:06:45 (28.0) (19812):Adding host wolf3 to STARTACK msg.
6/14 08:06:45 (28.0) (19812):Num Hosts to pack = 1
6/14 08:06:45 (28.0) (19812):Packing tid t80000 with reply ddpro<2316> arch<LINUX> ip<c0a80167:8423> mtu<4080> dsig<4229185>
6/14 08:06:45 (28.0) (19812):Sending SM_STHOSTACK to PVMd
6/14 08:06:45 (28.0) (19812):PVMd message is SM_ADDACK from t80000000
6/14 08:06:45 (28.0) (19812):Host #0(wolf3) has been added to PVM, pvmd_tid = 80080000
6/14 08:06:45 (28.0) (19812):SendNotification(kind = 3, tid = t80080000)
6/14 08:06:45 (28.0) (19812):pvm_machines_starting = 0(should be 0)
6/14 08:06:45 (28.0) (19812):StartLocalProcess: = /home/condor/examples/PVM/master_sum < in_sum > out_sum >& err_sum
6/14 08:06:45 (28.0) (19812):open_max = 1024
6/14 08:06:45 (28.0) (19812):Local PVM process pid = 19814
6/14 08:06:45 (28.0) (19812):Entered StartWaitingHosts()
6/14 08:06:45 (28.0) (19812):Ok to start waiting hosts
6/14 08:06:45 (28.0) (19812):deadpid = 19814
6/14 08:06:45 (28.0) (19812):Local process for job 28.0 died with status 0x6e00
6/14 08:06:45 (28.0) (19812):SendNotification(kind = 1, tid = t0)
6/14 08:06:45 (28.0) (19812):Multi_Shadow: Shutting down...
6/14 08:06:45 (28.0) (19812):Updated class:
6/14 08:06:45 (28.0) (19812):#0: 0 (1, 1) has 1
6/14 08:06:45 (28.0) (19812):signal_startd( wolf3, 443 )
6/14 08:06:45 (28.0) (19812):in new_timer()
6/14 08:06:45 (28.0) (19812):Timer List
6/14 08:06:45 (28.0) (19812):^^^^^ ^^^^
6/14 08:06:45 (28.0) (19812):id = 1, when = 300
6/14 08:06:45 (28.0) (19812):deadpid = 19813
6/14 08:06:45 (28.0) (19812):Lost local pvmd termsig = 9, retcode = 0
6/14 08:06:45 (28.0) (19812):deadpid = -1
6/14 08:06:45 (28.0) (19812):No more dead processes(errno = 10)
6/14 08:06:45 (28.0) (19812):deadpid = -1
6/14 08:06:45 (28.0) (19812):No more dead processes(errno = 10)
6/14 08:06:45 (28.0) (19812):Subproc 32767 exited, termsig = 0, coredump = 0, retcode = 15
6/14 08:06:45 (28.0) (19812):ru_utime = 0.000000
6/14 08:06:45 (28.0) (19812):ru_stime = 0.000000
6/14 08:06:45 (28.0) (19812):IO: Failed to read packet header
6/14 08:06:45 (28.0) (19812):Failed to get syscall_code for proc 0 removing..
6/14 08:06:45 (28.0) (19812):Removing Proc 0(t0) from Job
6/14 08:06:45 (28.0) (19812):remove starter for host wolf3, removing the host too
6/14 08:06:45 (28.0) (19812):RemoveHost: Sending HostDelete notify on t80080000
6/14 08:06:45 (28.0) (19812):SendNotification(kind = 2, tid = t80080000)
6/14 08:06:45 (28.0) (19812):signal_startd( wolf3, 443 )
6/14 08:06:45 (28.0) (19812):Updated class:
6/14 08:06:45 (28.0) (19812):#0: 0 (1, 1) has 0
6/14 08:06:45 (28.0) (19812):Trying to unlink /home/condor/condor-install/local.wolf2/spool/cluster28.proc0.subproc0
6/14 08:06:45 (28.0) (19812):All processes completed, job should be deleted
6/14 08:06:45 (28.0) (19812):MultiShadow Exiting!!!
6/14 08:06:45 (28.0) (19812):user_time = 6 ticks
6/14 08:06:45 (28.0) (19812):sys_time = 6 ticks
6/14 08:06:45 (28.0) (19812):Entering multi_update_job_status()
6/14 08:06:45 (28.0) (19812):Shadow: marked job status "COMPLETED"
6/14 08:06:45 (28.0) (19812):multi_update_job_status() returns 0
6/14 08:06:45 (28.0) (19812):Shadow: Job exited normally with status 110
6/14 08:06:45 (28.0) (19812):Notification = "
exited with status 110, and touched 1 machines.
Start-up was unsuccessful on 0 machines."
6/14 08:06:45 (28.0) (19812):********** Shadow Parent Exiting(100) **********

2). SchedLog of the submitting machine

6/14 08:06:42 (pid:4883) IO: Failed to read packet header
6/14 08:06:42 (pid:4883) DaemonCore: Command received via UDP from host <192.168.1.102:36529>
6/14 08:06:42 (pid:4883) DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
6/14 08:06:42 (pid:4883) Sent ad to central manager for condor@xxxxxxxxxxx
6/14 08:06:42 (pid:4883) Sent ad to 1 collectors for condor@xxxxxxxxxxx
6/14 08:06:42 (pid:4883) Called reschedule_negotiator()
6/14 08:06:42 (pid:4883) Activity on stashed negotiator socket
6/14 08:06:42 (pid:4883) Negotiating for owner: condor@xxxxxxxxxxx
6/14 08:06:42 (pid:4883) Checking consistency running and runnable jobs
6/14 08:06:42 (pid:4883) Tables are consistent
6/14 08:06:42 (pid:4883) Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
6/14 08:06:44 (pid:4883) About to Create_Process( /home/condor/condor-install/sbin/condor_shadow.pvm, condor_shadow.pvm <192.168.1.102:33041>, ... )
6/14 08:06:44 (pid:4883) In parent, shadow pid = 19812
6/14 08:06:44 (pid:4883) Starting add_shadow_birthdate(28.0)
6/14 08:06:44 (pid:4883) shadow_fd = 12
6/14 08:06:44 (pid:4883) Sending job 28.0 to shadow pid 19812
6/14 08:06:44 (pid:4883) First Line: 28 0 1
6/14 08:06:44 (pid:4883) sending <192.168.1.103:32973> <192.168.1.103:32973>#1150182634#38 0 wolf3
6/14 08:06:45 (pid:4883) IO: Failed to read packet header
6/14 08:06:45 (pid:4883) IO: Failed to read packet header
6/14 08:06:45 (pid:4883) IO: Failed to read packet header
6/14 08:06:45 (pid:4883) IO: Failed to read packet header
6/14 08:06:45 (pid:4883) IO: Failed to read packet header
6/14 08:06:45 (pid:4883) DaemonCore: Command received via TCP from host <192.168.1.103:33082>
6/14 08:06:45 (pid:4883) DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
6/14 08:06:45 (pid:4883) Got VACATE_SERVICE from <192.168.1.103:33082>
6/14 08:06:45 (pid:4883) Sent RELEASE_CLAIM to startd on <192.168.1.103:32973>
6/14 08:06:45 (pid:4883) Match record (<192.168.1.103:32973>, 28, 0) deleted
6/14 08:06:45 (pid:4883) Shadow pid 19812 for job 28.0 exited with status 100
6/14 08:06:47 (pid:4883) Sent owner (0 jobs) ad to 1 collectors
6/14 08:06:48 (pid:4883) IO: Failed to read packet header
6/14 09:05:28 (pid:4883) Cleaning job queue...

It seems that at some point the shadow process dies, and as a result the matched machine is released before any execution takes place.
Please take a look and help if you can.