
[Condor-users] Shadow exception! and "Create_Process failed to register the job with the ProcD"



Hi,

I have a Condor pool with RHEL 4.0 x86_64 as the central manager and IBM, NT, RHEL 5, Windows Vista, and HP machines as clients. Whenever I run a job on the IBM machine or on another machine, it is put into the idle state,
and I get the following errors in my logs.

Whenever my executable gives an error, the error is transferred back as output.
But when I submit a job, it stays in the idle state and never completes.

I am new to Condor. Can anyone please help?

Here are the logs that were written.

condor_master's log

SCHEDD.log
4/18 19:38:52 (pid:5847) Shadow pid 11902 for job 93.0 exited with status 4
4/18 19:38:52 (pid:5847) ERROR: Shadow exited with job exception code!

Shadow.log
4/16 21:10:54 DaemonCore: Command Socket at <10.20.3.180:37120>
4/16 21:10:54 Initializing a VANILLA shadow for job 43.0
4/16 21:10:54 (43.0) (3785): Request to run on <10.20.4.30:34497> was ACCEPTED
4/16 21:10:56 (43.0) (3785): ERROR "Error from starter on slot2@xxxxxxxxxxxxxxxxxxxxxx: Create_Process failed to register the job with the ProcD" at line 649 in file pseudo_ops.C


4/18 19:43:52 DaemonCore: Command Socket at <10.20.3.180:36642>
4/18 19:43:52 Initializing a VANILLA shadow for job 92.0
4/18 19:43:52 (92.0) (12020): Request to run on <10.20.4.30:34497> was ACCEPTED
4/18 19:43:53 (92.0) (12020): ERROR "Can no longer talk to condor_starter <10.20.4.30:34497>" at line 121 in file NTreceivers.C

Master.log
4/18 09:53:06 DaemonCore: Command Socket at <10.20.3.180:32825>
4/18 09:53:06 passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
4/18 09:53:06 passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
4/18 09:53:06 Started DaemonCore process "/opt/condor-7.0.1/sbin/condor_collector", pid and pgroup = 5841
4/18 09:53:09 Started DaemonCore process "/opt/condor-7.0.1/sbin/condor_negotiator", pid and pgroup = 5846
4/18 09:53:09 Started DaemonCore process "/opt/condor-7.0.1/sbin/condor_schedd", pid and pgroup = 5847
4/18 09:53:09 Started DaemonCore process "/opt/condor-7.0.1/sbin/condor_startd", pid and pgroup = 5851
4/18 10:53:09 Preen pid is 7611
4/18 10:53:09 Child 7611 died, but not a daemon -- Ignored
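
The getpwnam("condor") lines above say that no account named "condor" was found on this machine. Below is a minimal sketch for checking the same lookup by hand on a Unix host. The helper name account_exists is made up for illustration, and the logs do not show whether this warning is related to the ProcD failure.

```python
import pwd


def account_exists(name):
    """Return True if the local user database has an entry for `name`."""
    try:
        # pwd.getpwnam raises KeyError when the account cannot be found,
        # which is the same condition the log reports as "user not found".
        pwd.getpwnam(name)
    except KeyError:
        return False
    return True


print("condor account present:", account_exists("condor"))
```

Running it on both the central manager and the execute machine would show whether the account is missing on one or both.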

Job log file :
007 (092.000.000) 04/18 19:53:52 Shadow exception!
    Can no longer talk to condor_starter <10.20.4.30:34497>
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job


condor_client's log
StartLog
4/18 19:59:35 Create_Process: child failed becuase it failed to register itself with the ProcD
4/18 19:59:35 slot2: ERROR: exec_starter failed!
4/18 19:59:35 slot2: ERROR: exec_starter returned 0
4/18 19:59:35 slot2: Got activate_claim request from shadow (<10.20.3.180:36746>)
4/18 19:59:35 slot2: Remote job ID is 94.0
4/18 19:59:35 mkfifo of /tmp/condor-lock.pv300.928575252954946/procd_pipe.STARTD.442402.0 error: No such file or directory (2)
4/18 19:59:35 failed to initialize named pipe at /tmp/condor-lock.pv300.928575252954946/procd_pipe.STARTD.442402.0
4/18 19:59:35 LocalClient: error initializing NamedPipeReader
4/18 19:59:35 ProcFamilyClient: failed to start connection with ProcD
4/18 19:59:35 register_subfamily: ProcD communication error
4/18 19:59:35 Create_Process: error registering family for pid 372916
4/18 19:59:35 Create_Process: child failed becuase it failed to register itself with the ProcD
4/18 19:59:35 slot2: ERROR: exec_starter failed!
4/18 19:59:35 slot2: ERROR: exec_starter returned 0
4/18 19:59:35 slot2: Got activate_claim request from shadow (<10.20.3.180:36748>)
4/18 19:59:35 slot2: Remote job ID is 94.0
4/18 19:59:35 mkfifo of /tmp/condor-lock.pv300.928575252954946/procd_pipe.STARTD.442402.0 error: No such file or directory (2)
4/18 19:59:35 failed to initialize named pipe at /tmp/condor-lock.pv300.928575252954946/procd_pipe.STARTD.442402.0
4/18 19:59:35 LocalClient: error initializing NamedPipeReader
4/18 19:59:35 ProcFamilyClient: failed to start connection with ProcD
4/18 19:59:35 register_subfamily: ProcD communication error
4/18 19:59:35 Create_Process: error registering family for pid 372920
4/18 19:59:35 Create_Process: child failed becuase it failed to register itself with the ProcD
4/18 19:59:35 slot2: ERROR: exec_starter failed!
4/18 19:59:35 slot2: ERROR: exec_starter returned 0
4/18 19:59:35 slot2: State change: received RELEASE_CLAIM command
4/18 19:59:35 slot2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
4/18 19:59:35 slot2: State change: No preempting claim, returning to owner
4/18 19:59:35 slot2: Changing state and activity: Preempting/Vacating -> Owner/Idle
4/18 19:59:35 slot2: State change: IS_OWNER is false
4/18 19:59:35 slot2: Changing state: Owner -> Unclaimed
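
The mkfifo lines above fail with error 2 (No such file or directory), which is what mkfifo reports when the directory that should hold the pipe does not exist. Below is a minimal sketch that reproduces that condition in a scratch directory. The directory and pipe names are hypothetical stand-ins, not the real Condor paths, and I am only assuming the missing directory is the cause here.

```python
import errno
import os
import tempfile


def try_mkfifo(pipe_path):
    """Try to create a named pipe; return 0 on success or the errno on failure."""
    try:
        os.mkfifo(pipe_path)
    except OSError as exc:
        return exc.errno
    return 0


with tempfile.TemporaryDirectory() as base:
    # Hypothetical stand-in for the lock directory; deliberately not created yet.
    missing_dir = os.path.join(base, "condor-lock.example")
    pipe_path = os.path.join(missing_dir, "procd_pipe.EXAMPLE")

    # Parent directory is absent, so mkfifo fails with ENOENT (errno 2).
    missing_result = try_mkfifo(pipe_path)

    # Once the parent directory exists, the same call succeeds.
    os.mkdir(missing_dir)
    present_result = try_mkfifo(pipe_path)

print("missing dir gives ENOENT:", missing_result == errno.ENOENT)
print("after mkdir, result:", present_result)
```

If the same thing is happening on the execute machine, checking whether the /tmp/condor-lock.* directory named in the StartLog still exists would confirm it.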

Sched.log
4/16 09:44:55 (pid:401526) DaemonCore: Command Socket at <10.20.4.30:34417>
4/16 09:44:55 (pid:401526) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
4/16 09:44:55 (pid:401526) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
4/16 09:44:55 (pid:401526) History file rotation is enabled.
4/16 09:44:55 (pid:401526)   Maximum history file size is: 20971520 bytes
4/16 09:44:55 (pid:401526)   Number of rotated history files is: 2

MasterLog
4/16 09:44:55 DaemonCore: Command Socket at <10.20.4.30:34416>
4/16 09:44:55 passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
4/16 09:44:55 passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
4/16 09:44:55 Started DaemonCore process "/u1/pv/.condor-7.0.1/sbin/condor_schedd", pid and pgroup = 401526
4/16 09:44:55 Started DaemonCore process "/u1/pv/.condor-7.0.1/sbin/condor_startd", pid and pgroup = 356550
4/16 10:44:55 Preen pid is 430084

Thanks in advance,

javed