
[Condor-users] Shadow exception!



Every job submitted to my cluster fails with this error:
007 (1752.000.000) 11/05 22:48:06 Shadow exception!
        Can no longer talk to condor_starter <10.178.6.159:50295>
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job


The SchedLog on the submit node shows errors like this:
11/05/12 22:56:19 (pid:1818) Started shadow for job 1753.2 on slot6@xxxxxxxxxxxxxxxxxx <10.178.6.159:50295> for dcshrum, (shadow pid = 16667)
11/05/12 22:56:20 (pid:1818) Shadow pid 16664 for job 1753.0 exited with status 4
11/05/12 22:56:20 (pid:1818) ERROR: Shadow exited with job exception code!
11/05/12 22:56:20 (pid:1818) Checking consistency running and runnable jobs
11/05/12 22:56:20 (pid:1818) Tables are consistent

The processing nodes all show errors like this:
11/05/12 22:58:21 Create_Process(/usr/sbin/condor_starter): child failed because it failed to register itself with the ProcD
11/05/12 22:58:21 slot12: ERROR: exec_starter failed!
11/05/12 22:58:21 slot12: ERROR: exec_starter returned 0
11/05/12 22:58:23 slot6: Got activate_claim request from shadow (10.175.14.0)
11/05/12 22:58:23 slot6: Remote job ID is 133559.0
11/05/12 22:58:23 Sock::bind failed: errno = 98 Address already in use
11/05/12 22:58:23 Sock::bind failed: errno = 98 Address already in use
11/05/12 22:58:23 Sock::bind failed: errno = 98 Address already in use
11/05/12 22:58:23 error writing to named pipe: watchdog pipe has closed
11/05/12 22:58:23 LocalClient: error sending message to server
11/05/12 22:58:23 ProcFamilyClient: failed to start connection with ProcD
11/05/12 22:58:23 register_subfamily: ProcD communication error
11/05/12 22:58:23 Create_Process: error registering family for pid 31831
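
For anyone sorting through a longer version of this log, a minimal Python sketch along these lines could tally the repeated messages and pull out the lines that mention the ProcD. It only assumes the timestamp and optional "slotN:" prefix seen in the excerpt above; the names LOG_EXCERPT, tally_messages and lines_mentioning are hypothetical helpers for illustration, not part of Condor.

```python
import re
from collections import Counter

# Excerpt copied verbatim from the processing-node log shown above.
LOG_EXCERPT = """\
11/05/12 22:58:21 Create_Process(/usr/sbin/condor_starter): child failed because it failed to register itself with the ProcD
11/05/12 22:58:21 slot12: ERROR: exec_starter failed!
11/05/12 22:58:21 slot12: ERROR: exec_starter returned 0
11/05/12 22:58:23 slot6: Got activate_claim request from shadow (10.175.14.0)
11/05/12 22:58:23 slot6: Remote job ID is 133559.0
11/05/12 22:58:23 Sock::bind failed: errno = 98 Address already in use
11/05/12 22:58:23 Sock::bind failed: errno = 98 Address already in use
11/05/12 22:58:23 Sock::bind failed: errno = 98 Address already in use
11/05/12 22:58:23 error writing to named pipe: watchdog pipe has closed
11/05/12 22:58:23 LocalClient: error sending message to server
11/05/12 22:58:23 ProcFamilyClient: failed to start connection with ProcD
11/05/12 22:58:23 register_subfamily: ProcD communication error
11/05/12 22:58:23 Create_Process: error registering family for pid 31831
"""

# Assumed line shape: "MM/DD/YY HH:MM:SS [slotN: ]message".
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (?:(slot\d+): )?(.+)$"
)


def tally_messages(log_text):
    """Count how often each message occurs, ignoring timestamp and slot."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(3)] += 1
    return counts


def lines_mentioning(log_text, needle):
    """Return the raw log lines that contain the given substring."""
    return [line for line in log_text.splitlines() if needle in line]


if __name__ == "__main__":
    # Print messages from most to least frequent.
    for message, count in tally_messages(LOG_EXCERPT).most_common():
        print(f"{count:3d}  {message}")
    print()
    # Show every line that refers to the ProcD.
    for line in lines_mentioning(LOG_EXCERPT, "ProcD"):
        print(line)
```

Run against the full log file instead of the excerpt, the same tally would show whether the bind failures or the ProcD errors appear first and how often each repeats.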


This looks network-related, although our Condor cluster is dedicated and all of the nodes are on an internal network. The network is still my best guess, but I'm stuck, and any pointers on how to debug this would be appreciated.

Thanks,
Don 
FSU HPC