
Re: [Condor-users] Shadow exception!



I'm posting this so it's archived for others... it turns out my processing nodes are out of UDP ports.  This is a bug I've seen with Condor before, related to not using DNS.  I'm hoping a restart of Condor with DNS turned back on will resolve it.
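
For anyone else hitting this, a rough way to confirm the symptom and see what a node actually has in effect (a sketch only; the socket count that counts as "out of ports" depends on the local ephemeral port range):

    # Count UDP sockets currently open on a processing node; a count close to
    # the size of the ephemeral port range suggests exhaustion
    ss -u -a -n | wc -l

    # NO_DNS = True makes Condor construct hostnames from IP addresses instead
    # of querying DNS; check what the daemons actually see
    condor_config_val NO_DNS DEFAULT_DOMAIN_NAME

Turning DNS back on amounts to setting NO_DNS = False (or removing it) in condor_config and restarting the Condor daemons.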

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Shrum, Donald C
Sent: Tuesday, November 06, 2012 9:08 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Shadow exception!

I've found I get this error in the ShadowLog on my submit node.

11/06/12 09:04:56 (1755.0) (28510): ERROR "Can no longer talk to condor_starter <10.178.6.163:60689>" at line 222 in file /home/condor/execute/dir_491/userdir/src/condor_shadow.V6.1/NTreceivers.cpp

On a lark I tried just rebooting my submit node, but that didn't seem to fix much.

We are running Red Hat 6.x on the cluster.  My next best guess is to update Condor from 7.8.4 to 7.8.6.
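
Before upgrading, it may also be worth turning up shadow and starter logging for more detail on why the connection drops (standard debug knobs; D_FULLDEBUG is verbose, so it should be reverted once done):

    # In condor_config on the submit node (shadow side) and the execute
    # nodes (starter side)
    SHADOW_DEBUG = D_FULLDEBUG
    STARTER_DEBUG = D_FULLDEBUG

    # Push the new settings to the running daemons without a restart
    condor_reconfig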

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Shrum, Donald C
Sent: Monday, November 05, 2012 11:02 PM
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] Shadow exception!

All jobs submitted to my cluster return an error that reads - 
007 (1752.000.000) 11/05 22:48:06 Shadow exception!
        Can no longer talk to condor_starter <10.178.6.159:50295>
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job


The SchedLog on the submit node shows errors that look like this -
11/05/12 22:56:19 (pid:1818) Started shadow for job 1753.2 on slot6@xxxxxxxxxxxxxxxxxx <10.178.6.159:50295> for dcshrum, (shadow pid = 16667)
11/05/12 22:56:20 (pid:1818) Shadow pid 16664 for job 1753.0 exited with status 4
11/05/12 22:56:20 (pid:1818) ERROR: Shadow exited with job exception code!
11/05/12 22:56:20 (pid:1818) Checking consistency running and runnable jobs
11/05/12 22:56:20 (pid:1818) Tables are consistent

The processing nodes all show errors that look like this -  
11/05/12 22:58:21 Create_Process(/usr/sbin/condor_starter): child failed because it failed to register itself with the ProcD
11/05/12 22:58:21 slot12: ERROR: exec_starter failed!
11/05/12 22:58:21 slot12: ERROR: exec_starter returned 0
11/05/12 22:58:23 slot6: Got activate_claim request from shadow (10.175.14.0)
11/05/12 22:58:23 slot6: Remote job ID is 133559.0
11/05/12 22:58:23 Sock::bind failed: errno = 98 Address already in use
11/05/12 22:58:23 Sock::bind failed: errno = 98 Address already in use
11/05/12 22:58:23 Sock::bind failed: errno = 98 Address already in use
11/05/12 22:58:23 error writing to named pipe: watchdog pipe has closed
11/05/12 22:58:23 LocalClient: error sending message to server
11/05/12 22:58:23 ProcFamilyClient: failed to start connection with ProcD
11/05/12 22:58:23 register_subfamily: ProcD communication error
11/05/12 22:58:23 Create_Process: error registering family for pid 31831
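
The repeated bind() failures with errno 98 (EADDRINUSE) make it worth checking whether Condor is confined to a narrow port range on the execute nodes (LOWPORT/HIGHPORT are the standard knobs; whether they are actually set here is an assumption on my part):

    # If LOWPORT/HIGHPORT pin Condor to a small range, bind() starts failing
    # with EADDRINUSE as soon as the range fills up
    condor_config_val LOWPORT HIGHPORT

    # See how many sockets are currently bound on the node
    ss -tuan | wc -l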


It looks like something network-related, although our Condor cluster is dedicated and all the nodes are on an internal network.  I'm stuck right now, and any pointers on how to debug this would be appreciated.
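
One basic connectivity check that can be run from the submit node (a sketch; the address is the startd address from the log above, and READ is just one authorization level to try):

    # Test whether the submit node can reach and authorize against the daemon
    # the shadow was trying to talk to
    condor_ping -addr "<10.178.6.159:50295>" READ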

Thanks,
Don 
FSU HPC

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/