
[Condor-users] Trivial jobs occasionally running for hours



We're in the process of refreshing our Condor provision and have a cluster with a 7.4.3 Linux central manager/submit node and 7.4.2 Windows 7 worker nodes.

Occasionally we see trivial jobs being accepted by nodes and then (apparently) running for hours.  At the moment the test case is a completely trivial job (a batch script doing "echo hello") and I'm queuing 100 instances of it.

The workers are staying online all the time (they get pinged every five minutes as part of our data gathering).  

I sometimes see the following in the STARTD log, but not associated with the slot or job which is hanging around ...

10/13 12:45:54 slot3: State change: claim-activation protocol successful
10/13 12:45:54 slot3: Changing activity: Idle -> Busy
10/13 12:45:54 condor_read() failed: recv() returned -1, errno = 10054 , reading 5 bytes from <127.0.0.1:51442>.
10/13 12:45:54 IO: Failed to read packet header
10/13 12:45:54 Starter pid 3752 exited with status 4
10/13 12:45:54 slot3: State change: starter exited
10/13 12:45:54 slot3: Changing activity: Busy -> Idle

Picking one of the four jobs that are hanging around, I see the following in the ShadowLog on the central manager/submit node ...

10/13 12:19:11 Initializing a VANILLA shadow for job 1772.21
10/13 12:19:11 (1772.21) (13082): Request to run on slot1@xxxxxxxxxxxxxxxxxxxxxxx <10.15.0.62:49389> was ACCEPTED
10/13 12:19:11 (1772.21) (13082): Sock::bindWithin - failed to bind any port within (9600 ~ 9700)
10/13 12:19:11 (1772.21) (13082): ERROR: SECMAN:2003:TCP auth connection to <10.8.232.5:9605> failed.
10/13 12:19:11 (1772.21) (13082): Failed to send alive to <10.8.232.5:9605>, will try again...
10/13 12:19:18 (1772.21) (13082): Sock::bindWithin - failed to bind any port within (9600 ~ 9700)
10/13 12:19:18 (1772.21) (13082): Failed to connect to transfer queue manager for job 1772.21 (/home/ucs/200/nph9/esw3/ex.err): CEDAR:6001:Failed to connect to <10.8.232.5:9697>.
10/13 12:19:18 (1772.21) (13082): Sending NO GoAhead for 10.15.0.62 to send /home/ucs/200/nph9/esw3/ex.err.
10/13 12:19:18 (1772.21) (13082): Failed to connect to transfer queue manager for job 1772.21 (/home/ucs/200/nph9/esw3/ex.err): CEDAR:6001:Failed to connect to <10.8.232.5:9697>.
10/13 12:34:18 (1772.21) (13082): Sock::bindWithin - failed to bind any port within (9600 ~ 9700)
10/13 12:34:18 (1772.21) (13082): Can't connect to queue manager: CEDAR:6001:Failed to connect to <10.8.232.5:9697>

Whilst the jobs were being farmed out I'd also see occasional failures of condor_status with the same "CEDAR:6001" error.

My instinct is that something on the manager/submit node is running out of resources (file descriptors, ports, or something else I've not thought of) and Condor's view of the jobs then gets out of sync with reality.
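
In case it's useful, this is the quick and dirty check I've been running on the submit node to see whether that 9600-9700 range really is filling up.  It just counts local sockets in the range from /proc/net/tcp, broken down by TCP state.  The assumption that the shadows are confined to 9600-9700 by our LOWPORT/HIGHPORT (or OUT_LOWPORT/OUT_HIGHPORT) settings is mine, so LOW/HIGH would need adjusting to whatever condor_config_val actually reports.

#!/usr/bin/env python
# Count local TCP sockets in the port range the shadows are being
# bound into.  LOW/HIGH assume our LOWPORT/HIGHPORT (or
# OUT_LOWPORT/OUT_HIGHPORT) settings are what produces the
# "bindWithin (9600 ~ 9700)" messages - adjust as appropriate.

LOW, HIGH = 9600, 9700

# TCP state codes as used in /proc/net/tcp
STATES = {'01': 'ESTABLISHED', '02': 'SYN_SENT', '03': 'SYN_RECV',
          '04': 'FIN_WAIT1', '05': 'FIN_WAIT2', '06': 'TIME_WAIT',
          '07': 'CLOSE', '08': 'CLOSE_WAIT', '09': 'LAST_ACK',
          '0A': 'LISTEN', '0B': 'CLOSING'}

counts = {}
for path in ('/proc/net/tcp', '/proc/net/tcp6'):
    try:
        entries = open(path).readlines()[1:]    # skip the header line
    except IOError:
        continue
    for entry in entries:
        fields = entry.split()
        port = int(fields[1].split(':')[-1], 16)    # local port, in hex
        if LOW <= port <= HIGH:
            state = STATES.get(fields[3], fields[3])
            counts[state] = counts.get(state, 0) + 1

print("%d of %d ports in %d-%d in use" %
      (sum(counts.values()), HIGH - LOW + 1, LOW, HIGH))
for state in sorted(counts):
    print("  %-12s %d" % (state, counts[state]))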

(a) Is there anything in particular I should be looking at in terms of system limits on the manager?  I've looked at http://www.cs.wisc.edu/condor/condorg/linux_scalability.html and don't think I'm hitting any of those (a rough sketch of how I've been spot-checking descriptor usage is below, after (b)), but I'm happy to be told that is where I should be looking.

(b) Any other logs/tools I should be looking at to help diagnose this?
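
For reference on (a), this is roughly how I've been spot-checking descriptor usage for the Condor processes on the manager.  It walks /proc, picks out anything whose process name starts with condor_ (my assumption being that the schedd, shadows and so on are all named that way), and compares the number of open descriptors against each process's soft limit.  It needs enough privilege to read /proc/<pid>/fd for those processes.

#!/usr/bin/env python
# Per-process open file descriptor counts for the Condor daemons and
# shadows, compared with the soft "Max open files" limit.  The
# "condor_" name match is an assumption about how the processes are
# named; run with enough privilege to read /proc/<pid>/fd.

import os

for pid in os.listdir('/proc'):
    if not pid.isdigit():
        continue
    try:
        # process name is the part in parentheses in /proc/<pid>/stat
        stat = open('/proc/%s/stat' % pid).read()
        name = stat[stat.index('(') + 1:stat.rindex(')')]
        if not name.startswith('condor_'):
            continue
        nfds = len(os.listdir('/proc/%s/fd' % pid))
        soft = '?'
        for line in open('/proc/%s/limits' % pid):
            if line.startswith('Max open files'):
                soft = line.split()[3]    # soft limit column
        print("%-16s pid %-6s %5d fds (soft limit %s)" %
              (name, pid, nfds, soft))
    except (IOError, OSError):
        pass    # process went away or we can't read it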

Thanks
Paul
-- 
Paul Haldane   
Information Systems and Services   
Newcastle University