
Re: [Condor-users] Trivial jobs occasionally running for hours



Condor isn't particularly adept at handling very short-running jobs. So if your jobs are only running an echo and then exiting, you can end up with a lot of machines sitting Claimed+Idle, because the schedd can't spawn shadows fast enough to keep up with the rate at which the jobs start and finish.
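
If you want a quick read on how bad it is, something like this (just a sketch, adjust the constraint to taste) will show slots that are claimed but not actually running anything, and what the schedd thinks is running:

   condor_status -constraint 'State == "Claimed" && Activity == "Idle"'
   condor_q -run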

And then there's Windows. I don't know where to start with Windows. There are certainly a lot of issues that come up when you're using Windows with network storage and you want to hit that storage repeatedly and quickly. And ports: Windows doesn't seem to recycle its pool of available ephemeral ports very quickly, so shadows coming and going in rapid succession will exhaust that pool and you'll end up with no-comm errors between the shadows and the startds.
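
On the Windows side you can get a rough feel for whether you're burning through the ephemeral port range with something like this (a sketch; I haven't checked the exact commands against Windows 7 recently):

   rem count sockets stuck in TIME_WAIT
   netstat -an | find /c "TIME_WAIT"
   rem show the dynamic (ephemeral) port range
   netsh int ipv4 show dynamicport tcp

If the TIME_WAIT count is in the thousands, widening the dynamic range with netsh (or shortening TcpTimedWaitDelay in the registry) is the usual workaround.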

That's my experience at least.

I suggest adding a sleep to your simple jobs so that they don't run for less than 2-3 minutes, and see if that improves the situation. You could also throttle the job start rate at the schedd. Get yourself to a stable state, then start decreasing the sleep time or increasing the job start rate until you see the failures again, and back it off a bit from there.
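
As a rough starting point (the numbers here are guesses, not recommendations), the schedd-side throttle looks something like this in the submit node's config:

   # start at most 5 jobs every 2 seconds
   JOB_START_COUNT = 5
   JOB_START_DELAY = 2

and for the sleep, on Windows the batch script could pad itself out with the usual ping trick, e.g.:

   @echo off
   echo hello
   rem wait roughly 180 seconds before exiting
   ping -n 180 127.0.0.1 > NUL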

- Ian

On Wed, Oct 13, 2010 at 11:25 AM, Paul Haldane <paul.haldane@xxxxxxxxxxxxxxx> wrote:
We're in the process of refreshing our Condor provision and have a cluster with a 7.4.3 Linux central manager/submit node and 7.4.2 Windows 7 worker nodes.

Occasionally we see trivial jobs being accepted by nodes and then (apparently) running for hours.  At the moment the job is completely trivial (a batch script doing "echo hello") and I'm queuing 100 instances of it.

The workers are staying online all the time (they get pinged every five minutes as part of our data gathering).

I sometimes see the following in the startd log, though not associated with the slot or job that's hanging around ...

10/13 12:45:54 slot3: State change: claim-activation protocol successful
10/13 12:45:54 slot3: Changing activity: Idle -> Busy
10/13 12:45:54 condor_read() failed: recv() returned -1, errno = 10054 , reading 5 bytes from <127.0.0.1:51442>.
10/13 12:45:54 IO: Failed to read packet header
10/13 12:45:54 Starter pid 3752 exited with status 4
10/13 12:45:54 slot3: State change: starter exited
10/13 12:45:54 slot3: Changing activity: Busy -> Idle

Picking one of the four jobs that are hanging around I see the following in ShadowLog on the central manager/submit node ...

10/13 12:19:11 Initializing a VANILLA shadow for job 1772.21
10/13 12:19:11 (1772.21) (13082): Request to run on slot1@xxxxxxxxxxxxxxxxxxxxxxx <10.15.0.62:49389> was ACCEPTED
10/13 12:19:11 (1772.21) (13082): Sock::bindWithin - failed to bind any port within (9600 ~ 9700)
10/13 12:19:11 (1772.21) (13082): ERROR: SECMAN:2003:TCP auth connection to <10.8.232.5:9605> failed.
10/13 12:19:11 (1772.21) (13082): Failed to send alive to <10.8.232.5:9605>, will try again...
10/13 12:19:18 (1772.21) (13082): Sock::bindWithin - failed to bind any port within (9600 ~ 9700)
10/13 12:19:18 (1772.21) (13082): Failed to connect to transfer queue manager for job 1772.21 (/home/ucs/200/nph9/esw3/ex.err): CEDAR:6001:Failed to connect to <10.8.232.5:9697>.
10/13 12:19:18 (1772.21) (13082): Sending NO GoAhead for 10.15.0.62 to send /home/ucs/200/nph9/esw3/ex.err.
10/13 12:19:18 (1772.21) (13082): Failed to connect to transfer queue manager for job 1772.21 (/home/ucs/200/nph9/esw3/ex.err): CEDAR:6001:Failed to connect to <10.8.232.5:9697>.
10/13 12:34:18 (1772.21) (13082): Sock::bindWithin - failed to bind any port within (9600 ~ 9700)
10/13 12:34:18 (1772.21) (13082): Can't connect to queue manager: CEDAR:6001:Failed to connect to <10.8.232.5:9697>
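
(I assume the 9600 ~ 9700 range comes from a port restriction in our config, something along the lines of

   LOWPORT = 9600
   HIGHPORT = 9700

though I'm quoting the knob names from memory. With 100 jobs' worth of shadows all trying to bind within 100 ports that seems rather tight, so perhaps that's part of the problem.)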

Whilst the jobs were being farmed out I'd see occasional failures of condor_status with the same "CEDAR:6001" error.

My instinct is that something on the manager/submit node is running out of resources (file descriptors, ports, something else I've not thought of) and Condor's view of things then gets out of sync with reality.

(a) Is there anything in particular I should be looking at in terms of system limits on the manager?  I've looked at http://www.cs.wisc.edu/condor/condorg/linux_scalability.html and don't think I'm hitting any of those limits (but I'm happy to be told that's where I should be looking).
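
For example, is this roughly the right sort of thing to be checking on the manager (guessing that the daemons run as the condor user)?

   # file descriptor limits, per process and system wide
   ulimit -n
   cat /proc/sys/fs/file-nr
   # open files/sockets belonging to the Condor daemons and shadows
   lsof -u condor | wc -l
   # ephemeral port range and sockets stuck in TIME_WAIT
   cat /proc/sys/net/ipv4/ip_local_port_range
   netstat -ant | grep -c TIME_WAIT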

(b) Any other logs/tools I should be looking at to help diagnose this?

Thanks
Paul
--
Paul Haldane
Information Systems and Services
Newcastle University