
[HTCondor-users] startd job count limit to limit the damage of black holes



Hello Condor Community,

Is there any way to have the startd start only N jobs and then stop matching any more? For instance, I often want N=1, so that only one job can execute on a machine newly added to a cluster, though I can imagine other values of N being useful in some cases. A misconfiguration of a new node all too often causes jobs to fail quickly, so another job starts and fails, and so on. The node becomes a black hole that quickly drains our queue without doing anything useful. Limiting the total number of started jobs to 1 until the node is shown to run our jobs successfully would help me tremendously. Something like

START = (TotalJobsStarted < 2)  # where TotalJobsStarted is the missing piece that I've yet to find, so am seeking your help.
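To make the intended behaviour concrete, here is a minimal Python sketch of the gate I have in mind. It is only a toy model of the semantics, not HTCondor configuration: the class StartGate, the helper drain, and the counter are hypothetical names, and the exact threshold in a real START expression would depend on when such a counter is incremented.

```python
class StartGate:
    """Toy model of a startd that accepts at most `limit` job starts."""

    def __init__(self, limit=1):
        self.limit = limit
        # Hypothetical counter playing the role of TotalJobsStarted.
        self.total_jobs_started = 0

    def start_allowed(self):
        # Same shape as START = (TotalJobsStarted < N).
        return self.total_jobs_started < self.limit

    def try_start(self):
        """Start a job if the gate is open; return True if it started."""
        if not self.start_allowed():
            return False
        self.total_jobs_started += 1
        return True


def drain(queue_length, gate):
    """Count how many queued jobs a single node consumes before the gate closes."""
    consumed = 0
    for _ in range(queue_length):
        if not gate.try_start():
            break
        consumed += 1
    return consumed
```

With limit=1, a misconfigured node would consume exactly one job from a queue of any length, and the rest would stay idle for healthy machines.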

A different approach might be to add a lengthy delay between the time a job finishes and the time another job starts. With NUM_SLOTS = 1 and a delay of a few minutes between a job's immediate failure (which Condor sees only as a successful completion) and the start of a new job, I could manually detect the failure and shut down Condor on the black hole node until I figure out the cause and try again. The submit option "keep_claim_idle" looks like it does something along these lines, but it is generally undesirable, and I would rather have this on the startd side than on the submit side. Is there such an option or ClassAd attribute for the startd? (It wasn't clear to me whether setting JOB_START_DELAY to a large value would do the trick, so I tried it, and it did not help.)
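As a rough illustration of why a delay would help, the following Python sketch models a single slot that refuses new starts for a fixed time after each job exits. Everything here is hypothetical (CooldownGate, starts_in_window, and the numbers used); it models the behaviour I am after and says nothing about how HTCondor implements any existing knob.

```python
class CooldownGate:
    """Toy model of a slot that waits `delay_seconds` after a job exits."""

    def __init__(self, delay_seconds):
        self.delay_seconds = delay_seconds
        self.last_exit_time = None  # no job has exited yet

    def job_exited(self, now):
        self.last_exit_time = now

    def start_allowed(self, now):
        if self.last_exit_time is None:
            return True
        return (now - self.last_exit_time) >= self.delay_seconds


def starts_in_window(gate, job_runtime, window):
    """Count job starts in `window` seconds when every job exits after
    `job_runtime` seconds (assumed > 0) and a new job is always waiting."""
    now = 0
    starts = 0
    while now < window:
        if gate.start_allowed(now):
            starts += 1
            now += job_runtime          # the job runs, then exits
            gate.job_exited(now)
        else:
            # Jump ahead to the moment the cooldown expires.
            now = gate.last_exit_time + gate.delay_seconds
    return starts
```

In this model, jobs that fail after 5 seconds would burn through 720 queue entries per hour with no delay, but only 12 with a 300-second delay, which leaves time to notice the failure and take the node out.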

Btw, I found https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=AvoidingBlackHoles, but I don't see how it helps, since I'd rather not drain the queue in the first place. If a single job fails, our submission system will (eventually) detect it and resubmit it without any significant loss, but if the entire queue is emptied because all idle jobs go to the black hole, then we start losing CPU cycles.

Thank you for your time,

-Wayne