
Re: [HTCondor-users] slot cool down time



On 23/08/21 16:39, Beyer, Christoph wrote:
Hi,

I would like a certain type of slot to cool down after a job finishes or fails. The reason is that these are interactive jobs, and I want the user's second try (a completely new job) not to run on the same node where the previous attempt did not succeed.

In my tiny brain I thought that something like:

START = $(START) && ((time() - EnteredCurrentState) > 600)

would do the trick just fine and effectively cause a 10-minute waiting time, but apparently it does not. Any suggestions?

Best
christoph

Hello Christoph,
my understanding is that EnteredCurrentState
is a job ClassAd attribute of a "not yet started job", so that clause would match jobs that have been pending for more than 10 minutes.

I had a similar problem (I did not want a single-core job to start immediately after a multicore job had left). I defined a custom machine ClassAd attribute "MC_GRACE" whose numeric value is set by a startd cron job running on the node.

In this case, the idea is that the cron job (running every minute) keeps a numeric value (say x) set to 1.0 while a given condition is True (example: a job of Owner "foo" is running).
When the condition no longer holds, it scales the x value on each run; for example: x <-- 0.75 * x
Then you set the machine ClassAd attribute as, say: OKJOB = (x < 0.1).

In this case OKJOB would become True 9 minutes after the condition that keeps x at 1.0 has gone. I hope this idea helps, even though my explanation may be a bit confusing.
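
To make the arithmetic concrete, here is a minimal Python sketch of the decay logic only. It is not the actual cron script: the function names are made up for illustration, and it assumes one cron run per minute, a decay factor of 0.75, and a threshold of 0.1 as in the example above.

```python
def next_x(x, condition_active, decay=0.75):
    """One cron run: reset x to 1.0 while the condition holds, else decay it."""
    return 1.0 if condition_active else decay * x


def minutes_until_ok(decay=0.75, threshold=0.1):
    """Count cron runs (one per minute) until OKJOB = (x < threshold) is True."""
    x = 1.0          # value held while the condition was still active
    minutes = 0
    while not (x < threshold):
        x = next_x(x, condition_active=False, decay=decay)
        minutes += 1
    return minutes


if __name__ == "__main__":
    # 0.75**8 is just above 0.1 and 0.75**9 is below it, hence 9 runs.
    print(minutes_until_ok())
```

A different grace period can be obtained by changing either the decay factor or the threshold; with these values the slot is held back for 9 cron runs.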

Stefano