
[HTCondor-users] How to access the number of times a job has been put on hold and released



Is it possible to determine the number of times a job has been put on hold and then released?

There doesn't seem to be a job ClassAd attribute that shows this directly.

We quite often have users' submit files that put a job on hold if it has been running for longer than a specified time.
This is mainly to combat jobs that occasionally seem to fall into a black hole on execute nodes.
A periodic_release then allows them to try running somewhere else, hopefully successfully this time.
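
For illustration, the submit file lines in question look something along these lines. This is only a sketch: the one-hour run limit and the ten-minute release delay are made-up values, not our actual settings.

# Hold a running job (JobStatus == 2) once it has been running for over an hour (assumed limit)
periodic_hold = (JobStatus == 2) && ((time() - EnteredCurrentStatus) > 3600)
# Release a held job (JobStatus == 5) after ten minutes (assumed delay) so it can match elsewhere
periodic_release = (JobStatus == 5) && ((time() - EnteredCurrentStatus) > 600)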

We have a user who would like to remove a job if it has been put on hold and released more than X times.

For example, with an imaginary job ClassAd attribute NumHoldsReleases, we could change:

on_exit_remove = (ExitCode == 0) && (ExitBySignal == False)

to

on_exit_remove = ((ExitCode == 0) && (ExitBySignal == False)) || (NumHoldsReleases > 5)

Is there a way to achieve this?

Thanks.

Cheers

Greg