
Re: [HTCondor-users] Job CPU usage updates



On 9/24/2013 12:26 AM, Wilkins, David wrote:

Hello.

We are considering mechanisms to terminate jobs that are in the
running state but consuming very little CPU, on the assumption that
such jobs are hung up in some way, perhaps with some pop-up error
alert that is waiting on user input.

This could be done by including a periodic_remove expression in the
submit file, making use of the RemoteUserCpu ClassAd to compare CPU
usage to job elapsed time.

However, RemoteUserCpu is only updated at the frequencies defined by
STARTER_UPDATE_INTERVAL on the processing node and then
SHADOW_QUEUE_UPDATE_INTERVAL on the submitting node, defaulting to 5
min and 15 min respectively. Before evaluating any periodic_remove
expression, we need to know that at least the first pair of updates
has occurred, since prior to that RemoteUserCpu will simply remain at 0.
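A minimal submit-file sketch of this idea (the 30-minute wall-clock and 60-second CPU thresholds are illustrative assumptions, not recommendations):

```
# Sketch: remove a running job (JobStatus == 2) that has accumulated
# under 60 s of CPU time after 30 min of wall-clock time. The elapsed-time
# guard keeps the expression from firing while RemoteUserCpu is still 0,
# i.e. before the first starter/shadow updates have propagated.
periodic_remove = (JobStatus == 2) && \
                  ((time() - JobCurrentStartDate) > 1800) && \
                  (RemoteUserCpu < 60)
```

The elapsed-time guard should be longer than the sum of the two update intervals, so the expression only consults RemoteUserCpu after it has had a chance to be updated at least once.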


Looks like Ben gave you a potential solution, but another one comes to mind: if you also control the configuration of your execute machines, you could enforce the "remove jobs using little CPU" policy via the startd on the execute side. Things could even be simpler that way, e.g. kill the job if the load average is low five minutes after it starts (assuming a correctly running job keeps the load average high). Normally, when an execute-node policy kicks off a job (via PREEMPT), the job goes back to Idle and is re-run. The trick to getting the job removed instead is to leverage the WANT_HOLD config knob, an expression evaluated on the execute machine that places the job on hold when true. See

http://research.cs.wisc.edu/htcondor/manual/v8.1/3_3Configuration.html#19100

You could even specify a nice little hold reason/code that explains why the job went on hold ("not using any CPU after 5 minutes"). Then, if you want jobs held this way to be removed automatically, set up periodic_remove (or system_periodic_remove).
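A startd-side sketch along those lines (the 5-minute window, the 0.3 load threshold, and subcode 101 are illustrative assumptions; a production policy would be folded into the machine's existing policy expressions):

```
# Execute-node config sketch: hold jobs that keep the machine nearly idle
# for 5 minutes after the claim becomes Busy. CondorLoadAvg is the portion
# of the load average attributed to HTCondor jobs.
ACTIVITY_AGE = (time() - EnteredCurrentActivity)
WANT_HOLD = (State == "Claimed") && (Activity == "Busy") && \
            ($(ACTIVITY_AGE) > 300) && (CondorLoadAvg < 0.3)
WANT_HOLD_REASON = "not using any CPU after 5 minutes"
WANT_HOLD_SUBCODE = 101

# Submit-side (schedd) companion: automatically remove jobs that were
# held with that subcode. JobStatus == 5 means Held.
SYSTEM_PERIODIC_REMOVE = (JobStatus == 5) && (HoldReasonSubCode == 101)
```

Matching on HoldReasonSubCode keeps the removal policy from sweeping up jobs that were held for unrelated reasons, such as a user running condor_hold by hand.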

Just a thought,
Todd