
Re: [HTCondor-users] Job CPU usage updates



Thanks Todd. I have prototyped a solution along these lines, and it seems to work fine.  For example, I put these lines into the condor_config of a processing node:

STARTER_UPDATE_INTERVAL = 60
RUN_TIME = (CurrentTime - EnteredCurrentStatus)
CPU_USAGE = (RemoteUserCpu + RemoteSysCpu)

LOW_CPU = ((LowCPUTimeout =!= UNDEFINED) && \
            (LowCPULimit =!= UNDEFINED) && \
            ($(RUN_TIME) > ($(STARTER_UPDATE_INTERVAL)+15)) && \
            ($(RUN_TIME) > LowCPUTimeout) && \
            (($(CPU_USAGE)*100) / $(RUN_TIME) < LowCPULimit))

PREEMPT = ($(PREEMPT) || $(LOW_CPU))
WANT_HOLD = ($(LOW_CPU))
WANT_HOLD_REASON = ifThenElse($(LOW_CPU), "Job is using low CPU", undefined)

The parameters LowCPUTimeout (in seconds) and LowCPULimit (as a percentage) are passed in as custom ClassAd attributes from the job (taking the hint from your reply to another question earlier this week). If a job does not define them, LOW_CPU evaluates to false and the job will never be held.

We still need to wait until the first update from starter to startd has been done, but on the processing node we can achieve that by defining STARTER_UPDATE_INTERVAL explicitly, and then not evaluating LOW_CPU until that interval has been exceeded (plus a few seconds padding as per Ben's suggestion).
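
For illustration only, the same logic can be written out in Python as a rough sketch. The function name low_cpu and the PADDING constant are my own labels, not HTCondor names, and None stands in for UNDEFINED.

```python
from typing import Optional

STARTER_UPDATE_INTERVAL = 60  # seconds, same value as in the config above
PADDING = 15                  # seconds of slack after the first starter update


def low_cpu(run_time: float, cpu_usage: float,
            timeout: Optional[float], limit: Optional[float]) -> bool:
    """Mirror of the LOW_CPU expression; hypothetical helper, not an HTCondor API.

    run_time  -- seconds since the job entered its current status
    cpu_usage -- RemoteUserCpu + RemoteSysCpu, in seconds
    timeout   -- LowCPUTimeout from the job ad, or None if undefined
    limit     -- LowCPULimit (percent) from the job ad, or None if undefined
    """
    # Jobs that do not define both attributes are never flagged.
    if timeout is None or limit is None:
        return False
    # Do not judge the job before the first starter update has arrived.
    if run_time <= STARTER_UPDATE_INTERVAL + PADDING:
        return False
    # Respect the grace period requested by the job itself.
    if run_time <= timeout:
        return False
    # Average CPU usage over the whole run, as a percentage.
    return (cpu_usage * 100) / run_time < limit
```

So a job that has run for 120 s with 1 s of CPU time, LowCPUTimeout = 100 and LowCPULimit = 10 would be flagged, while the same job with 60 s of CPU time would not.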

On the submitting node, the job then includes periodic_remove and periodic_release expressions that evaluate the reason for the hold, and then either remove the job or release it depending on whether it should be resubmitted for execution on another machine.

This is very effective at removing a job that uses little or no CPU from the get-go, e.g. either because it failed to start the executable, or perhaps because some problem with input data caused an unexpected crash.

But where it falls down slightly is if the job initially processes normally, consuming a lot of CPU, but then hangs up after a few minutes. It will then take a fair while for the average CPU usage to dip below the threshold for the job to be assessed as being hung. So I was thinking it would be good if the expression could assess CPU usage over a limited time slice, e.g. the last number of seconds as defined by LowCPUTimeout. I've dug through the manual trying to find something that might help, without success. Is there some way that it could be done...?
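
To put a rough number on that delay, here is a small Python sketch. It assumes the job uses 100% of one core until it hangs and no CPU afterwards, and it ignores LowCPUTimeout and the update intervals. Both function names are hypothetical. The second function shows the kind of time-slice calculation I have in mind; it needs two CPU samples, which the config expression above does not keep.

```python
def seconds_until_flagged(busy_seconds: float, limit_percent: float) -> float:
    """Seconds after the hang before the whole-run average drops below the limit.

    Assumes 100% CPU for busy_seconds, then 0% (an illustrative assumption).
    """
    # Whole-run average at elapsed time t is busy_seconds * 100 / t,
    # which falls below limit_percent once t > busy_seconds * 100 / limit_percent.
    flagged_at = busy_seconds * 100.0 / limit_percent
    return max(0.0, flagged_at - busy_seconds)


def windowed_percent(cpu_prev: float, cpu_now: float,
                     t_prev: float, t_now: float) -> float:
    """CPU usage (percent) over the slice between two samples."""
    return (cpu_now - cpu_prev) * 100.0 / (t_now - t_prev)


if __name__ == "__main__":
    # Job works for 300 s, then hangs; limit is 10%.
    print(seconds_until_flagged(300, 10))
    # CPU time is unchanged between two samples taken 60 s apart.
    print(windowed_percent(300, 300, 300, 360))
```

Under those assumptions, a job that works for 300 s and then hangs is only flagged 2700 s after the hang with a 10% limit, whereas its usage over the last 60 s is already 0%.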

Thanks,
David






-----Original Message-----
From: Todd Tannenbaum [mailto:tannenba@xxxxxxxxxxx]
Sent: Tuesday, September 24, 2013 19:25
To: HTCondor-Users Mail List
Cc: Wilkins, David
Subject: Re: [HTCondor-users] Job CPU usage updates

On 9/24/2013 12:26 AM, Wilkins, David wrote:
>
> Hello.
>
> We are considering mechanisms to terminate jobs that are in the
> running state but consuming very little CPU, on the assumption that
> such jobs are hung up in some way, perhaps with some pop-up error
> alert that is waiting on user input.
>
> This could be done by including a periodic_remove expression in the
> submit file, making use of the RemoteUserCpu ClassAd to compare CPU
> usage to job elapsed time.
>
> However, RemoteUserCpu is only updated at the frequencies defined by
> STARTER_UPDATE_INTERVAL on the processing node and then
> SHADOW_QUEUE_UPDATE_INTERVAL on the submitting node, defaulting to 5
> min and 15 min respectively. Before evaluating any periodic_remove
> expression, we need to know that at least the first pair of updates
> has occurred, since prior to that it will simply remain at 0.
>

Looks like Ben gave you a potential solution, but another one comes to mind: if you also control the configuration of your execute machines, you could have the "remove jobs using little CPU" policy enforced via the startd on the execute side.  Perhaps things could even be simpler that way, e.g. kill if low load average 5 minutes after starting (assuming a correctly running job keeps the load average high).

Normally when an execute node policy kicks off a job (via PREEMPT), the job goes back to Idle and is re-run.  The trick to getting the job removed is to leverage the WANT_HOLD config knob, which is an expression evaluated on the execute machine that will place the job on hold when true.  See

http://research.cs.wisc.edu/htcondor/manual/v8.1/3_3Configuration.html#19100

You could even specify a nice little hold reason/code that explains why the job went on hold ("not using any cpu after 5 minutes").  Then if you want jobs held this way to be automatically removed, set up periodic_remove (or system_periodic_remove).

Just a thought,
Todd