
Re: [HTCondor-users] 2 questions about job retry

Hi Nicolas,

The machine ad contains the attributes RecentJobDurationAvg and
RecentJobDurationCount that may help you with your second issue.
From the manual [1]:

    RecentJobDurationAvg:
    The Average lifetime time of all jobs, not including time
    spent transferring files, that have exited in the last 20
    minutes. This attribute will be undefined if no job has exited
    in the last 20 minutes.

    RecentJobDurationCount:
    The total number of jobs used to calculate the
    RecentJobDurationAvg attribute. This is the total number of
    jobs that began execution and have exited in the last 20
    minutes.

These can be used to figure out if a machine has been starting
a lot of really short jobs recently.  In the OSG we define an
expression like this in the config file:

    IsBlackHole = IfThenElse(RecentJobDurationAvg =?= undefined,false,RecentJobDurationCount >= 10 && RecentJobDurationAvg < 180)

IsBlackHole evaluates to true if, in the last 20 minutes, at least
10 jobs exited after running for less than 3 minutes on average on
this machine.

Then, use STARTD_ATTRS [2] to put that expression into the machine ad:

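(The configuration line seems to be missing from this copy of the
message. A sketch, assuming the usual idiom of appending to the
existing list rather than overwriting it, would be:)

    STARTD_ATTRS = $(STARTD_ATTRS) IsBlackHole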

And then modify the machine's START expression to use IsBlackHole:

    START = $(START) && (IsBlackHole =!= true)

Thus, the machine will stop accepting new jobs if it's run too
many short jobs recently.

Note that IsBlackHole will reset itself after about 20 minutes
because RecentJobDurationAvg will become undefined after not
having run any jobs in that time.  You could consider using
DAEMON_SHUTDOWN [3] to actually turn off the startd instead
of just not accepting new jobs.
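
As a rough sketch of that alternative, assuming IsBlackHole is
already published in the machine ad as described above (this exact
expression is an illustration, not something tested on a real pool):

    # Assumption: shut the startd down once the black-hole test trips.
    STARTD.DAEMON_SHUTDOWN = (IsBlackHole =?= true)

As I understand the manual [3], a daemon that shuts down this way is
not restarted by the condor_master, so unlike the START approach the
machine would not rejoin the pool on its own after 20 minutes.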


[1] https://htcondor.readthedocs.io/en/latest/classad-attributes/machine-classad-attributes.html#RecentJobDurationAvg
[2] https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#%3CSUBSYS%3E_ATTRS
[3] https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#DAEMON_SHUTDOWN

On 8/19/2022 4:28 PM, Nicolas Arnaud wrote:

Hello Cole,

Thanks for your answers.

DAGMan RETRY is not very tunable. Its only two features are retrying
n times and not retrying if the node exits with the value specified by
the optional UNLESS-EXIT. But to elaborate on your questions:

  1. I didn't really find a good way to set a delay before attempting to
     retry a Node. You could make a config file with the expression
     DAGMAN_SUBMIT_DELAY=integer and use the CONFIG line in your dag
     file. The issue with this is that it affects all nodes, not just
     nodes that are retrying (all nodes will have their submission
     delayed by n seconds). If you explore your route of a POST script,
     I would beware of a couple of things. First, if you don't have
     DAGMAN_ALWAYS_RUN_POST=true in your config, then if there is a PRE
     script and it fails, the POST script won't run at all. Second, if
     the POST script exits successfully, then the Node will be marked as
     completing successfully and in turn will not be retried, so you
     would want that script to exit non-zero if other parts returned
     failure. Adding an easy way to delay a retry may be helpful, so I
     will discuss with the team to see if we want to implement a
     specific feature to help with this.

That's great: thanks.

The use case I can see is for jobs that last long -- tens of minutes or
more -- and are sent automatically to the Condor pool when some
conditions are met. If the job fails (immediately) because of a
transient unrelated computing problem, it may be worth delaying the
retries by some tens of seconds. That may be long enough for the
transient problem to fix itself, and it wouldn't change the overall
duration of the job by much if the next retry works. The alternative is
having all the retries set in the DAG file consumed quickly, because
the job (re)tries fail repeatedly as they all encounter the same problem.

  2. I don't have a solution to not running a dag node on the same
     machine, but I do have an explanation for why adding
     requirements = Machine =!= LastRemoteHost did not work. When a DAG
     node retries, the submit file is resubmitted to the condor system,
     resulting in a new cluster.proc.subproc job with a fresh job ad.
     Thus, LastRemoteHost doesn't exist in the job ad yet because that
     job hasn't run anywhere yet. Jobs can fail for lots of reasons, not
     just from execute machine failure. If you feel like your job is
     failing excessively, let me know and I can do my best to help solve
     the problem.

If a machine of a busy Condor pool has some problem, jobs that run on it
will crash (quickly) and so that machine will look more "available": it
will host more jobs which will crash as well, etc. That is the kind of
behavior I was trying to work around with my second question.



Best of luck,
Cole Bollig
*From:* HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of
Nicolas Arnaud <nicolas.arnaud@xxxxxxxxxxxxxxx>
*Sent:* Friday, August 19, 2022 9:45 AM
*To:* HTCondor Users <htcondor-users@xxxxxxxxxxx>
*Subject:* [HTCondor-users] 2 questions about job retry


I have a couple questions about how to tune the retry of a failed DAG job.

1) What's the best way to wait some seconds before attempting a retry?

I've thought of using a POST script that would have $RETURN among its
arguments and call sleep if $RETURN is not equal to 0, but I wonder
whether that would work and whether there is a simpler way to do
something similar.

2) When a job retries, I would like it *not* to run where the failed job
has run. Searching on the web led me to adding the line

requirements = Machine =!= LastRemoteHost

to the submit file that is called by the JOB command in the DAG file,
but that doesn't seem to work. More often than not, the job reruns in
the same place (same machine and same slot) as the failed try.

The Condor version I am using is

$CondorVersion: 9.0.11 Mar 12 2022 BuildID: 578027 PackageID: 9.0.11-1 $
$CondorPlatform: x86_64_CentOS7 $

Thanks in advance,