
Re: [HTCondor-users] 2 questions about job retry




Hello Cole,

Thanks for your answers.

DAGMan RETRY is not very tunable. Its only two features are retrying a node up to n times and skipping the retry when the node exits with the value given by the optional UNLESS-EXIT keyword.
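
For reference, the syntax in a DAG file looks roughly like this (node
name, retry count and exit value are just placeholders):

  JOB NodeA nodea.sub
  RETRY NodeA 3 UNLESS-EXIT 2

But to elaborate on your questions: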

 1. I didn't really find a good way to set a delay before attempting to
    retry a node. You could make a config file containing
    DAGMAN_SUBMIT_DELAY=<integer> and point to it with the CONFIG line
    in your DAG file. The issue with this is that it affects all nodes,
    not just the nodes that are retrying (every node's submission is
    delayed by n seconds). If you explore your route of a POST script,
    I would beware of a couple of things. First, if you don't have
    DAGMAN_ALWAYS_RUN_POST=true in your config and there is a PRE
    script that fails, the POST script won't run at all. Second, if the
    POST script exits successfully then the node will be marked as
    completing successfully and in turn no retry will run, so you would
    want the script to exit non-zero if any earlier part failed (see
    the sketch just below). Adding an easy way to delay a retry may be
    helpful, so I will discuss with the team to see if we want to
    implement a specific feature for this.
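
To make this concrete, here is a rough, untested sketch of what the DAG
file and the POST script could look like; the node name, the file names
and the 30-second delay are only placeholders:

  # my.dag
  JOB NodeA nodea.sub
  SCRIPT POST NodeA post_retry.sh $RETURN
  RETRY NodeA 3

and post_retry.sh could be something like:

  #!/bin/sh
  # $1 is the $RETURN value passed in by DAGMan (the node job's exit code)
  ret="$1"
  if [ "$ret" -ne 0 ]; then
      sleep 30     # delay before DAGMan marks the node as failed and retries it
      exit "$ret"  # exit non-zero so the node still counts as failed
  fi
  exit 0

Since DAGMan waits for the POST script to finish before deciding whether
the node failed and needs a retry, the sleep effectively pushes back the
resubmission.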

That's great: thanks.

The use case I have in mind is jobs that run for a long time -- tens of minutes or more -- and are sent automatically to the Condor pool when some conditions are met. If such a job fails (immediately) because of a transient, unrelated computing problem, it may be worth delaying the retries by a few tens of seconds. That may be long enough for the transient problem to fix itself, while barely changing the overall duration of the job if the next retry succeeds. Without a delay, all the retries set in the DAG file can be consumed quickly, because the successive attempts all hit the same problem and fail one after the other.

 2. I don't have a solution for keeping a retried DAG node off the
    machine it already ran on, but I do have an explanation for why
    adding "requirements = Machine =!= LastRemoteHost" did not work.
    When a DAG node retries, the submit file is resubmitted to the
    condor system, resulting in a new cluster.proc job with a fresh
    job ad. Thus, LastRemoteHost doesn't exist in that job ad, because
    the new job hasn't run anywhere yet. Also, jobs can fail for lots
    of reasons, not just because of an execute machine failure. If you
    feel like your job is failing excessively, let me know and I can do
    my best to help solve the problem.
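
One direction that might be worth testing (this is an untested sketch):
condor_submit has its own retry mechanism, max_retries, which re-runs
the job within the same job ad instead of submitting a brand new one,
so LastRemoteHost should already be defined when a retry gets matched.
Something along these lines in the node's submit file (names are
placeholders):

  # nodea.sub
  executable   = my_program
  # retry within the same job ad instead of (or in addition to) DAGMan RETRY
  max_retries  = 3
  # on the first attempt LastRemoteHost is undefined, so =!= evaluates to
  # true; note that LastRemoteHost usually has the form "slot1@hostname",
  # so this comparison may need refining to really exclude the previous
  # machine
  requirements = (Machine =!= LastRemoteHost)
  queue

Whether the requirements are really re-evaluated against the updated
LastRemoteHost when max_retries re-matches the job is something to check
on your pool before relying on it.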

If a machine of a busy Condor pool has some problem, jobs that run on it crash (quickly), so that machine looks more "available": it then hosts more jobs, which crash as well, and so on. That is the kind of behavior I was trying to work around with my second question.

Cheers,

Nicolas

Best of luck,
Cole Bollig
------------------------------------------------------------------------
*From:* HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Nicolas Arnaud <nicolas.arnaud@xxxxxxxxxxxxxxx>
*Sent:* Friday, August 19, 2022 9:45 AM
*To:* HTCondor Users <htcondor-users@xxxxxxxxxxx>
*Subject:* [HTCondor-users] 2 questions about job retry

Hello,

I have a couple of questions about how to tune the retrying of a failed DAG job.

1) What's the best way to wait some seconds before attempting a retry?

I've thought of using a POST script that would have $RETURN among its
arguments and call sleep if $RETURN is not equal to 0, but I wonder
whether that would work and whether there is a simpler way to do
something similar.

2) When a job retries, I would like it *not* to run where the failed
attempt ran. Searching on the web led me to adding the line

requirements = Machine =!= LastRemoteHost

to the submit file that is called by the JOB command in the DAG file,
but that doesn't seem to work. More often than not, the job reruns in
the same place (same machine and same slot) as the failed attempt.

The Condor version I am using is

condor_version $CondorVersion: 9.0.11 Mar 12 2022 BuildID: 578027 PackageID: 9.0.11-1 $
$CondorPlatform: x86_64_CentOS7 $

Thanks in advance,

Nicolas

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/