Re: [HTCondor-users] How to forbid job restarts

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Hello Stefano,

Thank you for your reply!

According to the documentation, it seems that the RETRY command does not address my problem:

https://htcondor.readthedocs.io/en/latest/users-manual/dagman-workflows.html#retrying-failed-nodes

since it controls the “number of times to retry the node after failure”. I do not have any such commands in my DAG file, and in my case the jobs do not fail. If I pick a job that has been restarted multiple times and run it interactively on the same machine, it finishes just fine. Instead, HTCondor restarts the jobs periodically (about every 60-70 minutes) before they can finish, and this is what I want to prevent. Some of my jobs are shorter than others, and the shorter jobs finish successfully if they can complete within an hour. The problem is that after some time only the longer jobs remain in the queue and all progress stalls, since HTCondor restarts the longer jobs in a loop indefinitely. I do not want to hold them or remove them. I need them to finish.

Best,

Siarhei.

From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
Sent: Monday, 11 January, 2021 13:51
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Vaurynovich, Siarhei <siarhei.vaurynovich@xxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to forbid job restarts

Hello, I am not sure this helps in your case, but anyway:

In the dag file one can specify:

JOB A A.sub
RETRY A 5 # see also UNLESS-EXIT: do not retry when the node exits with a given code

Probably RETRY A 0 would disable restarts (in case of DAG jobs).

For general jobs I've set the following job transform rule in the schedd:

JOB_TRANSFORM_NoRestart @=end
   REQUIREMENTS True
   if defined My.Requirements
      SET Requirements (NumJobStarts == 0) && ( $(My.Requirements) )
   else
      SET Requirements (NumJobStarts == 0)
   endif
@end

SYSTEM_PERIODIC_HOLD = ( $(SYSTEM_PERIODIC_HOLD:False) || (NumJobStarts == 1 && JobStatus == 1) )
SYSTEM_PERIODIC_REMOVE = (JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*6)
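
As a rough illustration of how the three expressions above interact, here is a small Python model of the policy. It does not call HTCondor at all; the `Job` class and the function names are hypothetical and exist only for this sketch. It assumes the standard JobStatus codes (1 = idle, 5 = held) and mirrors the expressions as written: a job may match only before its first start, a job that is idle again after one start is put on hold, and a job held for more than six hours is removed.

```python
from dataclasses import dataclass

# HTCondor JobStatus codes used by the expressions above.
IDLE, RUNNING, HELD = 1, 2, 5


@dataclass
class Job:
    """Hypothetical stand-in for the job ClassAd attributes the policy reads."""
    num_job_starts: int          # NumJobStarts
    job_status: int              # JobStatus
    entered_current_status: int  # EnteredCurrentStatus, in epoch seconds


def can_match(job, original_requirements=True):
    # Transformed Requirements: (NumJobStarts == 0) && (original Requirements)
    return job.num_job_starts == 0 and original_requirements


def should_hold(job, existing_policy=False):
    # SYSTEM_PERIODIC_HOLD: any existing policy, OR started once and idle again
    return existing_policy or (job.num_job_starts == 1 and job.job_status == IDLE)


def should_remove(job, now):
    # SYSTEM_PERIODIC_REMOVE: held for more than six hours
    return job.job_status == HELD and now - job.entered_current_status > 3600 * 6


fresh = Job(num_job_starts=0, job_status=IDLE, entered_current_status=0)
evicted = Job(num_job_starts=1, job_status=IDLE, entered_current_status=1000)
held = Job(num_job_starts=1, job_status=HELD, entered_current_status=1000)
```

In this model `fresh` can match, while `evicted` can no longer match and becomes eligible for hold. Note that the policy stops a second start by holding and eventually removing the job, so it does not by itself let an interrupted job run to completion.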

Stefano

On 11/01/21 19:18, Vaurynovich, Siarhei wrote:

Hello,

After an extensive web search, I cannot find an answer to a simple question: how do I forbid HTCondor from restarting my jobs?

I have a type of job that I used to run as independent jobs, and HTCondor always allowed them to finish. I have upgraded the process to be more efficient, in theory, by running those jobs as a DAG, which consists of multiple (hundreds) of independent graphs (i.e. no parent/child links between them). Now HTCondor does not allow the jobs to finish: it keeps restarting them after about an hour of running, before they can complete (NumJobStarts keeps incrementing and the run time of a job as seen in Linux top keeps being reset to zero).

How can I tell HTCondor that it is forbidden to restart jobs and all the jobs should be allowed to finish no matter how long it takes?

What could be the reason the jobs started to restart execution periodically when run as part of a DAG?

I am the administrator of my HTCondor cluster, so I am sure that neither the HTCondor configuration parameters nor the individual job submit files were changed.

Thank you very much for your help,

Siarhei.
