
Re: [HTCondor-users] How to forbid job restarts



After an extensive web search, I cannot seem to find an answer to a simple question: how do I forbid HTCondor to restart my jobs?

[...]

How can I tell HTCondor that it is forbidden to restart jobs and all the jobs should be allowed to finish no matter how long it takes?

Generally speaking, whether or not a job is _interrupted_ is up to the
administrator of the startd on which the job is run (and the vagaries
of random failures).  Because of the distributed nature of HTCondor, it's
not possible to ensure that a job is only ever started once (a startd could
fall off the network after it receives the job but before it starts it,
for example), but see the following for a discussion of the best that you
can do as a job owner to prevent your job from _restarting_.

https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAvoidJobRestarts
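
For illustration only, here is a rough sketch of the kind of submit-file settings that approach relies on. It assumes the standard job ClassAd attributes NumJobStarts and JobStatus (1 means idle); the exact expressions are my assumption, so defer to the wiki page above for the recommended recipe and its caveats.

```
# Only match a machine if this job has never been started before.
requirements    = NumJobStarts == 0

# If the job has been started once and is idle again (JobStatus == 1),
# remove it from the queue instead of letting it run a second time.
periodic_remove = JobStatus == 1 && NumJobStarts > 0
```

As noted above, this only limits restarts from the job owner's side; it cannot guarantee that a job is started exactly once, and it does not stop the startd from interrupting a running job.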

What could be the reason the jobs started to restart execution periodically when run as part of a DAG?

	That is indeed a very good question.

I am the administrator of my HTCondor cluster, so I am sure that neither the HTCondor configuration parameters nor the individual job submit files were changed.

What happens if you run a DAG like the one which is failing, but whose jobs just sleep for 75 minutes?
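
A minimal sketch of such a test, assuming a two-node chain (the shape of the real DAG is not given in this thread, so mirror it as needed). The file names sleep.sub and sleep.dag and the node names A and B are hypothetical, and /bin/sleep is assumed to exist on the execute machines.

```
# sleep.sub: each job just sleeps for 75 minutes (4500 seconds)
universe            = vanilla
executable          = /bin/sleep
transfer_executable = false
arguments           = 4500
log                 = sleep.log
queue
```

```
# sleep.dag: same submit file for both nodes, B runs after A
JOB A sleep.sub
JOB B sleep.sub
PARENT A CHILD B
```

Submit it with "condor_submit_dag sleep.dag" and check sleep.log for eviction or repeated execute events. If the sleep jobs are restarted as well, the cause is probably not specific to the real jobs.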

- ToddM