
Re: [HTCondor-users] How to forbid job restarts



Hello,

The new version of HTCondor now seems to work for me. The following steps made HTCondor 8.8.12-1.el7.x86_64 functional:
 * making the config files readable by everybody
 * changing ownership of the config files to user "condor"
 * uninstalling the condor-annex-ec2 rpm
 * rebooting
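
For anyone checking the "readable by everybody" step on a similar installation, a minimal Python sketch that lists configuration files lacking world-read permission might look like the following. The function name `unreadable_by_others` is made up for illustration, and the `/etc/condor` path mentioned afterwards is an assumption about where the configuration lives; neither comes from HTCondor itself.

```python
import stat
from pathlib import Path


def unreadable_by_others(config_root):
    """Return paths at or below config_root that other users cannot read.

    Files need the world-read bit. Directories need world-read and
    world-execute, otherwise their contents cannot be listed or opened.
    Only the permission bits are inspected; ACLs and SELinux are ignored.
    """
    root = Path(config_root)
    problems = []
    # Check the root directory itself, then everything below it.
    for path in [root, *sorted(root.rglob("*"))]:
        mode = path.stat().st_mode
        if stat.S_ISDIR(mode):
            needed = stat.S_IROTH | stat.S_IXOTH
        else:
            needed = stat.S_IROTH
        if mode & needed != needed:
            problems.append(path)
    return problems
```

Calling `unreadable_by_others("/etc/condor")` as root (so that every entry can be inspected) would return the paths whose permissions need fixing; an empty list suggests that file permissions are not the cause. Unreadable config files would be consistent with `condor_config_val` returning a value for root but "Not defined" for a regular user.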

Otherwise, I followed instructions here:

https://htcondor.readthedocs.io/en/latest/admin-manual/quick-start-condor-pool.html

Hopefully, it will help somebody.

Now, I will test if the periodic restarts of my jobs are gone.

Best,
Siarhei.


-----Original Message-----
From: Vaurynovich, Siarhei 
Sent: Wednesday, 13 January, 2021 18:41
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: RE: [HTCondor-users] How to forbid job restarts


Hello Todd,

Thank you so much for your reply!

> What happens if you run a DAG like the one which is failing, but whose jobs just sleep for 75 minutes?

After struggling with it for a few days, I decided today, out of desperation, to try a new version of Condor to see whether that would resolve the problem. The new version does not work yet (jobs now do not start at all), so I cannot run that test. However, my normal jobs used to finish successfully when run with input arguments that let them finish sooner.

Currently:
 * condor_q # works
 * condor_status # Error: can't find collector
 * condor_config_val condor_host
    If I run it as a regular user, it gives "Not defined: CONDOR_HOST".
    If I run it as root, I get the correct value, which I set in the configuration files.
 * condor_config_val -address "<AAA.BBB.CCC.DDD:9618>" CONDOR_HOST
    where AAA.BBB.CCC.DDD is the IP address of the central manager, gives the correct value even if I run it as a regular user.
 * nmap -Pn -p 9618 AAA.BBB.CCC.DDD
    Host is up (0.00021s latency).
    PORT     STATE SERVICE
    9618/tcp open  condor
    So, firewall/communication should not be a problem.

If this gives you any clue about what is failing, please let me know. I am a research person, not an IT expert. My manager will gut me and then hang me on a nearby tree if I do not fix it soon :-(

> > What could be the reason the jobs started to restart execution 
> > periodically when run as part of a DAG?

> 	That is indeed a very good question.

My main hypothesis is that some old job log files somehow confused Condor and put it into a wrong state, maybe by changing some internal files. For some reason, even completely unrelated jobs started to be restarted periodically, although those jobs had worked flawlessly for a few years, with many thousands of them executed.

Thank you very much. I hope somebody will be able to give me a lifesaving clue.
Siarhei.


-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Todd L Miller
Sent: Wednesday, 13 January, 2021 16:33
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to forbid job restarts


> After an extensive web-search, I do not seem to find an answer to a 
> simple question: how do I forbid HTCondor to restart my jobs?

[...]

> How can I tell HTCondor that it is forbidden to restart jobs and all 
> the jobs should be allowed to finish no matter how long it takes?

 	Generally speaking, whether or not a job is _interrupted_ is up to the administrator of the startd on which the job is run (and the vagaries of random failures).  Because of the distributed nature of HTCondor, it's not possible to ensure that a job is only ever started once (a startd could fall off the network after it receives the job but before it starts it, for example), but see the following for a discussion of the best that you can do as a job owner to prevent your job from _restarting_.

https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAvoidJobRestarts

> What could be the reason the jobs started to restart execution 
> periodically when run as part of a DAG?

 	That is indeed a very good question.

> I am the administrator of my HTCondor cluster, so I am sure that 
> neither HTCondor configuration parameters were changed, nor the 
> individual job submit files were changed.

 	What happens if you run a DAG like the one which is failing, but whose jobs just sleep for 75 minutes?

- ToddM
