
Re: [HTCondor-users] Stop Vanilla jobs from eviction/restart



On 6/20/2013 1:41 PM, Prem Kumar wrote:
Hi Todd, thank you for your response.

I checked all of the settings in the link that you shared, and to my
surprise they are already exactly what they need to be to disable
preemption.

Did you remember to run condor_reconfig -all (from a trusted machine, i.e. your central manager) after making the config file edits? condor_config_val -dump just reads from the config file, so if the file has been edited more recently than the last reconfig, its output may not match what the running daemons are actually using.

Also, do you have the same config file setup on all nodes, or does your CM (central manager) use a different config file than your execute nodes?

Could the job restarts have been from before you made the config changes?

Could the job restarts be a result of something outside of HTCondor's control, such as reboot of an execute node or restart of the HTCondor service?

Could the job restarts be a result of the jobs going on hold (for some error reason like NFS server temporarily being down) and then released?

What version of HTCondor are you running?

If you are running v7.8 or earlier and you never want to interrupt a running job, make certain that in your central manager's condor_config you have:
   PREEMPTION_REQUIREMENTS = False
and in the condor_config on all of your execute nodes you have:
   PREEMPT = False
   KILL = False
   RANK = 0
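
As a rough way to double-check the settings above on each machine, here is a minimal Python sketch. It assumes you have the configuration available as plain "NAME = value" lines (for example, saved output from condor_config_val -dump). The parser is deliberately simplistic, and the function and dictionary names are made up for illustration; they are not part of HTCondor.

```python
# Sketch only: a deliberately simple "NAME = value" parser. It ignores
# macros, includes, and line continuations that real HTCondor config allows.

# Settings from the message above, split by where they belong.
CENTRAL_MANAGER = {"PREEMPTION_REQUIREMENTS": "FALSE"}
EXECUTE_NODE = {"PREEMPT": "FALSE", "KILL": "FALSE", "RANK": "0"}


def parse_config(text):
    """Return {NAME: value} from 'NAME = value' lines; later lines win."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        # Names are upper-cased so the lookup is case-insensitive.
        settings[name.strip().upper()] = value.strip()
    return settings


def find_mismatches(text, expected):
    """Return {NAME: actual_or_None} for every setting that differs."""
    actual = parse_config(text)
    return {
        name: actual.get(name)
        for name, want in expected.items()
        if (actual.get(name) or "").upper() != want
    }
```

Calling find_mismatches(config_text, EXECUTE_NODE) on an execute node's settings should return an empty dict when all three knobs match; a missing knob shows up with the value None.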

If you are running HTCondor v8.0+ and you never want to interrupt a running job, life can be simpler: I would suggest making certain that the condor_config on all of your execute nodes has something like
  MAXJOBRETIREMENTTIME = 172800
which specifies how many seconds a job can run uninterrupted (172800 is 2 days; set it to whatever suits your jobs).
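
Since the value is given in seconds, a small helper can make the arithmetic less error-prone. The following is just an illustrative Python sketch; the function names are hypothetical and nothing here is provided by HTCondor itself.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def retirement_seconds(days):
    """Convert a whole or fractional number of days to whole seconds."""
    return int(round(days * SECONDS_PER_DAY))


def retirement_line(days):
    """Format the setting as it would appear in condor_config."""
    return "MAXJOBRETIREMENTTIME = {}".format(retirement_seconds(days))
```

For example, retirement_line(2) gives the same line shown above, and retirement_line(7) would give the value for a one-week limit.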

In the current developer series, we are adding information to the startd ClassAd about how many times a job was interrupted by HTCondor. This will make it easier to confirm that the system is indeed doing what you think you are telling it to do :).

regards
Todd