
Re: [HTCondor-users] Controlling Jobs - Disabling/preventing a job from being suspended



On 9/24/2013 7:36 PM, Andrey Kuznetsov wrote:
Hi,

The nodes are set up to suspend jobs based on various conditions.
I'm wondering whether it is possible to pass an argument or setting in the
condor submit file that disables SUSPEND mode?

I have a job that eats up a lot of RAM and cannot be checkpointed, so I
want to make sure that it finishes running no matter what.

I also do not want to modify the node's suspend settings unless it's to add
a conditional statement that I can then manipulate from the condor submit
file. (don't know if that's possible)


I assume you are running HTCondor v8.0.x or above... (you didn't say)

The WANT_SUSPEND knob in the config file controls whether suspension will happen on a node. The MAXJOBRETIREMENTTIME knob specifies a time X in seconds that essentially says "do not preempt or kill this job for any reason until it has run for at least X seconds". Both of these knobs are ClassAd expressions that can refer to any attribute in the job ad (including custom attributes).

So for the above, you could put in condor_config.local (on all machines):

    # If job attribute DoNotSuspendJob is explicitly set to True,
    # then do not allow suspend mode. If job does not say one way or
    # the other, allow suspend mode.
    WANT_SUSPEND = DoNotSuspendJob =!= True

    # If job attribute DoNotKillJob is explicitly set to True, then
    # never interrupt the job unless it has run for more than
    # 604800 seconds (i.e. a week, just to catch runaway jobs).
    MAXJOBRETIREMENTTIME = 604800 * ( DoNotKillJob =?= True )

After doing the above, don't forget to do "condor_reconfig -all" from your central manager.
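
A note on the operators above, in case they look unfamiliar: =?= and =!= are the ClassAd "is identical to" and "is not identical to" operators. Unlike == and !=, they never evaluate to Undefined, so a job that does not set the attribute at all still gets a definite True/False answer. Roughly, for the WANT_SUSPEND expression:

    DoNotSuspendJob in job ad    DoNotSuspendJob =!= True    DoNotSuspendJob != True
    True                         False                       False
    False                        True                        True
    (not set)                    True                        Undefined

Likewise, DoNotKillJob =?= True is False (not Undefined) for a job that never mentions DoNotKillJob, which is why ordinary jobs are unaffected.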

Now you could submit jobs w/ a submit file like so (note the + sign to insert a custom attribute):

   # Disable suspend mode, but job will still be preempted when
   # preempt becomes true
   executable = foo
   +DoNotSuspendJob = True
   queue

Or like this:

   # Disable suspend mode AND do not kill/preempt for any reason
   # once job is started, unless job has already run for a week
   executable = bar
   +DoNotSuspendJob = True
   +DoNotKillJob = True
   queue

Take these examples w/ a grain of salt; they are off the top of my head, but they should point you in the right direction. For more information, look up WANT_SUSPEND and/or MAXJOBRETIREMENTTIME in the index of the Manual.

BTW, maybe instead of DoNotKillJob being True/False, you could just have an integer attribute and allow the job submit file to specify how long it can run w/o any interruption instead of hard-coding to a week...
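
A rough sketch of that integer variant, again untested. The attribute name MaxUninterruptedSecs is just something I made up for illustration, and the one-week cap is an assumption carried over from the example above. In condor_config.local:

    # If the job ad carries an integer MaxUninterruptedSecs, honor it,
    # but never for more than a week (604800 seconds). Jobs that do
    # not set it (or set a non-integer) get no protection.
    MAXJOBRETIREMENTTIME = ifThenElse( isInteger(MaxUninterruptedSecs), \
        ifThenElse( MaxUninterruptedSecs > 604800, 604800, MaxUninterruptedSecs ), \
        0 )

And in the submit file:

    # Disable suspend mode AND ask for two days (172800 seconds)
    # without interruption
    executable = bar
    +DoNotSuspendJob = True
    +MaxUninterruptedSecs = 172800
    queue

The cap is there so a submit file cannot ask for an unlimited uninterrupted run; adjust or drop it to suit your pool's policy.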

regards,
Todd