
Re: [HTCondor-users] kill jobs which fork too many child processes



On 3/5/2015 9:27 AM, Werner Hack wrote:
> Hi all,
> 
> Is it possible to instruct Condor to kill jobs that create a lot of child processes
> and the number of these processes exceed the requested cores for the job
> (or even the total amount of cores in the system)?
> 
> Sure, I can use ASSIGN_CPU_AFFINITY to restrict these jobs to the amount of cores they have requested.
> But nevertheless system load is increasing this way due to process management.
> And it's not fair against other users.
> So I want to put these "bad" jobs on hold.
> How can I do this in Condor?
> 

Hi Werner,

Yes, it is possible to put jobs on hold when their number of processes exceeds the number of cores requested for the job.  But for the record, I don't think it is a good idea. I would strongly encourage you to consider just how many cores the job is actually using, and not worry about how many processes are involved.  Adding a process to the pid table is cheap... or are you dealing with a job that starts tens of thousands of processes?

To configure your pool to limit the core usage of jobs to what they requested, see
  https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitCpuUsage
I really encourage you to consider one of the recipes on this wiki page.

But if you really really really want to ignore my advice and configure HTCondor to do what you asked above, the trick is that the slot ad advertises an attribute called "NumPids".  Combining that with the recipe example on the how-to link above, I came up with the following:

# If the number of processes launched by the job exceeds the
# number of cores requested, immediately put the
# job on hold with a helpful message in job attribute "HoldReason".
# Note that to place the job on hold, we first eliminate any
# retirement time and preempt the job.
#
CPU_EXCEEDED = (NumPids > Cpus)
PREEMPT = ($(PREEMPT:False)) || $(CPU_EXCEEDED)
MAXJOBRETIREMENTTIME = ifthenelse($(CPU_EXCEEDED),0,$(MAXJOBRETIREMENTTIME:0))
WANT_SUSPEND = ($(WANT_SUSPEND:False)) && $(CPU_EXCEEDED) =!= TRUE
WANT_HOLD = ($(WANT_HOLD:False)) || $(CPU_EXCEEDED)
WANT_HOLD_REASON = ifThenElse($(CPU_EXCEEDED), \
     "num of processes exceeded request_cpus", \
     $(WANT_HOLD_REASON:UNDEFINED))
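
A looser variant, offered only as a sketch: many well-behaved jobs start a wrapper shell or a few short-lived helper processes, so a strict NumPids > Cpus test may hold jobs you did not mean to catch. Assuming the same NumPids and Cpus attributes as in the recipe above, you could allow some slack by using the definition below in place of the CPU_EXCEEDED line. PID_LIMIT_FACTOR is a macro name made up for this example, not a built-in HTCondor knob, and the value 4 is arbitrary.

```
# PID_LIMIT_FACTOR is a hypothetical macro name, not an HTCondor setting.
# It is the number of processes tolerated per requested core; 4 is an
# arbitrary example value, tune it for your pool.
PID_LIMIT_FACTOR = 4
# Flag the job only when its process count exceeds the tolerated multiple
# of the cores it requested.
CPU_EXCEEDED = (NumPids > ($(PID_LIMIT_FACTOR) * Cpus))
```

The other lines of the recipe (PREEMPT, MAXJOBRETIREMENTTIME, WANT_SUSPEND, WANT_HOLD, WANT_HOLD_REASON) would stay as they are, since they only refer to $(CPU_EXCEEDED).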

> Thanks in advance
> Werner
>

Hope the above helps.

regards,
Todd