
Re: [HTCondor-users] kill jobs which fork too many child processes



In your Condor configuration, do something like this, and pick your own maximum process count:

USER_JOB_WRAPPER = /etc/condor/user_job_wrapper.sh
----------------------
[ball]$ cat /etc/condor/user_job_wrapper.sh
#!/bin/sh

# limit user to 800 processes in this job
ulimit -u 800

# exec the job
exec "$@"

# only get here if exec fails
error=$?
echo "Failed to exec($error): $*" > "$_CONDOR_WRAPPER_ERROR_FILE"
exit 1
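
Before rolling the wrapper out, it can help to confirm that the chosen limit is accepted on the execute node. The following is only a sketch: the function name effective_nproc_limit is made up for illustration, and it assumes bash (the -u option of ulimit is not specified by POSIX) and a hard limit that is not already below the requested value.

```shell
#!/bin/bash
# Hypothetical helper (not part of HTCondor): print the process limit that
# a job would inherit after `ulimit -u N`.
effective_nproc_limit() {
    # The parentheses start a subshell, so the calling shell keeps its
    # own limit; only the subshell is restricted.
    ( ulimit -u "$1" && ulimit -u )
}

# Print the limit a wrapped job would see for the value used above.
effective_nproc_limit 800
```

Keep in mind that this limit is enforced by the kernel per user ID, so all processes the same user has on the machine count against it, not only those of one job.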

bob

On 3/6/2015 11:35 AM, Werner Hack wrote:
Hi Todd,

thank you for your advice and the helpful link.
I have set up Condor to use cgroups.
But I am concerned about jobs that flood the host with "thousands" of processes, as you said,
so that other jobs or even the whole system suffer.
Maybe your hint about "NumPids" can help me in this case.
I hope I won't need it, since cgroups seem to manage the situation quite well.

Best
Werner


On 03/05/2015 09:46 PM, Todd Tannenbaum wrote:
On 3/5/2015 9:27 AM, Werner Hack wrote:
Hi all,

Is it possible to instruct Condor to kill jobs that create so many child processes
that the number of processes exceeds the cores requested for the job
(or even the total number of cores in the system)?

Sure, I can use ASSIGN_CPU_AFFINITY to restrict these jobs to the number of cores they have requested.
But the system load still increases because of the process management overhead.
And it's not fair to other users.
So I want to put these "bad" jobs on hold.
How can I do this in Condor?

Hi Werner,

Yes, it is possible to put jobs on hold whose number of processes exceeds the requested cores for the job.  But for the record, I don't think it is a good idea. I would strongly encourage you to consider just how many cores the job is actually using, and not worry about how many processes are involved.  Adding a process to the pid table is cheap... or are you dealing with a job that starts tens of thousands of processes?

To configure your pool to limit the core usage of jobs to what they requested, see
  https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitCpuUsage
I really encourage you to consider one of the recipes on this wiki page.

But if you really, really, really want to ignore my advice and configure HTCondor to do what you asked above, the trick is that the slot ad advertises an attribute called "NumPids".  Combining that with the recipe example on the how-to link above, I came up with the following:

# If the number of processes launched by the job exceeds the
# number of cores requested, immediately put the
# job on hold with a helpful message in job attribute "HoldReason".
# Note that to place the job on hold, we first eliminate any
# retirement time and preempt the job.
#
CPU_EXCEEDED = (NumPids > Cpus)
PREEMPT = ($(PREEMPT:False)) || $(CPU_EXCEEDED)
MAXJOBRETIREMENTTIME = ifthenelse($(CPU_EXCEEDED),0,$(MAXJOBRETIREMENTTIME:0))
WANT_SUSPEND = ($(WANT_SUSPEND:False)) && $(CPU_EXCEEDED) =!= TRUE
WANT_HOLD = ($(WANT_HOLD:False)) || $(CPU_EXCEEDED)
WANT_HOLD_REASON = ifThenElse($(CPU_EXCEEDED), \
     "num of processes exceeded request_cpus", \
     $(WANT_HOLD_REASON:UNDEFINED))
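
To make the comparison in CPU_EXCEEDED concrete, here is a small shell sketch of the same test. The function name cpu_exceeded is hypothetical and the numbers are invented; it only mirrors the strict "greater than" in the expression and does not talk to HTCondor.

```shell
#!/bin/bash
# Hypothetical helper mirroring CPU_EXCEEDED = (NumPids > Cpus).
# Usage: cpu_exceeded NUMPIDS CPUS   -> prints "true" or "false"
cpu_exceeded() {
    if [ "$1" -gt "$2" ]; then
        echo true
    else
        echo false
    fi
}

cpu_exceeded 4 4   # equal counts do not trigger the hold
cpu_exceeded 2 1   # a single extra process is already enough

# On a live pool the real values can be listed per slot, assuming the
# slot ads carry NumPids as described above:
#   condor_status -af Name Cpus NumPids
```

The two calls print "false" and then "true". The second case is worth noting: if the process count includes helper processes such as a shell script that launches the real program, a single-core job could trip the policy without doing anything unusual.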

Thanks in advance
Werner

Hope the above helps.

regards,
Todd




      
