Re: [HTCondor-users] spontaneous reboots after enabling cgroups

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Date: Fri, 16 Aug 2013 08:52:56 -0500

From: Brian Bockelman <bbockelm@xxxxxxxxxxx>

Subject: Re: [HTCondor-users] spontaneous reboots after enabling cgroups

On Aug 16, 2013, at 5:30 AM, Chris Filo Gorgolewski <krzysztof.gorgolewski@xxxxxxxxx> wrote:

On Fri, Aug 16, 2013 at 4:04 AM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:

Hi,

A few thoughts-
- Why do you want swap on your worker nodes? We found it much more useful to just disable swap and kill jobs when they went over their memory limit.
Yes, this is what I would like to do, but only for the Condor jobs.

- You can set the swappiness of the /condor cgroup to 0, disabling swap only for condor jobs and processes.
Ha, this would be perfect. However, I was wondering: if a job runs out of memory in this setup, would it just fail, or would it get preempted and sent back to the pool to be executed later on a machine with more memory?
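
For illustration, the swappiness setting discussed above might be applied roughly as follows. This is a minimal sketch, not something from the original thread: it assumes a cgroup v1 memory controller, and both the helper name set_swappiness and the example path /sys/fs/cgroup/memory/condor are assumptions (the mount point and cgroup name vary by distribution and HTCondor configuration). Writing to the real control file requires root.

```python
from pathlib import Path


def set_swappiness(cgroup_dir, value=0):
    """Write `value` to the memory.swappiness control file of a cgroup.

    `cgroup_dir` is the directory of the cgroup inside the memory
    controller hierarchy (hypothetical example:
    /sys/fs/cgroup/memory/condor).
    """
    value = int(value)
    if value < 0:
        raise ValueError("swappiness must not be negative")
    control = Path(cgroup_dir) / "memory.swappiness"
    # The kernel reads the new setting from the text written to this file.
    control.write_text(f"{value}\n")
    return control


# Example call (assumed path, run as root):
# set_swappiness("/sys/fs/cgroup/memory/condor", 0)
```

A setting written this way does not survive a reboot, so it would normally be applied from whatever creates the cgroup at boot time.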

When jobs hit memory limits, they are put into the HOLD state with an appropriate message and hold code. If you want them to be re-run, you could use PeriodicRelease. If you want the job to be automatically edited by the system, you may want to look at the JobRouter, which can do periodic job edits.
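
As a rough sketch of the automatic re-run option, a periodic_release expression in the submit description can release jobs that were held for one specific reason. The helper below only builds that line of text; the function name is hypothetical, and the hold code is passed in because the numeric code for memory-limit holds should be checked against the manual for your HTCondor version (34 in the example is an assumption based on recent manuals). HoldReasonCode and NumJobStarts are job ClassAd attributes. Releasing alone does not raise the job's memory request, so the job may simply hit the same limit again.

```python
def periodic_release_line(hold_code, max_starts):
    """Build a submit-file line that releases jobs held with `hold_code`,
    but only while the job has started fewer than `max_starts` times."""
    expr = (
        f"(HoldReasonCode == {int(hold_code)})"
        f" && (NumJobStarts < {int(max_starts)})"
    )
    return f"periodic_release = {expr}"


# Assumed values: hold code 34, at most 3 starts.
print(periodic_release_line(34, 3))
# prints: periodic_release = (HoldReasonCode == 34) && (NumJobStarts < 3)
```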

That said, I think it's much better to hold the job and let the user examine it manually. If they aren't informed of the issue, they may never be aware of it (and waste thousands of CPU hours - I've seen it!).

Brian
