[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Priority calculation: memory



Mathieu,

Having used partitionable slots since the first installation of our pool back in February 2013, I can jump in here with some useful information for you.

Our setup is a basic partitionable slot config with a single type-1 slot that advertises 100% of the CPUs, memory, and disk. We used it on every member of the pool until recently, when I ran into trouble getting parallel universe jobs to cooperate with partitionable slots; since then I've set up a handful of machines with static slots to support MATLAB Parallel Computing Toolbox MPI communicating jobs for distributed arrays, parfor, and the like.
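
For reference, the core of that setup is tiny - something along these lines (a minimal sketch of the idea, not our full config):

    NUM_SLOTS_TYPE_1          = 1
    SLOT_TYPE_1               = cpus=100%, memory=100%, disk=100%
    SLOT_TYPE_1_PARTITIONABLE = TRUE

The startd then carves dynamic slots out of that one big slot as jobs request resources.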

All the pool members are RHEL6 systems, so they use the cgroups system for resource tracking, which is quite spiffy.
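
The HTCondor side of the cgroup hookup is basically one knob; the sketch below assumes you've already created an "htcondor" control group in /etc/cgconfig.conf with the cpu, cpuacct, memory, freezer, and blkio controllers:

    BASE_CGROUP = htcondor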

We had to go with partitionable slots because the jobs we needed to run were so diverse. Some jobs needed 500MB of memory, and others might need 20GB, depending on the nature of the scenario. Some of the fancier jobs needed multiple CPUs, and some of the continuous integration build scripts ran "make -j 8" for 8 compile threads. We've even got some jobs using GPUs, again with a wide range of memory requirements, plus stacks of MATLAB jobs under a few different versions of MATLAB. Partitionable slots were really the only choice for us.
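
To give a flavor of it, a typical submit description ends up looking something like this (a hypothetical sketch - the executable name and numbers are invented):

    universe       = vanilla
    executable     = run_scenario.sh
    request_cpus   = 8
    request_memory = 20480
    # only on pools/versions that advertise GPUs
    request_gpus   = 1
    queue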

"HTCondor-users" <htcondor-users-bounces@xxxxxxxxxxx> wrote on 09/04/2015 03:05:37 AM:

> From: Mathieu Bahin <mathieu.bahin@xxxxxxxxxxxxxxx>

> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Date: 09/04/2015 03:06 AM
> Subject: Re: [HTCondor-users] Priority calculation: memory
> Sent by: "HTCondor-users" <htcondor-users-bounces@xxxxxxxxxxx>
>
> Thanks Greg for this quick and precise answer, maybe we won't take the
> risk to adjust that then.
>
> Actually, we wonder how things will be with the partitionable slots.
> From what we understand:
>   - a default max memory is allocated to the job if nothing special is
> specified
>   - if the job exceed this memory, the job is aborted


By default, jobs exceeding their memory request are not aborted; you need to write a periodic hold expression to do that. Page 211 of the 8.2.9 manual shows an example of how to do it. The CGROUP_MEMORY_LIMIT_POLICY setting governs how cgroup memory limits are applied to jobs, but doesn't by itself evict them - see page 243 of the 8.2.9 manual. Docker universe jobs in 8.3, and the upcoming 8.4, do have a lethal electric fence for memory allocations, however.
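
If you did want jobs held when they blow past their request, a schedd-side policy along these lines would do it (a hedged sketch, not the manual's exact example):

    # Put running jobs on hold once measured memory exceeds the request (both in MB)
    SYSTEM_PERIODIC_HOLD = (JobStatus == 2) && (MemoryUsage > RequestMemory)

    # On the execute nodes: how cgroup memory limits are applied ("hard",
    # "soft", or "none") - this shapes allocations, it doesn't evict jobs
    CGROUP_MEMORY_LIMIT_POLICY = soft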

In RHEL6 and up, the kernel has an "out of memory killer" which protects the system from crashing when a process exhausts all physical memory, so that's the main defense we rely on in our pools. In keeping with the OOM killer's "principle of least surprise," it's almost always the overly bloated process which gets nailed, and the kill shows up in the syslog, so it's easy to diagnose. Under RHEL5, before we had the OOM killer working for us, the system would either panic or thrash in swap space, and it was virtually impossible to figure out what went wrong.

Since we were migrating from Grid Engine, it would have been far too disruptive to kill jobs exceeding our default 1GB memory allocation: at the outset, virtually none of the jobs or the users submitting them had any idea how much memory the jobs needed. It was pretty routine for Grid Engine to fire up 24 jobs on 24 cores, each needing 10GB, on a system with 48GB of physical memory, and it took me two weeks to figure out how to configure it to treat memory as a consumable resource. Occasionally a job would try to allocate dozens or hundreds of TERAbytes of memory by sizing an array from uninitialized variables; it would dutifully suck down gigabyte after gigabyte of physical memory for several minutes until the machine hung or crashed, and nobody could figure out the root cause until HTCondor, cgroups, and the OOM killer came along.

Now that more and more users are getting the hang of everything, the memory requests are much more accurate and the partitionable slots work like a charm.

One interesting thing you can do is set up your RequestMemory expression to vary based on NumJobStarts - if a job gets OOM-killed, by the time it is rematched its NumJobStarts will be greater than zero, and you can use a ClassAd expression to increase the amount of memory requested for the job's next attempt.
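
A hypothetical submit-file sketch of that idea - ask for 2GB on the first attempt and 6GB on any attempt after the job has already started (and presumably been killed) once:

    request_memory = ifThenElse(NumJobStarts > 0, 6144, 2048)

Depending on your hold policy, you may also need a periodic_release so the job actually gets that second attempt.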

Adjusting the slot weight isn't "risky" in the traditional sense - the issue is that you need to come up with a mathematical expression that arrives at a fair assessment of a user's resource use across a wide range of values. For instance, someone using a single CPU and 32GB of memory on a machine which has 512GB of memory is not having the same kind of impact on pending jobs as someone using 32 CPUs with 1GB of memory each on six different machines. Assessing the utilization of those two users fairly can be difficult to get just right. I've never changed SLOT_WEIGHT in any of our pools, and it's been rare to encounter a situation where I even thought about doing it.

I could imagine doing something based on the amount of physical memory per available CPU core - detected memory divided by detected cores - and then charging someone two slot weights if they use one CPU core but two cores' worth of memory. You'd probably want to use NUM_CPUS rather than DETECTED_CORES, since not all cores may be advertised, and with RESERVED_MEMORY not all memory may be either, i.e., memory_per_core = ( $(DETECTED_MEMORY) - $(RESERVED_MEMORY) ) / $(NUM_CPUS). But if you're advertising more cores than are physically available, as in Todd's suspendable-slot example, you'd have to finagle that further. And would you want to charge them two if they only used one and a half CPUs' worth of memory? Etc., etc...
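
A hedged sketch of how that might look in the config, assuming NUM_CPUS and RESERVED_MEMORY are set explicitly so the arithmetic is well-defined (this is not something we run in production):

    NUM_CPUS        = $(DETECTED_CORES)
    RESERVED_MEMORY = 0
    MEMORY_PER_CORE = ( ($(DETECTED_MEMORY) - $(RESERVED_MEMORY)) / $(NUM_CPUS) )

    # Charge whichever is larger: the cores a slot holds, or its memory
    # expressed in "cores worth" of memory. Cpus and Memory are the slot's
    # own ClassAd attributes.
    SLOT_WEIGHT = ifThenElse( real(Memory) / $(MEMORY_PER_CORE) > Cpus, \
                              ceiling( real(Memory) / $(MEMORY_PER_CORE) ), \
                              Cpus )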


> The cluster is composed of machines with very different caracteristics
> (memory from [8G, 8 cores] to [192G, 16 cores]) so it's not easy to
> setup a default memory.


It's not so much the characteristics of the machines as the characteristics of the typical jobs that should guide the choice of a default memory size.

We went with 1GB, since the majority of our jobs use about 500-1000 megabytes of physical memory. Even though most of the machines had 4GB per CPU core, the 1GB default carved out enough of an allocation to limit the impact of a job or two ballooning unexpectedly - on a 24-core/96GB machine, if you've got 23 jobs behaving well and one job ballooning, that 24th job could grow to around 70GB of physical memory before waking the OOM killer and running the risk of execution. Remember, the default memory request is not, by itself, a fatal barrier for the job.
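
The default itself is a single submit-side config knob, in megabytes; a minimal sketch:

    # Jobs that don't specify request_memory get 1GB
    JOB_DEFAULT_REQUESTMEMORY = 1024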

As you get the hang of things, and look at the UserLog files for the "requested / used" numbers reported at job completion, you'll be able to help your users dial in the right number for their memory requests.

For the 8GB/8-core machine, a 1GB default would never let an 8th job match, because due to overhead the partitionable slot won't advertise a full 8192 MB, so you may want to go with 750MB there instead. If it's a desktop system, however, you'll want to leave that 8th core unoccupied anyway - I've found desktop machines can get a bit thrashed with every core running a CPU-intensive job, even at nice -19.
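
One way to leave that core free on the desktops - hedged, since I haven't tested this exact line - is to advertise one fewer core than the hardware reports:

    # Desktop execute nodes: keep one core for the console user
    NUM_CPUS = $(DETECTED_CORES) - 1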

> What we are afraid of is that users, tired with having jobs aborted,
> always request a very large amount of memory for their jobs.
>
> Have we misunderstood something? Do you have some advice about that?

I hope the above is just what you need.

It's interesting to watch the graph of claimed/busy slots when jobs requesting larger amounts of memory are queued up - a system with 24 cores and 96GB only fits 12 8GB jobs, so half the cores sit idle and the graph drops off, but the pool's memory is still fully utilized.

Good luck!

        -Michael Pelletier.