
Re: [HTCondor-users] dynamic allocation of RAM



Hi Thomas,

How are you killing off jobs?  Are you using cgroups for enforcement?

If so, it might be worth looking at soft enforcement in combination with what Dimitri suggested.  With a soft limit, jobs may go over their memory request until the machine itself is out of memory.  At that point, the over-limit jobs are killed first.

However, I do suggest keeping the recipe from Dimitri / Lauren - you don’t want to give users an unlimited pass, because otherwise they never learn their real memory requirements.

Brian

On Mar 15, 2016, at 2:45 PM, Thomas Hartmann <thomas.hartmann@xxxxxxxx> wrote:

thank you for the suggestion, dimitri. but if i understand correctly what is happening there, a job that exceeds the limit will be put on hold and then rescheduled, even if it would be possible to simply increase request_memory on the same machine. we cannot work with checkpoints here (at least not using htcondor's standard universe), so jobs would need to rerun from the very beginning.

if it were possible to update the requirements of a job while it is running and, after checking that the job can remain on the machine under the new requirements, let it keep running, that would be great for my use case.

don't get me wrong: yours is a wonderful suggestion and if this extra bit is not possible i will definitely test it!

thanks again,
thomas

On 2016-03-15 at 18:50, Dimitri Maziuk wrote:
On 03/15/2016 08:24 AM, Thomas Hartmann wrote:

2. handle RAM allocations more dynamically. for instance:
2.1. if a job wants to use more RAM than previously requested, see
whether the machine on which it runs still has this amount of RAM
available.
2.2. if it does, update the request_memory to a safe value and continue
running the job.
2.3. if the extra RAM is not available, stop the job, update the
request_memory to a safe value and put it back into the queue.
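
The three steps above can be sketched as a small decision function. This is only an illustration of the requested policy, not an existing HTCondor feature: the function name, the action labels and the 50% headroom used as the "safe value" are all hypothetical.

```python
def decide(requested_mb, needed_mb, machine_free_mb, safety_factor=1.5):
    """Return (action, new_request_mb) for a job that wants more RAM.

    Hypothetical sketch of steps 2.1-2.3; safety_factor is an assumed
    definition of "a safe value" and is not taken from the thread.
    """
    if needed_mb <= requested_mb:
        # The job still fits in what it originally asked for.
        return ("continue", requested_mb)

    new_request = int(needed_mb * safety_factor)
    extra = new_request - requested_mb

    # Step 2.1: does the machine still have the extra RAM available?
    if extra <= machine_free_mb:
        # Step 2.2: raise the request and keep running in place.
        return ("grow_in_place", new_request)

    # Step 2.3: stop the job and requeue it with the larger request.
    return ("requeue", new_request)
```

For example, decide(800, 1000, 2000) chooses to grow in place, while the same job on a machine with only 500 MB free would be requeued.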
Courtesy of Lauren Michael:

2) The below lines added to the submit file will allow the jobs to
self-police MemoryUsage, and will adjust the memory request in response
(though "request_memory" would need to be replaced in the submit file, not
added).
+MemoryUsage = ( 800 ) * 2 / 3
request_memory = ( MemoryUsage ) * 3 / 2
periodic_hold = ( MemoryUsage >= ( ( RequestMemory ) * 3 / 2 ) )
periodic_release = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 180) && (HoldReasonCode != 34)

These lines essentially say:
Set the "request_memory" ("RequestMemory" in the job classad) to be a
function of MemoryUsage, and artificially set the MemoryUsage to an initial
value (800 MB * 2/3).
Put the job on hold if the (real) MemoryUsage goes 50% above the current
RequestMemory value.
Release the held job (if held for the memory reason, and held for at least
3 minutes), so that it will be matched to run again on a compute "slot"
with more memory (according to the new RequestMemory value).
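
To see how the request grows over repeated hold/release cycles, here is a rough Python model of the description above. It follows the explanation in this message rather than verified HTCondor evaluation semantics, and it makes two simplifying assumptions: rounding is ignored, and the MemoryUsage recorded when the job is held is taken to be exactly the hold threshold (in practice it could be higher). The function name and parameters are made up for the illustration.

```python
from fractions import Fraction

GROWTH = Fraction(3, 2)  # the "* 3 / 2" factor used in the submit lines


def simulate_requests(peak_mb, initial_request_mb=800, max_attempts=10):
    """Return the RequestMemory value (MB) used for each attempt."""
    # +MemoryUsage = ( 800 ) * 2 / 3
    usage = Fraction(initial_request_mb) / GROWTH
    requests = []
    for _ in range(max_attempts):
        # request_memory = ( MemoryUsage ) * 3 / 2
        request = usage * GROWTH
        requests.append(float(request))
        # periodic_hold fires once usage reaches RequestMemory * 3 / 2
        hold_threshold = request * GROWTH
        if peak_mb < hold_threshold:
            break  # the job never reaches the threshold on this attempt
        # Assumption: usage seen at hold time equals the threshold.
        usage = hold_threshold
    return requests
```

Under these assumptions, a job peaking at 3000 MB would be tried with requests of 800, 1800 and 4050 MB, so each cycle multiplies the request by 2.25. A job peaking at 1000 MB is never held by this expression, even though it exceeds its 800 MB request.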
We removed "HoldReasonCode != 34" and added "periodic_remove = (time() -
QDate) > 500000" and have been running those jobs for quite some time.
What I can't tell you is how many of them actually use that magic: I
won't dig into that until things break and so far they haven't. (Most of
those jobs run in under 800MB.)
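
For reference, the modified policy can be written out as plain Python predicates. This is just a sketch that mirrors the two expressions; the function and parameter names are invented here, and times are in seconds as in the ClassAd expressions.

```python
from datetime import timedelta

HELD = 5  # JobStatus value tested in the periodic_release expression


def should_release(job_status, now, entered_current_status, min_hold_s=180):
    """periodic_release without the HoldReasonCode test:
    release any job that has been held for more than min_hold_s."""
    return job_status == HELD and (now - entered_current_status) > min_hold_s


def should_remove(now, qdate, max_age_s=500000):
    """periodic_remove = (time() - QDate) > 500000"""
    return (now - qdate) > max_age_s


# 500000 seconds expressed as days plus leftover seconds
max_age = timedelta(seconds=500000)
```

Here max_age works out to 5 days plus 68000 seconds, so a job is removed a little under six days after it was queued, whatever state it is in.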



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

-- 
Dr. Thomas Hartmann

Centre for Cognitive Neuroscience
FB Psychologie
Universität Salzburg
Hellbrunnerstraße 34/II
5020 Salzburg

Tel: +43 662 8044 5109
Email: thomas.hartmann@xxxxxxxx

"I am a brain, Watson. The rest of me is a mere appendix. " (Arthur Conan Doyle)