
Re: [HTCondor-users] dynamic allocation of RAM

thank you for the suggestion, dimitri. but if i understand correctly what is happening there, a job that exceeds the limit will be put on hold and then rescheduled, even if it would be possible to simply increase request_memory on the same machine. we cannot work with checkpoints here (at least not using htcondor's standard universe), so jobs would have to rerun from the very beginning.

it would be great for my use case if the requirements of a job could be updated while it is running, after checking that the job can remain on the machine under the new requirements.

don't get me wrong: yours is a wonderful suggestion and if this extra bit is not possible i will definitely test it!

thanks again,

On 2016-03-15 at 18:50, Dimitri Maziuk wrote:
On 03/15/2016 08:24 AM, Thomas Hartmann wrote:

2. handle RAM allocations more dynamically. for instance:
2.1. if a job wants to use more RAM than previously requested, see
whether the machine on which it runs still has this amount of RAM
2.2. if it does, update the request_memory to a safe value and continue
running the job.
2.3. if the extra RAM is not available, stop the job, update the
request_memory to a safe value and put it back into the queue.
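The decision described in 2.1 to 2.3 can be sketched as follows. This is only an illustration of the requested behaviour, not something HTCondor is known to offer: the function name, the `machine_free_mb` argument and the 1.5x headroom used as the "safe value" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str          # "continue", "grow" or "requeue"
    request_memory: int  # memory request in MB after the decision


def decide(usage_mb, request_mb, machine_free_mb, headroom=1.5):
    """Hypothetical policy for steps 2.1-2.3.

    machine_free_mb is the memory on the machine that is not yet
    claimed by any job (i.e. beyond this job's current request).
    """
    if usage_mb <= request_mb:
        # The job stays within its request: nothing to do.
        return Decision("continue", request_mb)

    # "Safe value": assumed here to be current usage plus 50% headroom.
    safe_mb = int(usage_mb * headroom)
    extra_mb = safe_mb - request_mb

    if extra_mb <= machine_free_mb:
        # 2.2: the machine can cover it, so raise the request in place.
        return Decision("grow", safe_mb)

    # 2.3: not enough room, so stop the job and requeue it with the new request.
    return Decision("requeue", safe_mb)
```

For example, a job that requested 800 MB and now uses 1000 MB would stay on a machine with 1000 MB unclaimed, but would be requeued on one with only 500 MB unclaimed.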
Courtesy of Lauren Michael:

2) The below lines added to the submit file will allow the jobs to
self-police MemoryUsage, and will adjust the memory request in response
(though "request_memory" would need to be replaced in the submit file, not
+MemoryUsage = ( 800 ) * 2 / 3
request_memory = ( MemoryUsage ) * 3 / 2
periodic_hold = ( MemoryUsage >= ( ( RequestMemory ) * 3 / 2 ) )
periodic_release = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 180) && (HoldReasonCode != 34)

These lines essentially say:
- Set the "request_memory" ("RequestMemory" in the job classad) to be a
  function of MemoryUsage, and artificially set the MemoryUsage to an initial
  value (800 MB * 2/3).
- Put the job on hold if the (real) MemoryUsage goes 50% above the current
  RequestMemory value.
- Release the held job (if held for the memory reason, and held for at least
  3 minutes), so that it will be matched to run again on a compute "slot"
  with more memory (according to the new RequestMemory value).
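To make the arithmetic concrete, here is a small Python model of the three expressions. It is a sketch, not HTCondor code: the function names are made up for the example, and it assumes the ClassAd expressions are evaluated with truncating integer division (hence `//`). If the values were treated as reals, the first request would come out as 800 rather than 799.

```python
HELD = 5  # JobStatus value that the periodic_release expression tests for


def request_memory(memory_usage):
    # request_memory = ( MemoryUsage ) * 3 / 2
    return memory_usage * 3 // 2


def periodic_hold(memory_usage, current_request):
    # periodic_hold = ( MemoryUsage >= ( ( RequestMemory ) * 3 / 2 ) )
    return memory_usage >= current_request * 3 // 2


def periodic_release(job_status, now, entered_current_status, hold_reason_code):
    # Held for more than 180 seconds and HoldReasonCode is not 34.
    return (job_status == HELD
            and (now - entered_current_status) > 180
            and hold_reason_code != 34)


# Artificial starting value: +MemoryUsage = ( 800 ) * 2 / 3
initial_usage = 800 * 2 // 3
first_request = request_memory(initial_usage)

print(initial_usage, first_request)         # 533 799
print(periodic_hold(1000, first_request))   # False: still below 1.5x the request
print(periodic_hold(1200, first_request))   # True: the job would go on hold
print(request_memory(1200))                 # 1800: request after release
```

Under these assumptions a job that grows to 1200 MB is held, and once released it asks for 1800 MB on its next match.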
We removed "HoldReasonCode != 34" and added "periodic_remove = (time() -
QDate) > 500000" and have been running those jobs for quite some time.
What I can't tell you is how many of them actually use that magic: I
won't dig into that until things break and so far they haven't. (Most of
those jobs run in under 800MB.)
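For a sense of scale, the 500000 in that periodic_remove expression is a number of seconds since submission (QDate). A quick Python sketch of the same comparison, with a hypothetical function name and plain integer timestamps standing in for time() and QDate:

```python
from datetime import timedelta

REMOVE_AFTER_SECONDS = 500000  # threshold from the periodic_remove expression


def periodic_remove(now, qdate):
    # Mirrors: periodic_remove = (time() - QDate) > 500000
    return (now - qdate) > REMOVE_AFTER_SECONDS


# Express the threshold in days, hours, minutes and seconds.
print(timedelta(seconds=REMOVE_AFTER_SECONDS))  # 5 days, 18:53:20
```

So a job is removed once it has been in the queue for a little under six days, whether or not it has been held and released along the way.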

Dr. Thomas Hartmann

Centre for Cognitive Neuroscience
FB Psychologie
Universität Salzburg
Hellbrunnerstraße 34/II
5020 Salzburg

Tel: +43 662 8044 5109
Email: thomas.hartmann@xxxxxxxx

"I am a brain, Watson. The rest of me is a mere appendix." (Arthur Conan Doyle)