Thank you for the suggestion, Dimitri. But if I understand correctly
what is happening there, a job that exceeds the limit will be put on
hold and then rescheduled, even if it would be possible to simply
increase request_memory on the same machine. We cannot work with
checkpoints here (at least not using HTCondor's standard universe),
so this would mean that jobs would have to rerun from the very
beginning.

It would be great for my use case if there were a way to update the
requirements of a job while it is running and then check whether,
under the new requirements, the job can remain on the machine.

Don't get me wrong: yours is a wonderful suggestion, and if this extra
bit is not possible I will definitely test it!

Thanks again,
Thomas

On 2016-03-15 at 18:50, Dimitri Maziuk wrote:
On 03/15/2016 08:24 AM, Thomas Hartmann wrote:

2. handle RAM allocations more dynamically. for instance:
2.1. if a job wants to use more RAM than previously requested, see
     whether the machine on which it runs still has this amount of RAM
     available.
2.2. if it does, update the request_memory to a safe value and continue
     running the job.
2.3. if the extra RAM is not available, stop the job, update the
     request_memory to a safe value and put it back into the queue.

Courtesy of Lauren Michael:

2) The below lines added to the submit file will allow the jobs to
self-police MemoryUsage, and will adjust the memory request in response
(though "request_memory" would need to be replaced in the submit file,
not added).

    +MemoryUsage = ( 800 ) * 2 / 3
    request_memory = ( MemoryUsage ) * 3 / 2
    periodic_hold = ( MemoryUsage >= ( ( RequestMemory ) * 3 / 2 ) )
    periodic_release = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 180) && (HoldReasonCode != 34)

These lines essentially say:

Set the "request_memory" ("RequestMemory" in the job classad) to be a
function of MemoryUsage, and artificially set the MemoryUsage to an
initial value (800 MB * 2/3).

Put the job on hold if the (real) MemoryUsage goes 50% above the
current RequestMemory value.

Release the held job (if held for the memory reason, and held for at
least 3 minutes), so that it will be matched to run again on a compute
"slot" with more memory (according to the new RequestMemory value).

We removed "HoldReasonCode != 34" and added "periodic_remove = (time() -
QDate) > 500000" and have been running those jobs for quite some time.
What I can't tell you is how many of them actually use that magic: I
won't dig into that until things break and so far they haven't. (Most
of those jobs run in under 800MB.)

-- 
Dr. Thomas Hartmann
Centre for Cognitive Neuroscience
FB Psychologie
Universität Salzburg
Hellbrunnerstraße 34/II
5020 Salzburg
Tel: +43 662 8044 5109
Email: thomas.hartmann@xxxxxxxx

"I am a brain, Watson. The rest of me is a mere appendix."
(Arthur Conan Doyle)
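
For reference, the hold/release policy in the quoted submit file can be sketched as a small toy model. This is plain Python, not HTCondor code, and it rests on two assumptions that are not in the thread: the expressions are evaluated with integer arithmetic, and a held job's recorded MemoryUsage equals the threshold that triggered the hold. The function names are hypothetical.

```python
from typing import List

# Toy model of the quoted submit-file policy; NOT HTCondor code.
INITIAL_GUESS_MB = 800  # same starting value as in the quoted submit file


def request_memory(memory_usage_mb: int) -> int:
    """Mirror of: request_memory = ( MemoryUsage ) * 3 / 2 (integer math assumed)."""
    return memory_usage_mb * 3 // 2


def should_hold(memory_usage_mb: int, requested_mb: int) -> bool:
    """Mirror of: periodic_hold = MemoryUsage >= RequestMemory * 3 / 2."""
    return memory_usage_mb >= requested_mb * 3 // 2


def simulate(peak_usage_mb: int, max_restarts: int = 10) -> List[int]:
    """Return the successive memory requests for a job with the given real peak usage.

    Every entry after the first corresponds to one hold/release cycle,
    i.e. one rerun of the job from the very beginning.
    """
    usage = INITIAL_GUESS_MB * 2 // 3  # the artificial +MemoryUsage value
    requests: List[int] = []
    for _ in range(max_restarts):
        requested = request_memory(usage)
        requests.append(requested)
        if not should_hold(peak_usage_mb, requested):
            break  # peak stays below 150% of the request: job is never held
        # Assumption: the job is held as soon as usage reaches the threshold,
        # so the recorded MemoryUsage is the threshold itself.
        usage = requested * 3 // 2
    return requests


if __name__ == "__main__":
    print(simulate(700))   # a job that fits the initial request
    print(simulate(3000))  # a job that needs two hold/release cycles
```

Under these assumptions each hold/release cycle multiplies the request by roughly 9/4, so a job converges quickly, but every cycle is a full restart, which is exactly the cost discussed above for jobs that cannot checkpoint.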