On 03/15/2016 08:24 AM, Thomas Hartmann wrote:

> 2. handle RAM allocations more dynamically. for instance:
> 2.1. if a job wants to use more RAM than previously requested, see
> whether the machine on which it runs still has this amount of RAM
> available.
> 2.2. if it does, update the request_memory to a safe value and continue
> running the job.
> 2.3. if the extra RAM is not available, stop the job, update the
> request_memory to a safe value and put it back into the queue.

Courtesy of Lauren Michael:

> 2) The below lines added to the submit file will allow the jobs to
> self-police MemoryUsage, and will adjust the memory request in response
> (though "request_memory" would need to be replaced in the submit file, not
> added).
>
> +MemoryUsage = ( 800 ) * 2 / 3
> request_memory = ( MemoryUsage ) * 3 / 2
> periodic_hold = ( MemoryUsage >= ( ( RequestMemory ) * 3 / 2 ) )
> periodic_release = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 180) && (HoldReasonCode != 34)
>
> These lines essentially say:
> Set the "request_memory" ("RequestMemory" in the job classad) to be a
> function of MemoryUsage, and artificially set the MemoryUsage to an initial
> value (800 MB * 2/3).
> Put the job on hold if the (real) MemoryUsage goes 50% above the current
> RequestMemory value.
> Release the held job (if held for the memory reason, and held for at least
> 3 minutes), so that it will be matched to run again on a compute "slot"
> with more memory (according to the new RequestMemory value).

We removed "HoldReasonCode != 34" and added
"periodic_remove = (time() - QDate) > 500000" and have been running
those jobs for quite some time.

What I can't tell you is how many of them actually use that magic: I
won't dig into that until things break and so far they haven't. (Most of
those jobs run in under 800MB.)

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
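
For reference, the variant described above (Lauren's lines, minus the HoldReasonCode clause, plus the periodic_remove) would look roughly like the fragment below. This is only a sketch: the executable name and the bare "queue" statement are placeholders that do not come from this thread, and the comments are my reading of the expressions, not a tested configuration.

```text
# Placeholder job definition (hypothetical, not from the thread).
executable = my_job.sh

# Seed MemoryUsage so the initial request works out to roughly 800 MB.
+MemoryUsage = ( 800 ) * 2 / 3
request_memory = ( MemoryUsage ) * 3 / 2

# Hold the job once its memory usage reaches 1.5x the current request.
periodic_hold = ( MemoryUsage >= ( ( RequestMemory ) * 3 / 2 ) )

# Release a held job (JobStatus == 5) after 180 seconds on hold.
# The "&& (HoldReasonCode != 34)" clause from the original is dropped.
periodic_release = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 180)

# Remove the job once it has been in the queue for more than 500000 seconds.
periodic_remove = (time() - QDate) > 500000

queue
```

Two things to keep in mind with this variant: without the HoldReasonCode clause the release expression no longer looks at why the job was held, so any held job is released after three minutes, and 500000 seconds is a little under six days, after which the job is removed no matter what state it is in.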