Re: [HTCondor-users] htcondor cgroups and memory limits on CentOS7

On 10/23/2017 10:07 AM, Alessandra Forti wrote:


Because the (system_)periodic_remove expressions are evaluated by the condor_shadow while the job is running, and the *_RAW attributes are only updated in the condor_schedd.

A simple solution is to use the MemoryUsage attribute instead of ResidentSetSize_RAW. I think things will work as you want if you use this instead:

 RemoveMemoryUsage = ( MemoryUsage > 2*RequestMemory )
 SYSTEM_PERIODIC_REMOVE = $(RemoveMemoryUsage) || <OtherParameters>

Let me get this straight: if I replace ResidentSetSize_RAW with MemoryUsage, it should work?

Yes, that is correct, with MemoryUsage it should work.

Also, you may want to place your memory limit policy on the execute nodes via a startd policy expression, instead of having it enforced on the submit machine (what I think you are calling the head node). The reason is that the execute node policy is evaluated every five seconds, while the submit machine policy is evaluated every several minutes.
I read that the submit machine has evaluated the expression every 60 seconds since version 7.4, though admittedly the blog post I read is quite old, so things might have changed again (https://spinningmatt.wordpress.com/2009/12/05/cap-job-runtime-debugging-periodic-job-policy-in-a-condor-pool/).

But realize that there is a lot of polling going on here. The condor_starter on the execute machine (worker node) polls the operating system for the resource utilization of the job, and sends updated job attributes like MemoryUsage to both the condor_startd and the condor_shadow every STARTER_UPDATE_INTERVAL seconds (300 seconds by default). Then, for a running job, the condor_shadow evaluates your SYSTEM_PERIODIC_REMOVE expression every PERIODIC_EXPR_INTERVAL seconds (60 seconds by default). The condor_shadow also pushes updated job attributes up to the condor_schedd every SHADOW_QUEUE_UPDATE_INTERVAL seconds (900 seconds by default).

The above polling/update parameters are set how they are by default to limit the update rates to accommodate one schedd managing many thousands of live jobs.

So, given the above, note that with the default config your SYSTEM_PERIODIC_REMOVE expression could take up to 5 or 6 minutes to remove a large-memory job. And if you are monitoring the MemoryUsage and/or ResidentSetSize job attributes via condor_q, it will take 15 minutes (up to 20 minutes) for condor_q to show a MemoryUsage spike.
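
To make the arithmetic explicit, here is a sketch of the three knobs set to the default values quoted above, with the worst-case delays they imply worked out in the comments. The numbers are the defaults as stated in this message; please check them against the manual for your version before relying on them.

```
# condor_starter -> condor_shadow / condor_startd updates (MemoryUsage etc.)
STARTER_UPDATE_INTERVAL = 300

# condor_shadow re-evaluates periodic expressions for running jobs
PERIODIC_EXPR_INTERVAL = 60

# condor_shadow -> condor_schedd job attribute updates
SHADOW_QUEUE_UPDATE_INTERVAL = 900

# Worst case before SYSTEM_PERIODIC_REMOVE can fire on a memory spike:
#   STARTER_UPDATE_INTERVAL + PERIODIC_EXPR_INTERVAL = 300 + 60 = 360 s (6 minutes)
# Worst case before condor_q shows the spike:
#   STARTER_UPDATE_INTERVAL + SHADOW_QUEUE_UPDATE_INTERVAL = 300 + 900 = 1200 s (20 minutes)
```

Lowering any of these shortens the delay, at the cost of more update traffic for the schedd to handle.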

A runaway job could consume a lot of memory in a few minutes :).
Do you mean I should move SYSTEM_PERIODIC_REMOVE to the WN? Or is there another recipe?

Yes, if you have control over the config of the worker node, it may be better to configure the worker node to simply kill a job that exceeds your memory policy instead of waiting for the memory usage information to propagate back to the submit node. By default, the worker node would kill the job either immediately, if you use cgroups with the hard memory policy, or within 5 seconds, if you want a custom PREEMPT expression. A custom expression could state things like "only kill if the job is using 2x the provisioned memory" (I still don't understand why you want to allow the job to use twice the memory it requested...).
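
As a rough sketch of the two worker-node options just mentioned (an illustration, not a tested recipe): CGROUP_MEMORY_LIMIT_POLICY, WANT_SUSPEND, WANT_HOLD and WANT_HOLD_REASON are startd settings not named earlier in this thread, the MEMORY_EXCEEDED macro name is made up for this example, and the 2x threshold simply mirrors the submit-side expression above. The sketch also assumes PREEMPT and WANT_SUSPEND already have values in your config, and that MemoryUsage and RequestMemory are both in MB.

```
# Option 1: let the kernel enforce the limit via cgroups.
# With the hard policy the job is stopped as soon as it exceeds its limit.
CGROUP_MEMORY_LIMIT_POLICY = hard

# Option 2: a custom startd policy, evaluated on the worker node.
# MEMORY_EXCEEDED is a hypothetical helper macro, not a built-in knob.
MEMORY_EXCEEDED = (MemoryUsage =!= UNDEFINED && MemoryUsage > 2 * TARGET.RequestMemory)

# Evict the job when the limit is exceeded, keeping any existing PREEMPT policy.
PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))

# Do not merely suspend a job that is over the limit.
WANT_SUSPEND = ($(WANT_SUSPEND)) && (($(MEMORY_EXCEEDED)) =!= TRUE)

# Put the evicted job on hold (with a reason) rather than letting it rerun.
WANT_HOLD = ($(MEMORY_EXCEEDED))
WANT_HOLD_REASON = ifThenElse($(MEMORY_EXCEEDED), "Job used more than twice its requested memory", undefined)
```

You would normally pick one of the two options; if you only want the plain "kill at the requested memory" behaviour, Option 1 alone should be enough.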

Hope the above helps,

Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685