Re: [HTCondor-users] SYSTEM_PERIODIC_HOLD ignored



On 8/27/2021 2:42 AM, Stefano Dal Pra wrote:

Experts here might want to confirm: I think that some job ClassAd attributes (such as ResidentSetSize) are actually updated every 15 minutes.
If that is true, this policy could put a job on hold now based on a value measured up to 15 minutes earlier.

It is a bit complicated...

The condor_starter on the execute node sends updates to the condor_shadow every 5 minutes by default, carrying dynamic job attributes such as ResidentSetSize. How often the starter updates the shadow is controlled by the condor_starter config knobs STARTER_UPDATE_INTERVAL and STARTER_INITIAL_UPDATE_INTERVAL (how long until the first update is sent).

Upon receiving an update from the condor_starter, the condor_shadow for the job evaluates job policy expressions like SYSTEM_PERIODIC_HOLD. While a job is running, these expressions are evaluated by the condor_shadow rather than the schedd, which offloads work from the schedd.

Then, at a lower frequency (every 15 minutes by default), the condor_shadow pushes those updated attributes to the schedd so they are visible via condor_q. The lower frequency keeps the schedd from being overloaded when thousands of jobs are running. How often the shadow pushes attributes to the schedd is controlled by the config knob SHADOW_QUEUE_UPDATE_INTERVAL.

So, even though you will only see changes to ResidentSetSize every 15 minutes via condor_q, the SYSTEM_PERIODIC_HOLD expression should be operating on values that are no more than 5 minutes old.
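
To make the two time scales concrete, here is a rough Python sketch of a toy model (not HTCondor code). It assumes both timers start at t=0 and stay aligned, uses the defaults mentioned above (5 and 15 minutes, written in seconds), and ignores STARTER_INITIAL_UPDATE_INTERVAL and any network delay. The function names are made up for illustration.

```python
# Toy model of attribute freshness; all names here are hypothetical.
STARTER_UPDATE_INTERVAL = 300       # starter -> shadow, 5 minutes
SHADOW_QUEUE_UPDATE_INTERVAL = 900  # shadow -> schedd, 15 minutes


def policy_value_age(t, starter_interval=STARTER_UPDATE_INTERVAL):
    """Age (seconds) at time t of the latest value the shadow holds."""
    # The shadow holds the value from the most recent starter update.
    return t % starter_interval


def condor_q_value_age(t,
                       starter_interval=STARTER_UPDATE_INTERVAL,
                       shadow_interval=SHADOW_QUEUE_UPDATE_INTERVAL):
    """Age (seconds) at time t of the value visible in the schedd."""
    # Time of the most recent shadow -> schedd push.
    last_push = (t // shadow_interval) * shadow_interval
    # That push carried the latest starter update available at push time.
    sample_time = (last_push // starter_interval) * starter_interval
    return t - sample_time


if __name__ == "__main__":
    one_hour = range(3600)
    print("worst age seen by policy evaluation:",
          max(policy_value_age(t) for t in one_hour), "s")
    print("worst age seen via condor_q:",
          max(condor_q_value_age(t) for t in one_hour), "s")
```

Under these assumptions the value used for policy evaluation never gets older than just under 5 minutes, while the value shown by condor_q can lag by just under 15 minutes. If the timers are not aligned, the condor_q lag in this model can be somewhat larger.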

Hope this helps,
Todd