
[HTCondor-users] Unexpected hold on jobs



Hi Condor Community,

 

I have an odd issue with a small subset of the jobs we run: they go on hold because a resource limit has been exceeded, for example:

 

LastHoldReason = "Error from slot1_38@xxxxxxxxxxxxxxxxxxxxxxx: Docker job has gone over memory limit of 4100 Mb"

 

However, we haven’t configured any resource limits that put jobs on hold. I also notice that the only ClassAd attribute that appears to match the memory limit in the hold reason is:

 

MemoryProvisioned = 4100
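
For reference, those attributes come straight from the held jobs' ClassAds; a query along the following lines (the job ID is a placeholder) pulls out the hold and memory details:

    # Hold details while the job is still in the queue
    condor_q 1234.0 -af:j HoldReason HoldReasonCode HoldReasonSubCode

    # The same job after it has left the queue, plus the memory numbers
    condor_history 1234.0 -af LastHoldReason RequestMemory MemoryProvisioned MemoryUsage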

 

These jobs are then removed by a SYSTEM_PERIODIC_REMOVE statement that clears down held jobs. My question to the community is: why is the job going on hold in the first place? The only removal limit / PeriodicRemove expression we configure is at the per-job level, shown below:

 

PeriodicRemove = (JobStatus == 1 && NumJobStarts > 0) || ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)
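
For context, this expression is set in the per-job submit description. A stripped-down sketch of the relevant part is below; the values are illustrative rather than our production submit file, and JobMemoryLimit is a custom attribute assumed here to be in KiB so that it compares directly against ResidentSetSize:

    # Sketch only - illustrative values, not the production submit file
    universe        = docker
    docker_image    = example/worker:latest
    request_memory  = 4096

    # Custom attribute referenced by periodic_remove below (KiB assumed)
    +JobMemoryLimit = 4194304

    periodic_remove = (JobStatus == 1 && NumJobStarts > 0) || \
                      ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)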

 

I cannot replicate this behaviour in my testing, and I cannot find any reason why the job went on hold.

 

Researching the relevant ClassAds, I see:

 

MemoryProvisioned

The amount of memory in MiB allocated to the job. With statically-allocated slots, it is the amount of memory space allocated to the slot. With dynamically-allocated slots, it is based upon the job attribute RequestMemory, but may be larger due to the minimum given to a dynamic slot.

At our site we dynamically assign our slots, and the memory request for this job is “RequestMemory = 4096”. I find this even more perplexing as it is a very rare issue: over 90% of the jobs work well and complete, with the same job type, same VO, and same config. Any assistance debugging this issue will be gratefully received.
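
For what it's worth, the rounding the startd applies when carving out a dynamic slot, and the memory it actually provisioned to this job, can be checked with something like the following (the hostname and job ID are placeholders):

    # How the execute node rewrites RequestMemory when carving a dynamic slot
    condor_config_val -startd -name wn001.example.org MODIFY_REQUEST_EXPR_REQUESTMEMORY

    # What the job asked for versus what the slot actually provided
    condor_history 1234.0 -af RequestMemory MemoryProvisioned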

 

Many thanks,

 

Thomas Birkett

Senior Systems Administrator

Scientific Computing Department  

Science and Technology Facilities Council (STFC)

Rutherford Appleton Laboratory, Chilton, Didcot 
OX11 0QX

 
