
Re: [HTCondor-users] Unexpected hold on jobs



The hold in question is coming from Docker itself. It appears that you are running all your worker-node jobs with WANT_Docker (as we do here at Fermilab), so each job runs inside a Docker container whose memory limit is, by default, set equal to or slightly greater than RequestMemory.
The hold comes because Docker detects that the job has gone over that memory limit and terminates the container.
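
For illustration (the condor_starter builds the real command line, so treat this as a rough equivalent rather than the exact invocation), the effect is much the same as launching the container with a hard memory cap:

    docker run --memory=4100m <job image> <job executable and arguments>

Once the processes inside the container exceed that cap, the kernel kills the container and the starter puts the job on hold with the message you quoted.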

At Fermilab we view this as a feature, because Docker is much more prompt about clipping off jobs that are running high on memory than a condor_schedd would be; that metric typically lags by several minutes, and if you have a bad memory leak a job can take over the whole machine in that time.
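
If you would rather have those jobs retry with a larger footprint instead of sitting on hold, one pattern that can go in the submit file is sketched below. It assumes HoldReasonCode 34 is the memory-exceeded hold code on your version, which is worth double-checking against one of your held jobs:

    # Sketch only: grow the request from observed usage and retry a few times
    request_memory   = ifThenElse(MemoryUsage =!= undefined, 2 * MemoryUsage, 4096)
    periodic_release = (HoldReasonCode == 34) && (NumJobStarts <= 3)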

Steve Timm



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Thomas Birkett - STFC UKRI via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, June 21, 2023 5:41 AM
To: condor-users@xxxxxxxxxxx <condor-users@xxxxxxxxxxx>
Cc: Thomas Birkett - STFC UKRI <thomas.birkett@xxxxxxxxxx>
Subject: [HTCondor-users] Unexpected hold on jobs
 

Hi Condor Community,

 

I have an odd issue with a small percentage of the jobs we run: a subset of jobs go on hold due to a resource being exceeded, for example:

 

LastHoldReason = "Error from slot1_38@xxxxxxxxxxxxxxxxxxxxxxx: Docker job has gone over memory limit of 4100 Mb"

 

However, we haven’t configured any resource limits that hold jobs. I also notice that the only ClassAd attribute that appears to match the memory limit is:

 

MemoryProvisioned = 4100

 

These jobs are then removed by a SYSTEM_PERIODIC_REMOVE statement to clear down held jobs. My question to the community is: why is the job going on hold in the first place? The only removal limit / PeriodicRemove statement we configure is set at a per-job level, shown below:

 

PeriodicRemove = (JobStatus == 1 && NumJobStarts > 0) || ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)
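
For completeness, the hold details survive into the history after the remove, so they can be pulled back out with something along these lines:

    condor_history -constraint 'LastHoldReasonCode =!= undefined' \
                   -af ClusterId ProcId LastHoldReasonCode LastHoldReasonSubCode LastHoldReason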

 

I cannot replicate this behaviour in my testing, and I cannot find any reason why the job went on hold.

 

Researching the relevant ClassAds, I see:

 

MemoryProvisioned

The amount of memory in MiB allocated to the job. With statically-allocated slots, it is the amount of memory space allocated to the slot. With dynamically-allocated slots, it is based upon the job attribute RequestMemory, but may be larger due to the minimum given to a dynamic slot.

At our site we dynamically assign our slots, and the requested memory for this job is “RequestMemory = 4096”. I find this even more perplexing as it is a very rare issue: over 90% of the jobs work well and complete, with the same job type, same VO, and same config. Any assistance debugging this issue will be gratefully received.
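
If it is relevant, my understanding is that with dynamically-allocated slots the startd rounds the request up according to MODIFY_REQUEST_EXPR_REQUESTMEMORY, which can be checked on an execute node with something like:

    condor_config_val -startd MODIFY_REQUEST_EXPR_REQUESTMEMORY

although the default of quantize(RequestMemory,{128}) (if I am reading the manual correctly) would leave 4096 unchanged, so that alone does not seem to explain the 4100.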

 

Many thanks,

 

Thomas Birkett

Senior Systems Administrator

Scientific Computing Department  

Science and Technology Facilities Council (STFC)

Rutherford Appleton Laboratory, Chilton, Didcot 
OX11 0QX

 
