
Re: [HTCondor-users] Unexpected hold on jobs



Hi Thomas,

 

Thank you for the direction! I can see that there are cgroup limits set on the container:

 

cat /sys/fs/cgroup/memory/docker/64e0545c5ba6297f66a990c7d09502db2ba315ce5141be0bdb3aac031db74b6e/memory.limit_in_bytes
16777216000

 

Having traced this, I can now see that the limit is applied when the container is created. We have a wrapper that builds the Docker switches at runtime, and the memory limits are defined in this wrapper. The limits are set to 2x the memory requested by the pilot, so that tracks with what we are seeing. Thank you, everyone, for your help in tracing this issue.
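As a quick sanity check on the figure above, the cgroup value read from the container converts to MiB as follows (a small arithmetic sketch; the only input is the limit quoted in this message):

```python
# Convert the cgroup limit observed on the container to MiB.
limit_bytes = 16777216000          # value read from memory.limit_in_bytes
limit_mib = limit_bytes // 2**20   # 1 MiB = 2**20 bytes
print(limit_mib)                   # 16000
```

So the container is capped at exactly 16000 MiB, consistent with a doubled pilot request.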

 

Many thanks,

 

Tom

 

On 21/06/2023, 13:47, "Thomas Hartmann" <thomas.hartmann@xxxxxxx> wrote:

Hi Thomas,

 

can you check if your jobs' cgroups have an OOM limit set

rather than Condor's memory watchdog?

I.e., whether there is a limit set in a process's

   memory.limit_in_bytes

 

e.g., at our site it looks like

 

/sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_25@xxxxxxxxxxxxxxxxx/memory.limit_in_bytes

but your Docker setup is probably on a different path

 

The path should be under the cgroup mount

   > mount | grep cgroup | grep memory

plus a job's process sub-path from

   > grep memory /proc/{PID}/cgroup
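Putting the two commands above together, the limit file is the cgroup memory mountpoint plus the sub-path from /proc/{PID}/cgroup. A minimal sketch of extracting that sub-path from one such line (the example line below is hypothetical, not taken from this thread; field layout assumes cgroup v1):

```shell
# A /proc/PID/cgroup line has the form  hierarchy-ID:controller:sub-path
line="10:memory:/system.slice/condor.service/slot1_25"

# Take the third colon-separated field: the cgroup sub-path.
subpath=$(printf '%s\n' "$line" | cut -d: -f3)
echo "$subpath"

# The limit would then live at:
#   <cgroup-memory-mountpoint>${subpath}/memory.limit_in_bytes
```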

 

Cheers,

   Thomas

 

On 21/06/2023 12.41, Thomas Birkett - STFC UKRI via HTCondor-users wrote:

> Hi Condor Community,

>

> I have an odd issue with a small percentage of jobs we run. We have a

> small subset of jobs that go on hold due to a resource limit being exceeded, for

> example:

>

> LastHoldReason = "Error from slot1_38@xxxxxxxxxxxxxxxxxxxxxxx

> <mailto:slot1_38@xxxxxxxxxxxxxxxxxxxxxxx>: Docker job has gone over

> memory limit of 4100 Mb"

>

> However, we haven’t configured any resource limits to hold jobs. I also

> notice that the only ClassAd attribute matching the memory limit is:

>

> MemoryProvisioned = 4100

>

> These jobs are then removed by a SYSTEM_PERIODIC_REMOVE statement to

> clear down held jobs. My question to the community is: why does the job

> go on hold in the first place? The only configured removal limit /

> PeriodicRemove statement we configure is on a per job level shown below:

>

> PeriodicRemove = (JobStatus == 1 && NumJobStarts > 0) ||

> ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)

>

> I cannot replicate this behaviour in my testing, and I cannot find any

> reason why the job went on hold.

>

> Researching the relevant classads, I see:

>

> MemoryProvisioned

>

> The amount of memory in MiB allocated to the job. With

> statically-allocated slots, it is the amount of memory space allocated

> to the slot. With dynamically-allocated slots, it is based upon the job

> attribute RequestMemory, but may be larger due to the minimum given to a

> dynamic slot.

>

> At our site we dynamically assign our slots and the Request memory for

> this job is “RequestMemory = 4096”. I find this even more perplexing as

> this is a very rare issue with over 90% of the jobs working well and

> completing, same job type, same VO, same config. Any assistance

> debugging this issue will be gratefully received.
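One possible explanation for MemoryProvisioned = 4100 alongside RequestMemory = 4096 (an assumption, not confirmed anywhere in this thread) is a MODIFY_REQUEST_EXPR_REQUESTMEMORY expression on the startd that quantizes memory requests, e.g. to multiples of 100 MiB. A sketch of that rounding, mimicking the ClassAd quantize() function:

```python
import math

def quantize(value, step):
    # Mimics ClassAd quantize(): round value up to the nearest
    # multiple of step.
    return math.ceil(value / step) * step

# Hypothetical: a startd quantizing memory requests to 100 MiB steps
print(quantize(4096, 100))  # 4100
print(quantize(4100, 100))  # 4100 (already a multiple, unchanged)
```

Under that assumption, the dynamic slot (and hence the Docker memory cap) would be provisioned at 4100 MiB even though the job asked for 4096.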

>

> Many thanks,

>

> *Thomas Birkett*

>

> Senior Systems Administrator

>

> Scientific Computing Department

>

> Science and Technology Facilities Council (STFC)

>

> Rutherford Appleton Laboratory, Chilton, Didcot

> OX11 0QX

>


>

>

> _______________________________________________

> HTCondor-users mailing list

> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a

> subject: Unsubscribe

> You can also unsubscribe by visiting

>

> The archives can be found at: