
Re: [HTCondor-users] Unexpected hold on jobs



Hi Thomas,

can you check whether your jobs' cgroups have an OOM memory limit set at the cgroup level rather than by Condor's memory watchdog?
i.e., whether there is a limit set in a process's
  memory.limit_in_bytes

e.g., for us it looks like

/sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_25@xxxxxxxxxxxxxxxxx/memory.limit_in_bytes
but your Docker setup is probably on a different path.

The path should be under the cgroup mount
  > mount | grep cgroup | grep memory
plus a job's process sub-path from
  > grep memory /proc/{PID}/cgroup
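
Put together, a quick check could look like this (only a sketch: the PID 12345 is a placeholder, and a cgroup v1 layout is assumed; on cgroup v2 the file is memory.max instead):

  # sub-path of the job's memory cgroup, third field of /proc/PID/cgroup
  > CGPATH=$(grep memory /proc/12345/cgroup | cut -d: -f3)
  # mount point of the memory controller
  > CGROOT=$(mount | grep cgroup | grep memory | awk '{print $3}')
  # effective hard limit for that job, in bytes
  > cat ${CGROOT}${CGPATH}/memory.limit_in_bytes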

Cheers,
  Thomas

On 21/06/2023 12.41, Thomas Birkett - STFC UKRI via HTCondor-users wrote:
Hi Condor Community,

I have an odd issue with a small percentage of the jobs we run. A small subset of jobs go on hold due to a resource limit being exceeded, for example:

LastHoldReason = "Error from slot1_38@xxxxxxxxxxxxxxxxxxxxxxx: Docker job has gone over memory limit of 4100 Mb"

However, we haven't configured any resource limits to hold jobs. I also notice that the only ClassAd that appears to match the memory limit is:

MemoryProvisioned = 4100

These jobs are then removed by a SYSTEM_PERIODIC_REMOVE statement that clears down held jobs. My question to the community is: why is the job going on hold in the first place? The only removal limit / PeriodicRemove statement we configure is at the per-job level, shown below:

PeriodicRemove = (JobStatus == 1 && NumJobStarts > 0) || ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)

I cannot replicate this behaviour in my testing, and I cannot find any reason why the job went on hold.
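
For context, a query along these lines should show the hold details on the held jobs (or the same via condor_history once SYSTEM_PERIODIC_REMOVE has cleared them); the attribute list below is just a guess at what is relevant:

  > condor_q -constraint 'JobStatus == 5' -af:h ClusterId ProcId HoldReasonCode HoldReasonSubCode RequestMemory MemoryProvisioned LastHoldReason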

Researching the relevant ClassAds, I see:

MemoryProvisioned

The amount of memory in MiB allocated to the job. With statically-allocated slots, it is the amount of memory space allocated to the slot. With dynamically-allocated slots, it is based upon the job attribute RequestMemory, but may be larger due to the minimum given to a dynamic slot.

At our site we dynamically assign our slots, and the requested memory for this job is "RequestMemory = 4096". I find this even more perplexing as it is a very rare issue: over 90% of the jobs run well and complete, with the same job type, same VO and same config. Any assistance debugging this issue will be gratefully received.
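
(Presumably the 4100 in the hold message is just the 4096 request rounded up when the dynamic slot is carved out; assuming the stock knob name, the rounding expression can be checked on an execute node with

  > condor_config_val MODIFY_REQUEST_EXPR_REQUESTMEMORY

though that alone would not explain why the job is being put on hold.)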

Many thanks,

*Thomas Birkett*

Senior Systems Administrator

Scientific Computing Department

Science and Technology Facilities Council (STFC)

Rutherford Appleton Laboratory, Chilton, Didcot
OX11 0QX



