
Re: [HTCondor-users] condor cgroup hard setting related queries



Hello Experts,

Any inputs on this issue?

Thanks & Regards,
Vikrant Aggarwal


On Sat, Feb 27, 2021 at 9:58 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Hello Experts,

My query is related to another discussion [1] which happened recently. I raised this issue [2] earlier as well, and now we have hit it again.

Relevant logs:

# condor_q -held -af:h HoldReasonCode HoldReasonSubCode JobRunCount HoldReasonCode
HoldReasonCode HoldReasonSubCode JobRunCount HoldReasonCode
undefined undefined 1 undefined

From executor node logs:

02/25/21 14:16:49 (pid:3029933) Job was held due to OOM event: Job has gone over memory limit of 4726 megabytes. Peak usage: 4712 megabytes.
02/25/21 14:16:49 (pid:3029933) Got SIGQUIT. Performing fast shutdown.

From sched log file:

02/25/21 14:21:49 (pid:3088280) Shadow pid 3087586 for job 8969151.0 exited with status 112
02/25/21 14:21:49 (pid:3088280) Putting job 8969151.0 on hold
02/25/21 14:21:49 (pid:3088280) Scheduler::WriteHoldToUserLog(): Failed to get HoldReason from job 8969151.0

From the recent discussion, it seems we have to back the cgroup hard setting with SYSTEM_PERIODIC_HOLD so that a hold reason and codes are returned to the submit node. Is that the right understanding?

All tests were with partitionable slots. We have expressions in place to guarantee a minimum memory per core (sketched below). The size of the test file was 8GB.
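Our minimum memory-per-core enforcement is along the lines of the following startd-side sketch; the numbers and the exact knob usage here are illustrative, not our production config:

# Illustrative only: round each dynamic-slot memory request up to at least
# 4096 MB per requested core (4096 is a placeholder value).
MODIFY_REQUEST_EXPR_REQUESTMEMORY = max({ quantize(RequestMemory, {128}), 4096 * RequestCpus })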

## Test with condor 8.5.8 (Dev Release) and RHEL 6.10 (I know cgroup doesn't work perfectly with RHEL 6)

Test 1: A simple bash script writing directly to /dev/shm. The job went into held status with all relevant HoldReason codes without using SYSTEM_PERIODIC_HOLD on the worker node.
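The Test 1 script was roughly of the following shape (the exact commands here are illustrative):

#!/bin/bash
# Write an 8GB file straight into /dev/shm; tmpfs pages are charged to the
# job's cgroup, so this should push the job over its memory limit.
dd if=/dev/zero of=/dev/shm/oom_test.$$ bs=1M count=8192
rc=$?
rm -f /dev/shm/oom_test.$$
exit $rc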

$ condor_q -held -af:h HoldReasonCode HoldReasonSubCode JobRunCount HoldReasonCode
HoldReasonCode HoldReasonSubCode JobRunCount HoldReasonCode
34             0                 1          34

Test 2: Instead of writing directly to /dev/shm, create the file in the scratch directory and then copy it to /dev/shm. In this case, the job completes instead of going into held status, but it only copies a partial file; the exit code of the job was 1.
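Test 2 was the same idea staged through the scratch directory, roughly (again, the exact commands are illustrative):

#!/bin/bash
# Create the 8GB file in the HTCondor scratch directory (on disk) first,
# then copy it into /dev/shm, where it counts against the job's memory.
dd if=/dev/zero of="$_CONDOR_SCRATCH_DIR/oom_test" bs=1M count=8192
cp "$_CONDOR_SCRATCH_DIR/oom_test" /dev/shm/oom_test.$$
rc=$?
rm -f /dev/shm/oom_test.$$ "$_CONDOR_SCRATCH_DIR/oom_test"
exit $rc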

## Test with condor 8.8.5 (Stable release) and CentOS Linux release 7.9.2009

Test 1 & Test 2: In both cases the message below was reported in the slot log file, but the job stayed in the running state indefinitely until manual action was taken.

Spurious OOM event, usage is 2, slot size is 5399 megabytes, ignoring OOM (read 8 bytes)


Questions:

- If we really need to use SYSTEM_PERIODIC_HOLD together with the cgroup hard setting, what would be the right expression for partitionable slots? (A sketch of what I have in mind is after this list.)
- Why isn't the job on the CentOS 7 node marked as either completed or held when it breaches the memory limit, the way it was in the RHEL 6 setup?
- Why are the hold reason codes completely empty (undefined) in some cases, while in others they are returned successfully from the executor node?
- Is it okay to use WANT_HOLD and SYSTEM_PERIODIC_HOLD together? We are currently using WANT_HOLD to hold jobs if they run longer than the stipulated time.
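For the first question, the kind of schedd-side expression I would try is sketched below; this is untested, and the choice of attributes (MemoryUsage vs. RequestMemory) for dynamic slots is an assumption on my part:

# Sketch only: hold a running job once its measured memory usage exceeds
# what it requested (both values are in megabytes in the job ClassAd).
# MemoryUsage can be undefined early in the job's life, in which case the
# expression evaluates to undefined and no hold is applied.
SYSTEM_PERIODIC_HOLD = (JobStatus == 2) && (MemoryUsage > RequestMemory)
SYSTEM_PERIODIC_HOLD_REASON = "Job exceeded its requested memory"
SYSTEM_PERIODIC_HOLD_SUBCODE = 34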

[1] https://www-auth.cs.wisc.edu/lists/htcondor-users/2021-February/msg00123.shtml
[2] https://www-auth.cs.wisc.edu/lists/htcondor-users/2019-August/msg00064.shtml


Thanks & Regards,
Vikrant Aggarwal