
Re: [HTCondor-users] condor cgroup hard setting related queries



Hi Greg,

Thanks for your response. I don't think that in this case one job's OOM impacts the others. I did try the suggested setting, however, and it didn't make any difference.
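
For reference, I applied the knob roughly like the following on the execute node and then ran condor_reconfig (the config file path is only an example):

  # /etc/condor/config.d/99-oom-debug.conf   (example location)
  # do not ignore OOM events coming from the job's own (leaf) cgroup
  IGNORE_LEAF_OOM = false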

A job that tries to write to /dev/shm using cp fails with the following error; the job is marked as completed in the job history.

03/02/21 12:10:38 (pid:521437) Using wrapper /usr/local/sbin/os/condor_ldpreload_wrapper.sh to exec /spare/condor/dir_521437/condor_exec.exe
03/02/21 12:10:38 (pid:521437) Create_Process succeeded, pid=521648
03/02/21 12:10:38 (pid:521437) Limiting (soft) memory usage to 0 bytes
03/02/21 12:10:38 (pid:521437) Limiting memsw usage to 9223372036854775807 bytes
03/02/21 12:10:38 (pid:521437) Limiting (hard) memory usage to 5661261824 bytes
03/02/21 12:10:38 (pid:521437) Limiting memsw usage to 135080783872 bytes
03/02/21 12:12:27 (pid:521437) Process exited, pid=521648, status=1
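
(For context, this test is just a shell job that copies a large input file into /dev/shm; a rough sketch, with the path and file made up for illustration:

  #!/bin/sh
  # copy a file bigger than the slot's memory limit (about 5.3 GB in the log above) into tmpfs
  cp /spare/testdata/large_input.dat /dev/shm/large_input.dat
)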

A job that writes directly to /dev/shm goes into held status with the "Job has gone over the memory limit" message.

03/02/21 12:16:21 (pid:563863) Got SIGTERM. Performing graceful shutdown.
03/02/21 12:16:21 (pid:563863) ShutdownGraceful all jobs.
03/02/21 12:16:21 (pid:563863) Process exited, pid=564098, signal=9
03/02/21 12:16:21 (pid:563863) Last process exited, now Starter is exiting
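
(The "direct write" variant generates the data inside /dev/shm instead of copying an existing file in; again only a sketch, with the size picked purely for illustration:

  #!/bin/sh
  # write into tmpfs until the cgroup memory limit is exceeded
  dd if=/dev/zero of=/dev/shm/fill.dat bs=1M count=8192
)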

Question: in what scenarios can a held job fail to get the HoldReasonCode from the startd?

02/25/21 14:21:49 (pid:3088280) Shadow pid 3087586 for job 8969151.0 exited with status 112
02/25/21 14:21:49 (pid:3088280) Putting job 8969151.0 on hold
02/25/21 14:21:49 (pid:3088280) Scheduler::WriteHoldToUserLog(): Failed to get HoldReason from job 8969151.0
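
(For reference, the hold reason fields can be queried on the schedd with condor_q autoformat, e.g. for the job above:

  condor_q 8969151.0 -af HoldReasonCode HoldReasonSubCode HoldReason
)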

Thanks & Regards,
Vikrant Aggarwal


On Tue, Mar 2, 2021 at 10:13 PM Greg Thain <gthain@xxxxxxxxxxx> wrote:


On 3/2/21 3:52 AM, ervikrant06@xxxxxxxxx wrote:



## Test with condor 8.8.5 (Stable release) and CentOS Linux release 7.9.2009

Test 1 & Test 2: In both cases the message below was reported in the slot log file, but the job stayed in the running state indefinitely until manual action was taken.

Spurious OOM event, usage is 2, slot size is 5399 megabytes, ignoring OOM (read 8 bytes)


Hi Vikram:

I believe there were some bugs in cgroup OOM handling in older condor versions. Can you try with the setting

IGNORE_LEAF_OOM = false?

-greg



Questions:

- If we really need to use SYSTEM_PERIODIC_HOLD together with the cgroup hard setting, what would be the right expression for partitionable slots? (A sketch of what we mean follows after this list.)
- Why is the job on the CentOS 7 node either marked as completed or held even though it breaches the memory limit the same way it did on the RHEL 6 setup?
- Why are the hold reason codes completely empty in some cases, while in others they are returned successfully from the execute node?
- Is it okay to use WANT_HOLD and SYSTEM_PERIODIC_HOLD together? We are currently using WANT_HOLD to hold jobs that run longer than the stipulated time (also sketched below).
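
To make the first and last points concrete, here is a sketch of the kind of expressions we mean; the attribute choices and the 24-hour figure are placeholders, not a known-good configuration:

  # schedd side: hold running jobs whose measured memory (MB) exceeds what they requested
  SYSTEM_PERIODIC_HOLD = (JobStatus == 2) && (MemoryUsage > RequestMemory)
  SYSTEM_PERIODIC_HOLD_REASON = "Job exceeded its requested memory"

  # startd side: hold jobs running longer than the stipulated time (placeholder: 24 hours)
  MAX_JOB_RUNTIME = 24 * 60 * 60
  WANT_HOLD = (time() - JobCurrentStartDate) > $(MAX_JOB_RUNTIME)
  WANT_HOLD_REASON = "Job exceeded the maximum allowed run time"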

[1] https://www-auth.cs.wisc.edu/lists/htcondor-users/2021-February/msg00123.shtml
[2] https://www-auth.cs.wisc.edu/lists/htcondor-users/2019-August/msg00064.shtml


Thanks & Regards,
Vikrant Aggarwal

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/