
[HTCondor-users] cgroups: shares failed to be set & overbooking of a node



Hi,

we have observed two issues with cgroups for managing jobs and would
like to ask whether somebody has seen something similar as well:

- every now and then a job gets no cgroup shares set, resulting in
larger (relative) CPU and memory shares than intended.
E.g., a CPU share of ~20.4% for an 8-core job on a 48-core node, while
its share should be 16.7% (and likewise for the memory shares)
[details below]

Apparently, Condor could not apply the cgroup limits for such a job, so
the node's defaults got used [SL 6.7, Condor 8.4.8].

The problem appears to be transient, but such jobs keep showing up every
now and then on our nodes.
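Since the failed commits show up in the starter logs (see [5]), such
jobs can be spotted by grepping for them (a quick sketch; log paths as
on our nodes):

> # list starter logs containing failed cgroup-limit commits
> grep -l "Unable to commit" /var/log/condor/StarterLog.slot1_*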


- on another node, we observed an overbooking: seven 8-core jobs (i.e.,
56 requested cores) on a 48-core machine, resulting in relative shares
of ~14.3% each [6]

Here, the node had one longer-running job and had otherwise run empty
following problems with our schedd [7].
After the scheduler was restarted, the node's startd accepted new jobs
but (my guess) may not have taken the still-running job into account.
So I would assume that the overbooking was just a glitch - or was it?
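That the node was overbooked can already be seen from the cgroup
directories alone (a minimal sketch; each job requested 8 cores per [3]):

> # number of job cgroups times requested cores vs. physical cores
> ls -d /cgroup/cpu/htcondor/condor_var_lib_condor_execute_slot1_* | wc -l
7
> echo $((7 * 8)); nproc
56
48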



Cheers,
  Thomas

---------------------

[details]

The general htcondor cgroup has a share of 1024, i.e. 100%, since it is
the only group on the node [1,2].
Within that cgroup, five jobs had a CPU share of 800 each, while the one
job in dynamic slot1_2 had the default share of 1024. Out of the
5*800 + 1024 = 5024 total shares, it thus got ~20.4% vs. ~15.9% for each
of the other jobs on this 48-core node.
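For reference, the relative weights can be recomputed directly from the
cpu.shares values shown in [1] (a quick awk sketch):

> cat /cgroup/cpu/htcondor/condor_var_lib_condor_execute_*/cpu.shares \
    | awk '{ v[NR] = $1; sum += $1 }
           END { for (i = 1; i <= NR; i++)
                   printf "%4d -> %.1f%%\n", v[i], 100 * v[i] / sum }'
 800 -> 15.9%
1024 -> 20.4%
 800 -> 15.9%
 800 -> 15.9%
 800 -> 15.9%
 800 -> 15.9%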

The jobs had been submitted via an ARC CE, each requesting 8 cores [3].
With the six jobs on the node, each should have gotten 800 shares
(apparently 100 per requested core), i.e. a nominal CPU share of
800 / (6*800) = 16.7% per job.

While the starter tried to set the cgroup limits [5], it failed for each
resource with '50016 Invalid argument'. So far I have found only this
question relating to Condor

https://www-auth.cs.wisc.edu/lists/htcondor-users/2014-November/threads.shtml#00023

and a bug report on the same error message for libvirt - but I do not
see an obvious relationship to Condor, or do you?
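If I read the libcgroup headers correctly, 50016 is ECGOTHER, which
would mean the 'Invalid argument' (EINVAL) comes from the underlying
kernel write rather than from libcgroup itself. That could be tested
outside of Condor by writing the value by hand (a sketch, using the
slot1_2 group from [1]):

> # if the kernel rejects the value, this write should fail with EINVAL too
> echo 800 > /cgroup/cpu/htcondor/condor_var_lib_condor_execute_slot1_2\@batch0209.desy.de/cpu.shares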



[1]
> cat /cgroup/cpu/htcondor/condor_var_lib_condor_execute_slot1_2\@batch0209.desy.de/cpu.shares
1024

> cat /cgroup/cpu/htcondor/condor_var_lib_condor_execute_*/cpu.shares
800
1024
800
800
800
800

[2]
> cat /cgroup/cpu/htcondor/cpu.shares
1024

> cat /cgroup/cpu/cpu.shares
1024

[3]
> cat
/var/lib/condor/execute/dir_3388323/VUaNDm1B0FpnvxDnJpv7dLGoABFKDmABFKDmNaWKDmCBFKDmEzacam.diag
runtimeenvironments=ENV/GLITE;ENV/PROXY;
nodename=batch0209.desy.de
Processors=8

[4]
drwxr-xr-x 2 root root 0 Oct 17 10:21 condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:22 condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:23 condor_var_lib_condor_execute_slot1_3@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:23 condor_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:23 condor_var_lib_condor_execute_slot1_5@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:24 condor_var_lib_condor_execute_slot1_6@xxxxxxxxxxxxxxxxx


[5]
> grep "10/17/16 10:22" /var/log/condor/StarterLog.slot1_2
10/17/16 10:22:34 (pid:3394100)
******************************************************
10/17/16 10:22:34 (pid:3394100) ** condor_starter (CONDOR_STARTER)
STARTING UP
10/17/16 10:22:34 (pid:3394100) ** /usr/sbin/condor_starter
10/17/16 10:22:34 (pid:3394100) ** SubsystemInfo: name=STARTER
type=STARTER(8) class=DAEMON(1)
10/17/16 10:22:34 (pid:3394100) ** Configuration: subsystem:STARTER
local:<NONE> class:DAEMON
10/17/16 10:22:34 (pid:3394100) ** $CondorVersion: 8.4.8 Jun 30 2016
BuildID: 373513 $
10/17/16 10:22:34 (pid:3394100) ** $CondorPlatform: x86_64_RedHat6 $
10/17/16 10:22:34 (pid:3394100) ** PID = 3394100
10/17/16 10:22:34 (pid:3394100) ** Log last touched 10/17 03:15:29
10/17/16 10:22:34 (pid:3394100)
******************************************************
10/17/16 10:22:34 (pid:3394100) Using config source:
/etc/condor/condor_config
10/17/16 10:22:34 (pid:3394100) Using local config sources:
10/17/16 10:22:34 (pid:3394100)    /etc/condor/config.d/00worker.conf
10/17/16 10:22:34 (pid:3394100)    /etc/condor/config.d/01grid.conf
10/17/16 10:22:34 (pid:3394100)    /etc/condor/config.d/20rebooter.conf
10/17/16 10:22:34 (pid:3394100)    /etc/condor/condor_config.local
10/17/16 10:22:34 (pid:3394100) config Macros = 100, Sorted = 99,
StringBytes = 3219, TablesBytes = 3672
10/17/16 10:22:34 (pid:3394100) CLASSAD_CACHING is OFF
10/17/16 10:22:34 (pid:3394100) Daemon Log is logging: D_ALWAYS D_ERROR
10/17/16 10:22:34 (pid:3394100) SharedPortEndpoint: waiting for
connections to named socket 7920_1b3b_14789
10/17/16 10:22:34 (pid:3394100) DaemonCore: command socket at
<131.169.160.40:9620?addrs=131.169.160.40-9620&noUDP&sock=7920_1b3b_14789>
10/17/16 10:22:34 (pid:3394100) DaemonCore: private command socket at
<131.169.160.40:9620?addrs=131.169.160.40-9620&noUDP&sock=7920_1b3b_14789>
10/17/16 10:22:34 (pid:3394100) Communicating with shadow
<131.169.223.111:9620?addrs=131.169.223.111-9620&noUDP&sock=1950804_a55a_233>
10/17/16 10:22:34 (pid:3394100) Submitting machine is "grid-arcce1.desy.de"
10/17/16 10:22:34 (pid:3394100) setting the orig job name in starter
10/17/16 10:22:34 (pid:3394100) setting the orig job iwd in starter
10/17/16 10:22:34 (pid:3394100) Chirp config summary: IO false, Updates
false, Delayed updates true.
10/17/16 10:22:34 (pid:3394100) Initialized IO Proxy.
10/17/16 10:22:34 (pid:3394100) Done setting resource limits
10/17/16 10:22:34 (pid:3394100) File transfer completed successfully.
10/17/16 10:22:35 (pid:3394100) Job 261123.0 set to execute immediately
10/17/16 10:22:35 (pid:3394100) Starting a VANILLA universe job with ID:
261123.0
10/17/16 10:22:35 (pid:3394100) IWD: /var/lib/condor/execute/dir_3394100
10/17/16 10:22:35 (pid:3394100) Output file:
/var/lib/condor/execute/dir_3394100/_condor_stdout
10/17/16 10:22:35 (pid:3394100) Error file:
/var/lib/condor/execute/dir_3394100/_condor_stdout
10/17/16 10:22:35 (pid:3394100) Renice expr "0" evaluated to 0
10/17/16 10:22:35 (pid:3394100) About to exec
/var/lib/condor/execute/dir_3394100/condor_exec.exe
10/17/16 10:22:35 (pid:3394100) Running job as user cmsger014
10/17/16 10:22:35 (pid:3394100) Create_Process succeeded, pid=3394104
10/17/16 10:22:35 (pid:3394100) Limiting (soft) memory usage to
21072183296 bytes
10/17/16 10:22:35 (pid:3394100) Limiting (hard) memory usage to
143033016320 bytes
10/17/16 10:22:35 (pid:3394100) Unable to commit memory soft limit for
htcondor/condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxxx : 50016
Invalid argument
10/17/16 10:22:35 (pid:3394100) Limiting memsw usage to 143033020416 bytes
10/17/16 10:22:35 (pid:3394100) Unable to commit memsw limit for
htcondor/condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxxx : 50016
Invalid argument
10/17/16 10:22:35 (pid:3394100) Unable to commit CPU shares for
htcondor/condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxxx: 50016
Invalid argument


[6]
[root@batch0216 ~]# cat
/cgroup/cpu/htcondor/condor_var_lib_condor_execute_slot1_*/cpu.shares
800
800
800
800
800
800
800
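(For completeness: 7 * 800 = 5600 total shares, so each of the seven
jobs gets 800/5600 = ~14.3%, matching the number quoted above.)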

[7]
> ls -call /cgroup/cpu/htcondor/ | sort -k8
drwxr-xr-x 2 root root 0 Oct 16 01:13 condor_var_lib_condor_execute_slot1_8@xxxxxxxxxxxxxxxxx
drwxr-xr-x 9 root root 0 Oct 17 03:15 .
-rw-rw-r-- 1 root root 0 Oct 17 03:15 tasks
drwxr-xr-x 2 root root 0 Oct 17 10:21 condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:22 condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:23 condor_var_lib_condor_execute_slot1_3@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:23 condor_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:23 condor_var_lib_condor_execute_slot1_5@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:24 condor_var_lib_condor_execute_slot1_6@xxxxxxxxxxxxxxxxx
