
Re: [HTCondor-users] nodes without cgrouped jobs?



Hi all,

nodes keep appearing where jobs are not placed into their own slice but
sit directly below the main condor cgroup. E.g., one node currently has
20 jobs running but only six dedicated sub-slices in the CPU cgroup
below the main condor slice [1].
I.e., it looks like jobs get started without a cgroup being created.
Even more weird, there are slices that do not have any PIDs assigned -
e.g., on this node slot1_16 has a sub-slice created [2], but it does not
look like any starter/exec actually got started inside that sub-group
(they ended up under the parent condor group again).
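
For what it's worth, this is roughly how I cross-check where the
starters' processes actually live - just a sketch, assuming the starters
can be matched with pgrep on 'condor_starter' plus our grid-arcce
hostname as in the ps output further down:

> # for each running starter, print the cpu,cpuacct cgroup its PID sits in
> for p in $(pgrep -f 'condor_starter.*grid-arcce'); do grep cpuacct /proc/$p/cgroup /dev/null; done

A path that ends in .../condor.service without a slot sub-group would
mean that starter escaped its per-slot cgroup.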

Side note: on some of these nodes we have bind-mounted (ro)
/sys/fs/cgroup into a Singularity container.
However, that should(?!) not affect any Condor process running outside
this container in the root namespace (I assume...) - at least the
bind mount does not show up in the root namespace [3].
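
A quick way to double-check that assumption (a sketch, taking the
starter PID 48029 from [2.b] below as an example) would be to compare
mount namespaces:

> # the same mnt:[...] id for both means the starter shares the root mount namespace
> readlink /proc/1/ns/mnt /proc/48029/ns/mnt

If the ids match, the starter sees the same /sys/fs/cgroup as PID 1 and
the container's ro bind mount should be irrelevant to it.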



Cheers,
  Thomas

[1]
> ls -1 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/ | grep slot | wc -l
6

> wc -l /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_*/tasks
 0 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_14@xxxxxxxxxxxxxxxxx/tasks
 0 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_16@xxxxxxxxxxxxxxxxx/tasks
53 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_19@xxxxxxxxxxxxxxxxx/tasks
18 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxx/tasks
 0 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_7@xxxxxxxxxxxxxxxxx/tasks
18 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_9@xxxxxxxxxxxxxxxxx/tasks
89 total


> ps axf | grep starter | grep grid-arcce | wc -l
20

> wc -l /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/tasks
1078 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/tasks

[2]
> cat /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_16@xxxxxxxxxxxxxxxxx/tasks
> echo $?
0

[2.b]
> ps axf | grep -A 5 slot1_16
...
48029 ?        Ss     0:00      \_ condor_starter -f -a slot1_16 grid-arcce0.desy.de
48034 ?        Ss     0:00      |   \_ /bin/bash -l /var/lib/condor/execute/dir_48029/condor_exec.exe
48089 ?        S      0:00      |       \_ /usr/bin/time -o /var/lib/condor/execute/dir_48029/5d0KDmvIspsnntDnJpfbFDFoABFKDmABFKDmjFbbDmABFKDmVI3j3m.diag


[2.c]
> grep 48034 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/tasks
48034

> grep 48034 /sys/fs/cgroup/cpu\,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_16@xxxxxxxxxxxxxxxxx/tasks

> echo $?
0
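
Just as an experiment (not meant as a fix; a sketch assuming plain
cgroup v1 semantics, where writing a PID into a group's tasks file moves
that task), one could try to re-attach the escaped shell 48034 by hand
and see whether the slot1_16 sub-group is usable at all:

> # move the task into the slot1_16 group, then check it arrived there
> echo 48034 > /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_16@xxxxxxxxxxxxxxxxx/tasks
> grep 48034 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_16@xxxxxxxxxxxxxxxxx/tasks

If the PID shows up there afterwards, the cgroup itself seems fine and
only the starter's initial placement went wrong.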

[3]
> findmnt | grep "\["
├─/home                                    /dev/sda6[/home] ext4  rw,relatime,data=ordered
└─/tmp                                     /dev/sda6[/tmp]  ext4  rw,relatime,data=ordered


On 2018-06-07 20:41, Todd Tannenbaum wrote:
> On 6/7/2018 10:44 AM, Thomas Hartmann wrote:
>> Hi all,
>>
>> I just noticed, that a few of our nodes have their jobs not confined in
>> cgroups - i.e., no condor slice at all [1]. These nodes are setup the
>> same and on the same release [2] as the majority of the nodes where the
>> jobs are properly cgrouped.
>> We are going to drain and reboot these nodes, but maybe somebody has an
>> idea, what might have gone wrong here?
>>
>> Cheers,
>>   Thomas
>>
> 
> Hi Thomas,
> 
> Unlike some others on this list, I am not a cgroup expert, but what does "condor_config_val BASE_CGROUP" have to say on these two machines?  The default value is "htcondor", so to poke around in /sys/fs/cgroup, I would not be going into system.slice subdirectory (systemd settings), but would do something like:
> 
> # ls /sys/fs/cgroup/cpu,cpuacct/htcondor/condor_var_lib_condor_execute_slot1_slot1_*
> 
> Hope the above helps
> Todd
> 
> 
> 
