
Re: [HTCondor-users] nodes without cgrouped jobs?



Hi Todd,

the base group is '/system.slice/condor.service', with the cgroup mounts
under '/sys/fs/cgroup' (CentOS 7 with 3.10.0 kernels).
That is where I checked for the condor (sub)slices, e.g., [1,2], where I
found slices that have no tasks assigned but whose names correspond to
a running job. I also saw running jobs without a corresponding cgroup
slice under the cpu* or memory controllers.
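
For reference, the check for slices without tasks can be sketched as a
small shell function. This is only a sketch under assumptions: the name
list_empty_slices is made up for illustration, it assumes cgroup v1 (one
'tasks' file per cgroup), and it assumes the sub-slices carry 'slot' in
their directory name as in the listings [1]. It counts lines with wc
instead of testing the file size, because files under /sys/fs/cgroup
typically report a size of 0 even when they list PIDs.

```shell
#!/bin/sh
# list_empty_slices is a hypothetical helper, not part of HTCondor.
# It prints every slot sub-cgroup below BASE_DIR whose cgroup v1
# "tasks" file lists no PIDs.
list_empty_slices() {
    base=$1
    for tasks in "$base"/*slot*/tasks; do
        # Skip the unexpanded glob when no sub-cgroup matches.
        [ -f "$tasks" ] || continue
        # Count PIDs by lines; cgroup pseudo-files do not report a real size.
        n=$(wc -l < "$tasks")
        if [ "$n" -eq 0 ]; then
            dirname "$tasks"
        fi
    done
}
```

It would be called with the base group from above, e.g.
list_empty_slices /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service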

I am just worried about the inconsistent behaviour, as most jobs are
running fine with their PID tree confined in cgroup resource slices.
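
The opposite direction, asking a single process which cgroup it lives
in, can be sketched via /proc/<pid>/cgroup, where each cgroup v1 line
has the form "hierarchy-ID:controller-list:path". Again only a sketch:
the function name cgroup_path_for is hypothetical, and it takes the file
as an argument so that it can also be tried on a saved copy.

```shell
#!/bin/sh
# cgroup_path_for is a hypothetical helper, not part of HTCondor.
# It prints the cgroup path of CONTROLLER from a /proc/<pid>/cgroup
# style FILE and prints nothing if the controller is not listed.
cgroup_path_for() {
    file=$1
    controller=$2
    awk -F: -v c="$controller" '{
        # The second field is a comma-separated controller list.
        n = split($2, ctrls, ",")
        for (i = 1; i <= n; i++) {
            if (ctrls[i] == c) {
                # Drop "ID:controllers:" and keep the rest as the path.
                sub(/^[^:]*:[^:]*:/, "")
                print
                exit
            }
        }
    }' "$file"
}
```

For the job shell in [2.b] this would be
cgroup_path_for /proc/48034/cgroup cpu
and a path ending in condor.service rather than in a slot sub-slice
would point to a job that is not confined.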

Cheers,
  Thomas

On 2018-06-20 19:54, Todd Tannenbaum wrote:
> On 6/20/2018 9:37 AM, Thomas Hartmann wrote:
>> Hi all,
>>
>> nodes keep appearing where jobs are not placed into their own slice
>> but sit directly below the main condor cgroup. E.g., one node currently
>> has 20 jobs running but only six dedicated sub-slices in the
>> CPU cgroup below the main condor slice [1].
> 
> Hi Thomas,
> 
> My response to the above is the same as last time (repeated below for convenience):
> 
> Unlike some others on this list, I am not a cgroup expert, but what does "condor_config_val BASE_CGROUP" have to say on the execute machines where you suspect problems?  The default value is "htcondor", so to poke around in /sys/fs/cgroup, I would not be going into system.slice subdirectory (systemd settings), but would do something like:
> 
>   # ls /sys/fs/cgroup/cpu,cpuacct/htcondor/condor_var_lib_condor_execute_slot1_slot1_*
> 
> So.... assuming "condor_config_val BASE_CGROUP" reports the default value of "htcondor", then I don't understand what you are trying to understand by looking at the "/sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/" subdirectory (managed by systemd?) vs the directory HTCondor is managing which would be (by default) "/sys/fs/cgroup/cpu,cpuacct/htcondor/" subdirectory.
> 
> If the value of "condor_config_val BASE_CGROUP" is something other than "htcondor", you should find the person who changed that value in your configuration and ask them what they were trying to accomplish.  Or get rid of that change and go with HTCondor's defaults.
> 
> Hope the above helps,
> Todd
> 
> 
>> I.e., it looks like jobs get started without a cgroup being created.
>> Even weirder, there are slices that do not have any PIDs assigned:
>> e.g., on this node slot1_16 has a sub-slice created [2], but it does not
>> look like any starter/exec actually got started in the created
>> sub-group (it ended up under the parent condor group again).
>>
>> Side note: on some of these nodes we have (ro) bind-mounted
>> /sys/fs/cgroup into a Singularity container.
>> However, that should(?!) not affect any Condor process running outside
>> this container below the root namespace (I assume...) - at least the
>> bind-mount is not appearing in the root namespace [3]
>>
>>
>>
>> Cheers,
>>    Thomas
>>
>> [1]
>>> ls -1 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/ | grep slot | wc -l
>> 6
>>
>>> wc -l /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_*/tasks
>> 0 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_14@xxxxxxxxxxxxxxxxx/tasks
>> 0 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_16@xxxxxxxxxxxxxxxxx/tasks
>> 53 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_19@xxxxxxxxxxxxxxxxx/tasks
>> 18 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxx/tasks
>> 0 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_7@xxxxxxxxxxxxxxxxx/tasks
>> 18 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_9@xxxxxxxxxxxxxxxxx/tasks
>> 89 total
>>
>>
>>> ps axf | grep starter | grep grid-arcce | wc -l
>> 20
>>
>>> wc -l /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/tasks
>> 1078 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/tasks
>>
>> [2]
>>> cat /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_16@xxxxxxxxxxxxxxxxx/tasks
>>> echo $?
>> 0
>>
>> [2.b]
>>> ps axf | grep -A 5 slot1_16
>> ...
>> 48029 ?        Ss     0:00      \_ condor_starter -f -a slot1_16 grid-arcce0.desy.de
>> 48034 ?        Ss     0:00      |   \_ /bin/bash -l /var/lib/condor/execute/dir_48029/condor_exec.exe
>> 48089 ?        S      0:00      |       \_ /usr/bin/time -o /var/lib/condor/execute/dir_48029/5d0KDmvIspsnntDnJpfbFDFoABFKDmABFKDmjFbbDmABFKDmVI3j3m.diag
>>
>>
>> [2.c]
>>> grep 48034 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/tasks
>> 48034
>>
>>> grep 48034 /sys/fs/cgroup/cpu\,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_16@xxxxxxxxxxxxxxxxx/tasks
>>
>>> echo $?
>> 0
>>
>> [3]
>>> findmnt | grep  "\["
>> /home                                    /dev/sda6[/home] ext4 rw,relatime,data=ordered
>> /tmp                                     /dev/sda6[/tmp]  ext4 rw,relatime,data=ordered
>>
>>
>> On 2018-06-07 20:41, Todd Tannenbaum wrote:
>>> On 6/7/2018 10:44 AM, Thomas Hartmann wrote:
>>>> Hi all,
>>>>
>>>> I just noticed that a few of our nodes do not have their jobs confined in
>>>> cgroups, i.e., no condor slice at all [1]. These nodes are set up the
>>>> same and are on the same release [2] as the majority of the nodes, where
>>>> the jobs are properly cgrouped.
>>>> We are going to drain and reboot these nodes, but maybe somebody has an
>>>> idea what might have gone wrong here?
>>>>
>>>> Cheers,
>>>>    Thomas
>>>>
>>>
>>> Hi Thomas,
>>>
>>> Unlike some others on this list, I am not a cgroup expert, but what does "condor_config_val BASE_CGROUP" have to say on these two machines?  The default value is "htcondor", so to poke around in /sys/fs/cgroup, I would not be going into system.slice subdirectory (systemd settings), but would do something like:
>>>
>>> # ls /sys/fs/cgroup/cpu,cpuacct/htcondor/condor_var_lib_condor_execute_slot1_slot1_*
>>>
>>> Hope the above helps
>>> Todd
>>>
>>>
>>>
