Re: [HTCondor-users] nodes without cgrouped jobs?
- Date: Wed, 20 Jun 2018 12:54:48 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] nodes without cgrouped jobs?
On 6/20/2018 9:37 AM, Thomas Hartmann wrote:
> Hi all,
> nodes keep appearing where jobs are not placed into their
> own slice but sit directly below the main condor cgroup. E.g., one node
> currently has 20 jobs running but only six dedicated sub-slices in the
> CPU cgroup below the main condor slice.
My response to the above is the same as last time (repeated below for convenience):
Unlike some others on this list, I am not a cgroup expert, but what does "condor_config_val BASE_CGROUP" have to say on the execute machines where you suspect problems? The default value is "htcondor", so to poke around in /sys/fs/cgroup, I would not be going into system.slice subdirectory (systemd settings), but would do something like:
# ls /sys/fs/cgroup/cpu,cpuacct/htcondor/condor_var_lib_condor_execute_slot1_slot1_*
So, assuming "condor_config_val BASE_CGROUP" reports the default value of "htcondor", I don't understand what you are trying to learn by looking at the "/sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/" subdirectory (managed by systemd?) rather than the directory HTCondor is managing, which by default would be the "/sys/fs/cgroup/cpu,cpuacct/htcondor/" subdirectory.
If the value of "condor_config_val BASE_CGROUP" is something other than "htcondor", you should find the person who changed that value in your configuration and ask them what they were trying to accomplish. Or get rid of that change and go with HTCondor's defaults.
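For what it's worth, a check along those lines could be scripted roughly as below. This is only a sketch under assumptions: a cgroup v1 layout under /sys/fs/cgroup/cpu,cpuacct, per-slot directory names that contain "slot" as in the ls example above, and a helper name (count_slot_tasks) that I made up for illustration; it is not an HTCondor tool.

```shell
#!/bin/sh
# Sketch: print each per-slot cgroup directory below HTCondor's base cgroup
# together with the number of PIDs listed in its "tasks" file.
# count_slot_tasks is a hypothetical helper, not part of HTCondor.
# Usage: count_slot_tasks <controller-root> <base-cgroup>
count_slot_tasks() {
    root=$1
    base=$2
    for dir in "$root/$base"/*slot*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -d "$dir" ] || continue
        if [ -r "$dir/tasks" ]; then
            # tr strips the padding some wc implementations add.
            n=$(wc -l < "$dir/tasks" | tr -d ' ')
        else
            n=0
        fi
        printf '%s %s\n' "$(basename "$dir")" "$n"
    done
}

# Only query the real system when HTCondor is installed on this machine.
if command -v condor_config_val >/dev/null 2>&1; then
    base=$(condor_config_val BASE_CGROUP)
    count_slot_tasks /sys/fs/cgroup/cpu,cpuacct "$base"
fi
```

A slot directory that reports 0 tasks while a starter for that slot is running would be the situation described in the quoted message.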
Hope the above helps,
> I.e., it looks like jobs get started without a cgroup getting created.
> Even weirder, there are slices that do not have any PIDs assigned:
> e.g., on this node slot1_16 has a sub-slice created, but it does not
> look like any starter/exec actually got started in the created
> sub-group (it ended up under the parent condor group again).
> Side note: on some of these nodes we have (ro) bind-mounted
> /sys/fs/cgroup into a Singularity container.
> However, that should(?!) not affect any Condor process running outside
> this container below the root namespace (I assume...) - at least the
> bind-mount is not appearing in the root namespace.
>> ls -1 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/ | grep slot | wc -l
>> wc -l
> 89 total
>> ps axf | grep starter | grep grid-arcce | wc -l
>> wc -l /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/tasks
> 1078 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/tasks
>> echo $?
>> ps axf | grep -A 5 slot1_16
> 48029 ? Ss 0:00 \_ condor_starter -f -a slot1_16
> 48034 ? Ss 0:00 | \_ /bin/bash -l
> 48089 ? S 0:00 | \_ /usr/bin/time -o
>> grep 48034 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/tasks
>> grep 48034
>> echo $?
>> findmnt | grep "\["
> ├─/home /dev/sda6[/home] ext4
> ├─/tmp /dev/sda6[/tmp] ext4
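A more direct way to see where a given PID ended up, instead of grepping individual tasks files, is /proc/<pid>/cgroup, which lists the cgroup path for each hierarchy the process belongs to. A rough sketch follows; the helper name pid_cpu_cgroup is hypothetical, and it assumes cgroup v1 lines of the form hierarchy-id:controller-list:path.

```shell
#!/bin/sh
# Sketch: print the cgroup path of a PID in the hierarchy that carries the
# "cpu" controller (cgroup v1 format assumed).
# pid_cpu_cgroup is a hypothetical helper, not part of HTCondor.
# Usage: pid_cpu_cgroup <pid> [proc-root]
pid_cpu_cgroup() {
    pid=$1
    proc=${2:-/proc}
    # Match "cpu" as a whole entry in the comma-separated controller list,
    # so that "cpuset" or "cpuacct" alone do not count.
    awk -F: '
        {
            n = split($2, ctrl, ",")
            for (i = 1; i <= n; i++) {
                if (ctrl[i] == "cpu") {
                    # Drop the first two fields; the path may contain colons.
                    sub(/^[^:]*:[^:]*:/, "")
                    print
                    exit
                }
            }
        }
    ' "$proc/$pid/cgroup"
}
```

Running "pid_cpu_cgroup 48034" for the bash process shown above would then report which cgroup that job actually landed in.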
> On 2018-06-07 20:41, Todd Tannenbaum wrote:
>> On 6/7/2018 10:44 AM, Thomas Hartmann wrote:
>>> Hi all,
>>> I just noticed that a few of our nodes have their jobs not confined in
>>> cgroups, i.e., no condor slice at all. These nodes are set up the
>>> same and on the same release as the majority of the nodes where the
>>> jobs are properly cgrouped.
>>> We are going to drain and reboot these nodes, but maybe somebody has an
>>> idea, what might have gone wrong here?
>> Hi Thomas,
>> Unlike some others on this list, I am not a cgroup expert, but what does "condor_config_val BASE_CGROUP" have to say on these two machines? The default value is "htcondor", so to poke around in /sys/fs/cgroup, I would not be going into system.slice subdirectory (systemd settings), but would do something like:
>> # ls /sys/fs/cgroup/cpu,cpuacct/htcondor/condor_var_lib_condor_execute_slot1_slot1_*
>> Hope the above helps
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685