Re: [HTCondor-users] Limiting memory used on the worker node with c-groups

On 4/24/20 5:16 PM, tpdownes@xxxxxxxxx wrote:

When something in the universe goes wrong with HTCondor and CGroups, I feel a little twitch. When you say the processes are in the "deferred" state, do you mean they are in the "D" state according to ps? Or do you mean the actual literal "job deferral" options in "htcondor"?

Hello Tom,

Thank you very much. You are right, I misused the term "deferred";
I was talking about the "D" state.


A common reason for a job getting stuck in D is a bad / overloaded remote filesystem (NFS, etc.). Is that a possibility here?

Using the command from the article you mention, I see lines such as:

ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | grep sgmali
30138 30333 sgmali0+ D    87.4 aliroot       mem_cgroup_oom_synchronize
30341 30435 sgmali0+ D     0.3 perl          mem_cgroup_oom_synchronize
12455 30605 sgmali0+ D     0.0 perl          mem_cgroup_oom_synchronize
12594 30869 sgmali0+ D     0.0 perl          mem_cgroup_oom_synchronize
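
To see at a glance how many processes are stuck in each kernel wait
channel, output like the above can be summarized with a small filter.
This is only a sketch: the function name count_d_state is made up for
illustration, and it assumes the same ps column order as in the command
above (process state in the fourth column, wait channel in the last).

```shell
# Hypothetical helper: reads `ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32`
# output on stdin and prints "<count> <wchan>" for every wait channel
# that has at least one process in uninterruptible sleep (state D).
count_d_state() {
  # $4 is the STAT column; D, D+, Ds, ... all start with "D".
  # $NF is the last column (WCHAN), which stays correct even if the
  # command name contains spaces.
  awk '$4 ~ /^D/ { n[$NF]++ } END { for (w in n) print n[w], w }' |
    sort -rn
}
```

It could be run as
"ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | count_d_state";
a large count next to mem_cgroup_oom_synchronize would point at the
cgroup OOM handling rather than at a remote filesystem.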

FYI: even if you didn't understand my presentation, you made the type of choice I recommend. Use "soft" but lie a bit about how much RAM you have. It allows more jobs to match while still ensuring that CGroups can do its job.
It is always more difficult to fully understand slides if you do not
hear the presenter :-) I hope there is no perceived offense here.


a) these processes in "D" state started to appear after I activated the
   "soft" mode on workers, so I think there is a link.

b) I do not exclude the possibility that the jobs themselves are
   reacting badly to a signal. These are production jobs of the
   LHC ALICE VO, and it is the only VO I run, so I have no other
   workload to compare against.

c) Meanwhile, I modified one worker to use the "hard" mode and it seems
   to behave OK: I did not find any removed jobs on this worker in the
   last 24h or so. This is one point I did not understand: what is the
   potential issue with the "hard" mode?

Thank you.


