Re: [HTCondor-users] pre-kill warning signals to jobs?



Hi Greg,

many thanks for the feedback - but yes, jobs/users can be *very* quick in growing their memory footprint :) (which is one of the problems: even the notebooks' native memory plugins are sometimes not much help, as they also trail behind an exponential memory growth)

The sub-cgroup sounds like a quite interesting idea for notebooks! We (Max & I) noticed that there is now sub-cgroup support in 23.10 with cgroups v2 - maybe it is really the way to go. But ad hoc I am not sure how to ensure that the kernel OOM killer targets the PIDs in a sub-cgroup rather than the parent group. Maybe give the job group more relaxed memory enforcement - and then be stricter with the PIDs in the sub-group :). We will have to check what the notebooks/Python actually fork and which PIDs (threads, with v2??) could be moved one level below.
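A rough configuration sketch of what I have in mind (the cgroup path "htcondor/job_123" and the limits are just illustrative assumptions, not what HTCondor actually lays out):

```shell
# Hypothetical cgroup v2 layout for a job's cgroup
JOB_CG=/sys/fs/cgroup/htcondor/job_123

# Create a child cgroup for the notebook kernel processes
mkdir "$JOB_CG/notebook"

# Enable the memory controller for children of the job cgroup
echo "+memory" > "$JOB_CG/cgroup.subtree_control"

# Keep the parent limit relaxed (e.g. 8 GiB) ...
echo $((8 * 1024 * 1024 * 1024)) > "$JOB_CG/memory.max"

# ... but cap the notebook sub-cgroup tighter (e.g. 6 GiB), so the
# kernel OOM killer fires inside the sub-cgroup first, leaving the
# parent job alive to notify the user
echo $((6 * 1024 * 1024 * 1024)) > "$JOB_CG/notebook/memory.max"

# Move the notebook kernel's PID into the sub-cgroup; with cgroup v2,
# cgroup.procs moves whole processes (individual threads would need a
# threaded subtree and cgroup.threads)
echo "$NOTEBOOK_PID" > "$JOB_CG/notebook/cgroup.procs"
```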

Something like that might also be useful for glideins or similar setups that try to overbook their internal startds. I.e., allow a pilot job to run with more relaxed memory enforcement, but then organize the payloads in sub-cgroups - optionally with stricter resource enforcement. Maybe something where an admin could delegate some sub-cgroup controller files to the user (like for reshuffling CPU shares within the glidein's overall share) but keep other controller files under admin control (to enforce certain memory behaviours).
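The delegation part could look roughly like this (again only a sketch: the path, the user "pilot", and the 4 GiB cap are assumptions; the general pattern of chown-ing selected cgroup v2 files follows the kernel's delegation model):

```shell
# Hypothetical pilot cgroup under admin (root) control
PILOT_CG=/sys/fs/cgroup/htcondor/glidein_42

mkdir "$PILOT_CG/payloads"
echo "+cpu +memory" > "$PILOT_CG/cgroup.subtree_control"

# Let the pilot user manage payload placement and reshuffle CPU
# weights within the glidein's overall share ...
chown pilot "$PILOT_CG/payloads" \
            "$PILOT_CG/payloads/cgroup.procs" \
            "$PILOT_CG/payloads/cpu.weight"

# ... but memory.max stays root-owned, so the admin-enforced memory
# limit cannot be raised from inside the glidein
echo $((4 * 1024 * 1024 * 1024)) > "$PILOT_CG/payloads/memory.max"
```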

Cheers,


----- Original Message -----
From: "Greg Thain via HTCondor-users" <htcondor-users@xxxxxxxxxxx>
To: htcondor-users@xxxxxxxxxxx
Cc: "Greg Thain" <gthain@xxxxxxxxxxx>
Sent: Wednesday, 20 March, 2024 15:22:54
Subject: Re: [HTCondor-users] pre-kill warning signals to jobs?

On 3/20/24 03:35, Thomas Hartmann wrote:
> Hi all,
>
> a not fully fermented idea, but is there a way in Condor for the 
> startd to send its job a signal on a predefined condition, e.g., for 
> something like a warning when memory utilization is getting near to 
> the requested limit?


Hi Thomas:

I like where you are going, but this may be hard to do with the tools we 
have today. Perhaps we need to ferment (and then even distill!) in 
order to get something useful to work.

Today, the startd can define a WANT_VACATE, and the job can define a 
custom soft-kill signal that will be first sent when WANT_VACATE is 
true. So, in theory, you could use these two to send some custom signal 
(SIGUSR1, maybe?). HOWEVER, a job can allocate memory very quickly, and 
there is a limit to how fast the startd sees the memory usage of the 
job. We'll still need a good way to notify the user. I wonder if there 
is a way to push the Jupyter notebook into its own sub-cgroup of the 
job, and let the kernel kill the notebook when it goes over memory, 
leaving the parent job running to notify the user in some way?
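On the job side, catching the custom soft-kill signal could be a small wrapper along these lines (a sketch only; the choice of SIGUSR1 and the "notify the user" reaction are assumptions, and the matching submit-file side would be something like kill_sig = SIGUSR1):

```python
import os
import signal

vacate_requested = False

def on_warning(signum, frame):
    # React to the startd's soft-kill warning: set a flag so the job
    # can notify the user / checkpoint before the hard kill arrives.
    global vacate_requested
    vacate_requested = True

# Register SIGUSR1 as the pre-kill warning handler.
signal.signal(signal.SIGUSR1, on_warning)

# Simulate the startd sending the warning signal to this process.
os.kill(os.getpid(), signal.SIGUSR1)

print(vacate_requested)  # the handler has already run at this point
```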

-greg
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/