Re: [HTCondor-users] pre-kill warning signals to jobs?



On 3/20/24 03:35, Thomas Hartmann wrote:
Hi all,

A not fully fermented idea: is there a way in Condor for the startd to send its job a signal on a predefined condition, e.g., as a warning when memory utilization gets close to the requested limit?


Hi Thomas:

I like where you are going, but this may be hard to do with the tools we have today. Perhaps we need to ferment (and then even distill!) in order to get something useful to work.

Today, the startd can define a WANT_VACATE expression, and the job can define a custom soft-kill signal (the kill_sig submit command) that is sent first when WANT_VACATE is true. So, in theory, you could combine these two to send some custom signal (SIGUSR1, maybe?).

HOWEVER, a job can allocate memory very quickly, and the startd only samples the job's memory usage periodically, so it may not notice a spike until it is too late. We'd still need a good way to notify the user. I wonder if there is a way to push the Jupyter notebook into its own sub-cgroup of the job, and let the kernel kill the notebook when it exceeds its memory limit, leaving the parent job running to notify the user in some way?
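A minimal Python sketch of the sub-cgroup idea, as a job-side wrapper. It assumes cgroup v2 and that the job is allowed to create child cgroups under its own cgroup (delegation); neither is something HTCondor is known to guarantee here. The names `notify_user`, `make_sub_cgroup`, `run_in_cgroup`, and `supervise` are hypothetical. The wrapper starts the notebook inside a child cgroup with its own `memory.max`, so a kernel OOM kill hits only the notebook, and it also traps SIGUSR1 in case that has been configured as the soft-kill signal. Caveat: under cgroup v2's "no internal processes" rule, the parent cgroup must enable the memory controller in `cgroup.subtree_control`, which means the wrapper itself would have to sit in a sibling leaf cgroup; that setup is not shown.

```python
import os
import signal
import subprocess
import sys
from pathlib import Path
from typing import Sequence


def notify_user(message: str) -> None:
    # Hypothetical notification hook; a real job might write to a file the
    # user watches, send mail, or post to the notebook UI instead.
    print(message, file=sys.stderr, flush=True)


def make_sub_cgroup(job_cgroup: Path, name: str, memory_max_bytes: int) -> Path:
    """Create a child cgroup under the job's cgroup with its own memory cap.

    Assumes cgroup v2 and write access (delegation) to job_cgroup.
    """
    child = job_cgroup / name
    child.mkdir(exist_ok=True)
    # memory.max is the cgroup v2 hard limit: if the child exceeds it, the
    # kernel OOM killer acts inside this child only, not the parent job.
    (child / "memory.max").write_text(f"{memory_max_bytes}\n")
    return child


def run_in_cgroup(cmd: Sequence[str], cgroup: Path) -> int:
    procs = cgroup / "cgroup.procs"

    def enter_cgroup() -> None:
        # Runs in the forked child just before exec, so the command starts
        # life already inside the child cgroup.
        with open(procs, "w") as f:
            f.write(f"{os.getpid()}\n")

    proc = subprocess.Popen(list(cmd), preexec_fn=enter_cgroup)
    # A negative return code -N means the child died from signal N.
    return proc.wait()


def supervise(cmd: Sequence[str], job_cgroup: Path, memory_max_bytes: int) -> int:
    # If the startd is configured to send SIGUSR1 as the soft-kill signal,
    # treat it as an early warning instead of dying immediately.
    signal.signal(
        signal.SIGUSR1,
        lambda signum, frame: notify_user("Warning: the startd asked this job to vacate."),
    )
    cgroup = make_sub_cgroup(job_cgroup, "notebook", memory_max_bytes)
    rc = run_in_cgroup(cmd, cgroup)
    if rc == -signal.SIGKILL:
        # The OOM killer uses SIGKILL, but other things can send it too,
        # so this check is only a heuristic.
        notify_user("The notebook was killed, probably for exceeding its memory limit.")
    return rc
```

For example, `supervise(["jupyter", "notebook"], Path("/sys/fs/cgroup/<job cgroup>"), 4 * 1024**3)` would cap the notebook at 4 GiB while the wrapper keeps running to report what happened; the cgroup path is a placeholder, not a path HTCondor is known to use. Note that a SIGUSR1 sent as a soft kill still means the job is being vacated, so it is a "last chance" warning rather than a way to keep running.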

-greg