
Re: [HTCondor-users] Capturing the signal from worker nodes when job breaches memory



On 10/5/23 10:03, Vikrant Aggarwal wrote:
Hello Experts,

We want to capture the signal so that we can copy some logs before the scratch directory disappears when the job goes into hold status because of a memory breach, but we have been unable to do so. Is there any way to achieve this? We thought a job wrapper that uses exec to run the actual condor jobs was probably preventing us from capturing the signal, but that's not the case.


The Linux out-of-memory killer terminates the process with signal 9 (SIGKILL), which is uncatchable. You could write a startd policy which evicts jobs when their MemoryUsage reaches some percentage of the memory provisioned for the slot, and if the job has

when_to_transfer_output = ON_EXIT_OR_EVICT

then the scratch directory would get copied back to the spool on the access point (AP).
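
To illustrate why a wrapper cannot trap the out-of-memory kill, here is a minimal Python sketch that tries to install a handler for a given signal. The helper name can_install_handler is hypothetical and only for illustration; it assumes a Unix-like system (signal.SIGKILL is not defined on Windows) and that it is called from the main thread, since Python only allows handlers to be installed there.

```python
import signal


def can_install_handler(signum):
    """Return True if a Python-level handler can be installed for signum.

    Hypothetical helper for illustration only. The kernel refuses handlers
    for SIGKILL and SIGSTOP, which Python reports as OSError. A ValueError
    is raised if this is called outside the main thread.
    """
    try:
        previous = signal.signal(signum, lambda received, frame: None)
    except (OSError, ValueError):
        return False
    # Put the earlier handler back so this probe has no lasting effect.
    # signal.signal() returns None if the old handler was not set from Python.
    if previous is not None:
        signal.signal(signum, previous)
    return True


if __name__ == "__main__":
    for name in ("SIGTERM", "SIGKILL"):
        print(name, can_install_handler(getattr(signal, name)))
```

On Linux this should report that a handler can be installed for SIGTERM but not for SIGKILL, which is why no amount of wrapper logic will see the kill from the OOM killer. An eviction driven by policy, by contrast, gives HTCondor the chance to transfer the scratch directory before it is removed.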

-greg