
Re: [HTCondor-users] Capturing the signal from worker nodes when job breaches memory



Running the command inside a bash wrapper lets you detect when the process is killed by the oom-killer:

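#!/bin/bash

# Command to run, stored as an array so each argument stays a separate word
REAL_COMMAND=(/usr/bin/tail /dev/zero)

# Run the command directly; wrapping it in $( ) would try to execute its output
"${REAL_COMMAND[@]}"
ecode=$?

if [ "$ecode" -ne 0 ]; then
    echo "${REAL_COMMAND[*]} returned $ecode" >&2
fi

exit "$ecode"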


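$ecode will be 128 + 9 = 137 when the oom-killer kills the command: bash reports a death by signal N as exit status 128 + N, and the oom-killer sends signal 9 (SIGKILL).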
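As a rough illustration of that arithmetic, the sketch below turns a wrapper exit status back into a signal name. The function name describe_status is made up for this example, and the self-inflicted SIGKILL only stands in for a real oom-killer event. Note that a status above 128 can also be an ordinary exit code chosen by the program, so treat the decoded name as a hint rather than proof.

```shell
#!/bin/bash

# describe_status is a hypothetical helper, not part of HTCondor or bash.
# It interprets an exit status using the shell's 128 + N convention.
describe_status() {
    local ecode=$1
    local name=""
    if [ "$ecode" -gt 128 ]; then
        # kill -l <number> prints the signal name without the SIG prefix
        name=$(kill -l $((ecode - 128)) 2>/dev/null) || name=""
    fi
    if [ -n "$name" ]; then
        echo "killed by SIG$name"
    else
        echo "exited with status $ecode"
    fi
}

# Simulate an oom-killer style death: the child sends itself signal 9.
bash -c 'kill -9 $$'
describe_status $?
```

In the wrapper, the same check could be applied to $ecode before the final exit to decide whether the command most likely died from SIGKILL.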

 

For testing, the OOM condition will probably occur sooner if you first disable swap:

sudo swapoff -a

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Date: Friday, October 6, 2023 at 6:19 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Capturing the signal from worker nodes when job breaches memory


On 10/5/23 10:03, Vikrant Aggarwal wrote:
> Hello Experts,
>
> We want to capture the signal to copy some logs before the scratch
> directory disappears after the job goes into hold status because of
> memory breach but we are unsuccessful to do it. Do we have any way to
> achieve this? We thought it was probably a job wrapper which is doing
> exec to run actual condor jobs not allowing us to capture the signal
> but that's not the case.


The Linux out-of-memory killer sends signal 9 (SIGKILL), which is uncatchable.  You
could write a startd policy which evicts jobs when their MemoryUsage is
some percentage of the total, and if the job has

when_to_transfer_output = ON_EXIT_OR_EVICT

then the scratch directory would get copied back to the spool on the AP (access point).
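To see why a handler inside the job cannot help with signal 9, here is a small sketch that can be run on any Linux box. The child loop and the run_and_signal helper are invented for the demonstration and have nothing to do with HTCondor itself; the child simply installs a TERM handler and then idles.

```shell
#!/bin/bash

# Hypothetical stand-in for a job: installs a TERM handler, then idles.
child='trap "echo handler ran; exit 0" TERM; while :; do sleep 0.1; done'

# Start the child, send it the named signal, and report its exit status.
run_and_signal() {
    bash -c "$child" &
    local pid=$!
    sleep 0.5                # give the child time to install its trap
    kill "-$1" "$pid"
    wait "$pid"
    echo "signal $1: status $?"
}

run_and_signal TERM   # the handler gets a chance to run
run_and_signal KILL   # the child dies immediately, no handler output
```

With TERM the child prints its message and exits cleanly; with KILL it never gets the chance, and the parent sees status 137. That is why any log copying has to happen outside the killed process, for example in a wrapper or through an eviction policy as described above.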

-greg

