Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] MaxVacateTime and KILLING_TIMEOUT seemingly not honored

Date: Tue, 01 Sep 2020 09:58:10 -0500
From: Alec Sheperd <alec.sheperd@xxxxxxxxxxxxxxxx>
Subject: [HTCondor-users] MaxVacateTime and KILLING_TIMEOUT seemingly not honored

Hello,

I've been having a long-standing issue with our Condor cluster that I have not been able to crack, primarily pertaining to jobs not being issued SIGKILL after having been allotted the time specified in MaxVacateTime.

Some background info: There are certain jobs that need to run when specific events occur. In order to satisfy this, we have rank preemption set up so that these jobs, which get submitted under a specific user, start ASAP. I'm not 100% knowledgeable on the code being run, but the general idea is that these jobs will run until removed by other means (i.e. they will never exit of their own accord). This has normally been done by issuing a condor_rm once the work they are doing has been deemed complete.

In more recent times, either due to changes in the host machines, condor configuration, or the code itself, the jobs will never get removed via condor_rm, and have to be killed locally on the execute host by issuing a KILLSIG to both the starter and child process.

The child process does not properly handle SIGTERM, and for reasons beyond my scope, I cannot do much to change this on the code side. However, it seems strange to me that a SIGKILL does not seem to be sent after reaching the MaxVacateTime, which is set to MaxVacateTime = 10 * $(MINUTE). Not only that, but the KILLING_TIMEOUT for the startd, which is at the default of 30 seconds, does not seem to be honored either. Watching with strace seems to confirm that the SIGKILLs are never issued in these cases.


I've tested it with scripts like

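#!/bin/bash
# Trap SIGTERM and only print a message, so the job never exits on SIGTERM.
trap "echo 'do nothing'" SIGTERM
# Busy-loop forever; only SIGKILL can stop this script.
while :; do :; done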

That seems to work, however, so I'm not sure. I've wondered whether rank expressions prevent this from happening, but running the above script as the user with rank preemption still seems to do the correct thing ultimately.
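The escalation being expected here (SIGTERM, a grace period, then SIGKILL) can be sketched outside of Condor roughly as follows. This is only an illustration under stated assumptions: GRACE_SECONDS, the variable names, and the sleeping child are placeholders, and nothing here reflects how the starter or startd actually implements MaxVacateTime or KILLING_TIMEOUT.

```shell
#!/bin/bash
# Hypothetical local check, run outside of HTCondor: start a child that traps
# SIGTERM, send SIGTERM, wait a short grace period, then escalate to SIGKILL.
GRACE_SECONDS=2   # placeholder standing in for the vacate timeout

# Child that traps SIGTERM like the test script above (sleeping, not spinning).
bash -c 'trap "echo do nothing" TERM; while :; do sleep 1; done' &
child=$!
sleep 1           # give the child time to install its trap

kill -TERM "$child"
sleep "$GRACE_SECONDS"

# kill -0 only checks whether the process still exists.
if kill -0 "$child" 2>/dev/null; then
    survived_term=yes
    kill -KILL "$child"
else
    survived_term=no
fi

# For a child terminated by a signal, wait reports 128 + the signal number.
wait "$child" 2>/dev/null
status=$?
echo "survived SIGTERM: $survived_term, exit status: $status"
```

If the child is still alive after the grace period and only goes away on SIGKILL, that matches the behavior I would expect the starter to enforce on the real job.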


Any thoughts or ideas to test would be greatly appreciated!

Alec

Follow-Ups:
- Re: [HTCondor-users] MaxVacateTime and KILLING_TIMEOUT seemingly not honored
  - From: Todd Tannenbaum
