Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] MaxVacateTime and KILLING_TIMEOUT seemingly not honored

Date: Tue, 01 Sep 2020 15:07:58 -0500
From: "Alec Sheperd (reply-all)" <alec.sheperd@xxxxxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] MaxVacateTime and KILLING_TIMEOUT seemingly not honored

Hi Todd,

That all sounds correct, yes. Sorry, I was trying to keep it brief.

We're currently running 8.9.6, although this issue started while we were still on 8.7, and the jobs run in the vanilla universe.


The basic work flow of the job submission is:

1. A process on the same node as the schedd queues X jobs (usually around 2000).
2. When they begin running, they each open a socket to the process running on the schedd host, which distributes the jobs' work. So there ends up being a lot of open ports on the host.
3. Once the process on the schedd host runs out of work for the jobs to do, it issues a condor_rm for all the running jobs.

I've tried to rule out issues pertaining to file descriptor counts and general networking problems, but so far I haven't found any issue with those.

I've also tried setting KillSig=9 in the submit file for both the real jobs and my test script. With that setting, condor_rm instantly removes the test job, but it still does not remove the real jobs.

Let me know if there is any other relevant information you would like me to add.


Alec

On 9/1/20 11:26 AM, Todd Tannenbaum wrote:

Hi Alec,
I am getting a little lost when reading the below, please help clarify for me....
My understanding is as follows:
You are using startd rank (i.e. RANK = XXX in your condor config) to prefer running specific jobs submitted by a specific user. When these jobs run and you then issue a condor_rm, the job is sent a SIGTERM but is not sent a SIGKILL after 10 minutes (your MaxVacateTime). However, when you submit a test script as the same specific user, your test script indeed does receive the SIGKILL 10 minutes after SIGTERM.
Did I get it right?
If so, it may help to focus on any differences between how you submitted your test job and how your real jobs are submitted. For example, perhaps your test job is universe=vanilla and your real jobs are universe=docker? Also, what version of HTCondor are you using?
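One way to look for such differences is to dump the ClassAd of a real job and of a test job and diff the two. The following is only a minimal sketch: compare_ads is a hypothetical helper (not an HTCondor tool), and the job IDs and file names in the usage comment are placeholders.

```shell
#!/bin/bash
# compare_ads FILE_A FILE_B   (hypothetical helper, for illustration only)
# Print the lines that differ between two "Attribute = value" dumps.
# Sorting first keeps attribute ordering from showing up as a difference.
# Returns diff's exit status: 0 if identical, 1 if the dumps differ.
compare_ads() {
    tmp_a=$(mktemp) || return 2
    tmp_b=$(mktemp) || return 2
    sort "$1" > "$tmp_a"
    sort "$2" > "$tmp_b"
    diff "$tmp_a" "$tmp_b"
    rc=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$rc"
}

# Usage with HTCondor (job IDs 1234.0 and 5678.0 are placeholders):
#   condor_q -long 1234.0 > real_job.ad
#   condor_q -long 5678.0 > test_job.ad
#   compare_ads real_job.ad test_job.ad
```

Attributes such as the universe and KillSig are the sort of thing worth checking in the output, assuming both jobs are still in the queue when the dumps are taken.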
Thanks
Todd



On 9/1/2020 9:58 AM, Alec Sheperd wrote:
Hello,
I've been having a long-standing issue with our Condor cluster that I have not been able to crack, primarily pertaining to jobs not being sent SIGKILL after the time specified in MaxVacateTime has elapsed.
Some background info: There are certain jobs that need to run in response to specific events. To satisfy this, we have rank preemption set up so that these jobs, which are submitted under a specific user, start ASAP. I'm not 100% knowledgeable about the code being run, but the general idea is that these jobs will run until removed by other means (i.e. they will never exit of their own accord). Normally this has been done by issuing a condor_rm once the work they are doing has been deemed complete.
More recently, due to changes in either the host machines, the condor configuration, or the code itself, the jobs never get removed via condor_rm and have to be killed locally on the execute host by sending SIGKILL to both the starter and the child process.
The child process does not properly handle SIGTERM, and for reasons beyond my scope, I cannot do much to change this on the code side. However, it seems strange to me that a SIGKILL does not seem to be sent after reaching the MaxVacateTime, which is set to MaxVacateTime = 10 * $(MINUTE). Not only that, but the KILLING_TIMEOUT for the startd, which is at the default of 30 seconds, does not seem to be honored either. Watching with strace seems to confirm that the SIGKILLs are never issued in these cases.
I've tested it with scripts like

#!/bin/bash
trap "echo 'do nothing'" SIGTERM
while :; do :; done
This seems to work, however, so I'm not sure what is different. I've wondered whether rank expressions prevent the SIGKILL from being sent, but running the above script as the user with rank preemption still ultimately does the correct thing.
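For comparison, the escalation expected here (SIGTERM, a grace period, then SIGKILL) can be reproduced outside of HTCondor. The following is only a rough sketch of that sequence, not how the startd or starter implements it: term_then_kill and GRACE_SECONDS are names made up for this example, and the 2-second grace period merely stands in for MaxVacateTime.

```shell
#!/bin/bash
# Illustrative only: GRACE_SECONDS stands in for MaxVacateTime, shortened
# so the example finishes quickly.
GRACE_SECONDS=2

# term_then_kill PID GRACE   (hypothetical helper)
# Send SIGTERM, poll for up to GRACE seconds, then fall back to SIGKILL.
# Sets the global "outcome" to "exited" or "escalated".
term_then_kill() {
    pid=$1
    grace=$2
    waited=0
    kill -TERM "$pid" 2>/dev/null
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid"
        outcome=escalated
    else
        outcome=exited
    fi
}

# Stand-in for the real job: ignores SIGTERM and never exits on its own.
bash -c 'trap "" TERM; while :; do sleep 1; done' &
job_pid=$!
sleep 1   # give the child time to install its trap

term_then_kill "$job_pid" "$GRACE_SECONDS"
wait "$job_pid" 2>/dev/null
status=$?   # 128 + 9 = 137 when the child was ended by SIGKILL
echo "outcome=$outcome exit_status=$status"
```

A process that ignores SIGTERM should end up in the "escalated" branch here, which is the behavior the real jobs are not showing under condor_rm.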
Any thoughts or ideas to test would be greatly appreciated!

Alec

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


