Re: [HTCondor-users] activate_claim failing when many jobs start at once

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Date: Mon, 6 Dec 2021 18:13:13 -0600

From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>

Subject: Re: [HTCondor-users] activate_claim failing when many jobs start at once

Hi Max,

Hope all is going well with you, and thank you for your report below.

Quick question: When you were using the 2 x 96-CPU configuration, were you achieving this via two startds on the server, or one startd configured with two pslots?

At any rate, we are investigating the below situation here at UW. We think we can reproduce the situation you described. Preliminary profiling shows there may be a non-scalable data structure in the condor_procd that is slowing down the rate at which the startd can spawn condor_starters, causing a cascading timeout. We will see if we can improve this situation, and also look at adding a config knob to "rate limit" how many claims can be activated against one startd in rapid succession.
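To make the "rate limit" idea concrete, here is a minimal Python sketch of a token bucket, one common way to cap how many requests are admitted in rapid succession. This is not HTCondor code and does not describe how any eventual config knob would work; the names `TokenBucket`, `try_acquire`, and `simulate`, and the rate and burst values, are all hypothetical.

```python
import time


class TokenBucket:
    """Hypothetical token-bucket limiter; an illustration, not part of HTCondor."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = float(rate_per_sec)   # tokens refilled per second
        self.burst = float(burst)         # maximum tokens held at once
        self.clock = clock                # injectable clock, eases testing
        self.tokens = float(burst)        # start full: allow an initial burst
        self.last = clock()

    def try_acquire(self):
        """Return True if one activation may proceed now, else False."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate(n_requests, rate_per_sec, burst):
    """Count how many of n simultaneous requests are admitted vs. deferred."""
    fake_now = [0.0]  # frozen clock: all requests arrive at the same instant
    bucket = TokenBucket(rate_per_sec, burst, clock=lambda: fake_now[0])
    admitted = sum(bucket.try_acquire() for _ in range(n_requests))
    return admitted, n_requests - admitted
```

With these assumed numbers, `simulate(100, 10, 20)` admits the first 20 of 100 simultaneous requests and defers the other 80, which would then be admitted at 10 per second as tokens refill.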

Let's keep in touch on this....

regards
Todd

On 12/3/2021 8:08 AM, Fischer, Max (SCC) wrote:

Hi all,

we have recently increased the size of our StartDs and are seeing strange failures during job starts. The machines have a single partitionable 192-CPU StartD versus the 2 x 96-CPU StartD layout we were using previously.
The setup is puppetized to be the same aside from merging two partitionable StartDs into one.

What we observe is that if the large machines pull jobs after draining, there is a huge number of failures when the Shadow requests the claim from the StartD. The StartD cannot reply because the socket is closed [0] and the Shadow times out waiting for the reply [1]. There are several dozen of these failures when things go wrong. It could also be that the timeout happens before the failed write; we cannot match both sides accurately.
Strangely, it looks like the critical volume is between 96 and 100 jobs starting at once on the same StartD. Below that everything works fine; above that, many more jobs fail than just the surplus. So it looks like we hit some limit at which Condor is not able to handle all the jobs at once.

Is there any knob we should look at to help with many job starts? Is there a known issue, be it in Condor itself or something we messed up, e.g. the networking? Should we just put a limit on how many jobs may start at once?

Cheers,
Max

PS: In case it's relevant, these are identical test jobs created with `queue 100` (or whatever volume we test with).

[0] StartLog
12/03/21 12:12:24 (pid:3700) (D_ALWAYS) slot1_56: Got activate_claim request from shadow (2a00:139c:3:2e5:0:61:d2:6c)
12/03/21 12:12:24 (pid:3700) (D_ALWAYS) condor_write(): Socket closed when trying to write 29 bytes to <[2a00:139c:3:2e5:0:61:d2:6c]:15444>, fd is 12
12/03/21 12:12:24 (pid:3700) (D_ALWAYS) Buf::write(): condor_write() failed
12/03/21 12:12:24 (pid:3700) (D_ALWAYS) slot1_56: Can't send eom to shadow.

[1] ShadowLog
12/03/21 12:12:37 (pid:3615484) (D_ALWAYS) (15310.259) (3615484): condor_read(): timeout reading 21 bytes from startd slot1@xxxxxxxxxxxxxxxxxxxxx.
12/03/21 12:12:37 (pid:3614255) (D_ALWAYS) (15310.701) (3614255): RemoteResource::killStarter(): Could not send command to startd
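
Since matching the two sides by hand is difficult, a rough first step could be to count failure messages per second in each log and compare the two timelines. The Python sketch below assumes the line layout seen in the excerpts above (timestamp, pid, debug category, message); other logging configurations may produce a different layout. The helper names `parse_line` and `count_by_second` are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines shaped like the excerpts above, e.g.
# "12/03/21 12:12:24 (pid:3700) (D_ALWAYS) slot1_56: Got activate_claim ..."
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) \((\w+)\) (.*)$"
)


def parse_line(line):
    """Return (timestamp, pid, category, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp, pid, category, message = m.groups()
    return datetime.strptime(stamp, "%m/%d/%y %H:%M:%S"), int(pid), category, message


def count_by_second(lines, needle):
    """Count log lines whose message contains `needle`, keyed by timestamp (1 s resolution)."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and needle in parsed[3]:
            counts[parsed[0]] += 1
    return counts
```

For example, calling `count_by_second` on the StartLog lines with `"Socket closed"` and on the ShadowLog lines with `"timeout reading"` gives two per-second timelines that can be compared to see which failure tends to come first.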


-- Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison Center for High Throughput Computing Department of Computer Sciences Calendar: https://tinyurl.com/yd55mtgd 1210 W. Dayton St. Rm #4257 Phone: (608) 263-7132 Madison, WI 53706-1685
