
Re: [HTCondor-users] activate_claim failing when many jobs start at once



Hi Todd,

Same to you! Hope you are doing well in these times.

Our 2 x 96 CPU setup is a single startd with two partitionable slots. In case it is relevant, the two slots are of different SLOT_TYPE with different START rules and allowed submitters; there is no way for the same job cluster to match both slot types.
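In case a concrete picture helps, this is roughly what the two-pslot layout looks like; the START policy and the group names below are made-up placeholders, not our actual rules:

NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=96
SLOT_TYPE_1_PARTITIONABLE = True

NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2 = cpus=96
SLOT_TYPE_2_PARTITIONABLE = True

# Placeholder policy keyed on slot type; the real START rules and
# allowed submitters differ and are omitted here.
START = (SlotTypeID == 1 && TARGET.Owner == "group_a") || \
        (SlotTypeID == 2 && TARGET.Owner == "group_b")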

We've now started skimming over our older nodes and are seeing spurious failures there as well. The trend holds that more jobs starting at once triggers the issue more reliably, but the limit is not as precise as we initially thought. It occasionally happens in the 32-64 slot range, but there it is rare enough that it does not trigger alarms/failsafes.
Let me know if you need any specific data; I'm afraid there is too much of it for me to spot the one useful needle in the haystack.

In case anyone else has the issue and needs a temporary workaround: We have limited the number of claims per startd per negotiation cycle by using NUM_CLAIMS = 16 and CLAIM_WORKLIFE = 0.
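For the record, that is just these two settings in the startd configuration; 16 happens to work for us and is not a carefully tuned value:

# Limit how many claims can be handed out per negotiation cycle
NUM_CLAIMS = 16
# Close each claim after its first job, so claims are not reused and
# the cap applies anew each cycle
CLAIM_WORKLIFE = 0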

Cheers,
Max

On 7. Dec 2021, at 01:13, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:

Hi Max,

Hope all is going well with you, and thank you for your report below.

Quick question: When you were using the 2 x 96-CPU configuration, were you achieving this via two startds on the server, or one startd configured with two pslots?

At any rate, we are investigating this here at UW and think we can reproduce the situation you described. Preliminary profiling shows there may be a non-scalable data structure in the condor_procd that is slowing down the rate at which the startd can spawn condor_starters, causing a cascading timeout. We will see if we can improve this, and also look at adding a config knob to "rate limit" how many claims can be activated against one startd in rapid succession.

Let's keep in touch on this....

regards
Todd


On 12/3/2021 8:08 AM, Fischer, Max (SCC) wrote:
Hi all,

We have recently increased the size of our StartDs and are seeing strange failures during job starts. The machines now have a single partitionable 192-CPU StartD instead of the 2 x 96-CPU StartD layout we were using previously.
The setup is puppetized to be the same aside from merging two partitionable StartDs into one.
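For completeness, the new layout is essentially one partitionable slot spanning the whole machine, roughly along these lines:

# Single partitionable slot over all 192 cores
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = True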

What we observe is that when the large machines pull jobs after draining, there is a huge number of failures when the Shadow requests the claim from the StartD. The StartD cannot reply because the socket is closed [0], and the Shadow times out waiting for the reply [1]. There are several dozen of these failures when things go wrong; it could also be that the timeout happens before the failed write, as we cannot match the two sides accurately.
Strangely, it looks like the critical volume is between 96 and 100 jobs starting at once on the same StartD. Below that everything works fine; above that, many more jobs fail than just the surplus. So it looks like we hit some limit at which Condor is not able to handle all the jobs at once.

Is there any knob we should look at to help with many simultaneous job starts? Is this a known issue, either in Condor itself or in something we may have messed up, e.g. the networking? Should we just put a limit on how many jobs may start at once?

Cheers,
Max

PS: In case it's relevant, these are identical test jobs created with `queue 100` (or whatever volume we test with).
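For reference, the submit file is nothing special; roughly along these lines, with the executable and runtime being stand-ins rather than our actual test payload:

executable = /bin/sleep
arguments  = 300

request_cpus   = 1
request_memory = 1GB

log    = test_$(Cluster).log
output = test_$(Cluster).$(Process).out
error  = test_$(Cluster).$(Process).err

queue 100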

[0] StartLog
12/03/21 12:12:24 (pid:3700) (D_ALWAYS) slot1_56: Got activate_claim request from shadow (2a00:139c:3:2e5:0:61:d2:6c)
12/03/21 12:12:24 (pid:3700) (D_ALWAYS) condor_write(): Socket closed when trying to write 29 bytes to <[2a00:139c:3:2e5:0:61:d2:6c]:15444>, fd is 12
12/03/21 12:12:24 (pid:3700) (D_ALWAYS) Buf::write(): condor_write() failed
12/03/21 12:12:24 (pid:3700) (D_ALWAYS) slot1_56: Can't send eom to shadow.

[1] ShadowLog
12/03/21 12:12:37 (pid:3615484) (D_ALWAYS) (15310.259) (3615484): condor_read(): timeout reading 21 bytes from startd slot1@xxxxxxxxxxxxxxxxxxxxx.
12/03/21 12:12:37 (pid:3614255) (D_ALWAYS) (15310.701) (3614255): RemoteResource::killStarter(): Could not send command to startd



-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685 
