
[HTCondor-users] activate_claim failing when many jobs start at once

Hi all,

We have recently increased the size of our StartDs and are seeing strange failures during job starts. The machines now run a single partitionable 192-CPU StartD instead of the 2 x 96-CPU StartD layout we were using previously.
The setup is puppetized to be the same aside from merging two partitionable StartDs into one.

What we observe is that when the large machines pull jobs after draining, there is a huge number of failures when the Shadow requests the claim from the StartD. The StartD cannot reply because the socket is closed [0], and the Shadow times out waiting for the reply [1]. There are several dozen of these failures when things go wrong. The timeout could also happen before the failed write; we cannot match the two sides accurately.
Strangely, the critical volume seems to be between 96 and 100 jobs starting at once on the same StartD. Below that everything works fine; above that, many more jobs fail than just the surplus. So it looks like we hit some limit at which Condor is not able to handle all the jobs at once.

Is there any knob we should look at to help with many simultaneous job starts? Is there a known issue, be it in Condor itself or something we messed up, e.g. the networking? Should we just put a limit on how many jobs may start at once?


PS: In case it's relevant, these are identical test jobs created with `queue 100` (or whatever volume we test with).

[0] StartLog
12/03/21 12:12:24 (pid:3700) (D_ALWAYS) slot1_56: Got activate_claim request from shadow (2a00:139c:3:2e5:0:61:d2:6c)
12/03/21 12:12:24 (pid:3700) (D_ALWAYS) condor_write(): Socket closed when trying to write 29 bytes to <[2a00:139c:3:2e5:0:61:d2:6c]:15444>, fd is 12
12/03/21 12:12:24 (pid:3700) (D_ALWAYS) Buf::write(): condor_write() failed
12/03/21 12:12:24 (pid:3700) (D_ALWAYS) slot1_56: Can't send eom to shadow.

[1] ShadowLog
12/03/21 12:12:37 (pid:3615484) (D_ALWAYS) (15310.259) (3615484): condor_read(): timeout reading 21 bytes from startd slot1@xxxxxxxxxxxxxxxxxxxxxx
12/03/21 12:12:37 (pid:3614255) (D_ALWAYS) (15310.701) (3614255): RemoteResource::killStarter(): Could not send command to startd
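
Since we cannot match both sides accurately by hand, one way to at least compare them would be to count the relevant messages per second in each log. The following is only a minimal sketch: it assumes every line starts with the `MM/DD/YY HH:MM:SS (pid:N) (D_...)` prefix seen in the excerpts above, and the names `parse_line` and `count_per_second` are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed prefix, taken from the excerpts: "MM/DD/YY HH:MM:SS (pid:N) (D_...) message"
LINE_RE = re.compile(
    r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \(pid:(\d+)\) \(D_[^)]*\) (.*)$"
)

def parse_line(line):
    """Return (timestamp, pid, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # Assumption: the date is month/day/two-digit year.
    stamp = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
    return stamp, int(m.group(2)), m.group(3)

def count_per_second(lines, needle):
    """Count the lines whose message contains `needle`, keyed by timestamp."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and needle in parsed[2]:
            counts[parsed[0]] += 1
    return counts
```

Running `count_per_second` over the StartLog with `"Socket closed"` and over the ShadowLog with `"condor_read(): timeout"` gives two per-second histograms that can be put side by side, which should show whether the write failures or the timeouts come first.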
