
[Condor-users] Large number of queued jobs causes schedd timeout during negotiation



I have 2500 nice'd jobs queued on this machine. During the negotiation
cycle the schedd connection is timing out:

3/2 15:37:44   Negotiating with nice-user.ichesal@xxxxxxxxxx at <137.57.142.112:57094>
3/2 15:37:44   Calculating schedd limit with the following parameters
3/2 15:37:44     ScheddPrio       = 20000000.000000
3/2 15:37:44     ScheddPrioFactor = 10000000.000000
3/2 15:37:44     scheddShare      = 0.000000
3/2 15:37:44     scheddAbsShare   = 0.000000
3/2 15:37:44     ScheddUsage      = 0
3/2 15:37:44     scheddLimit      = 500000
3/2 15:37:44     MaxscheddLimit   = 500000
3/2 15:37:44 Socket to <137.57.142.112:57094> not in cache, creating one
3/2 15:37:44 NEGOTIATOR_TIMEOUT_MULTIPLIER is undefined, using default value of 0
3/2 15:37:44 SocketCache:  Found unused slot 14
3/2 15:37:44     Sending SEND_JOB_INFO/eom
3/2 15:37:44     Getting reply from schedd ...
3/2 15:38:14 condor_read(): timeout reading buffer.
3/2 15:38:14     Failed to get reply from schedd
3/2 15:38:14   Error: Ignoring schedd for this cycle
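
For reference, the negotiator waits 30 seconds (15:37:44 to 15:38:14) before giving up on the schedd. Below is a rough Python sketch for pulling that wait out of a negotiator log. The function name `reply_wait_seconds` and the `year` argument are my own inventions for illustration; the log lines carry no year, so the default of 2000 is only a placeholder, and the matching is based solely on the two messages shown above.

```python
import re
from datetime import datetime

# Matches the "M/D HH:MM:SS" prefix used in the log excerpt above.
STAMP = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2})\s+(.*)$")


def reply_wait_seconds(log_text, year=2000):
    """Return the seconds between each 'Getting reply from schedd'
    line and the read timeout that follows it.

    `year` is an assumption: the log has no year, so one is supplied
    only to build complete timestamps.
    """
    waits = []
    started = None
    for line in log_text.splitlines():
        match = STAMP.match(line.strip())
        if not match:
            continue  # skip wrapped or untimestamped lines
        when = datetime.strptime(f"{year}/{match.group(1)}",
                                 "%Y/%m/%d %H:%M:%S")
        message = match.group(2)
        if "Getting reply from schedd" in message:
            started = when
        elif "timeout reading buffer" in message and started is not None:
            waits.append((when - started).total_seconds())
            started = None
    return waits


SAMPLE = """\
3/2 15:37:44     Getting reply from schedd ...
3/2 15:38:14 condor_read(): timeout reading buffer.
"""

print(reply_wait_seconds(SAMPLE))  # prints [30.0]
```

Running this over a full log should show whether every failed cycle hits the same 30-second wait or whether the delay varies.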

I have "NEGOTIATE_ALL_JOBS_IN_CLUSTER = False" set on this schedd machine; I
thought that might help with response time. 2500 idle jobs seems like a
pretty paltry number to cause negotiation problems. Is there some way I
can improve the schedd's response time so that this timeout doesn't occur?
Maybe by adjusting NEGOTIATOR_TIMEOUT_MULTIPLIER?

Thanks.

- Ian C.