Re: [Condor-users] Sudden negotiator issues (high CPU loads, condor_q timeouts)




Long negotiation cycles can be caused by poor auto-clustering of jobs. Look under "Monitoring Health of the Negotiator" on the following page:

http://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageLargeCondorPools
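To illustrate why poor auto-clustering hurts, here is a minimal Python sketch. It is not Condor's implementation. It assumes only that jobs whose "significant" attributes are all identical share one autocluster, so the negotiator does matchmaking work roughly once per cluster rather than once per job. The attribute names, the JobTag expression, and the job ads below are hypothetical.

```python
from collections import Counter

def autocluster_sizes(jobs, significant_attrs):
    """Group job ads by their values for the significant attributes.

    Returns a Counter mapping each distinct signature to the number of
    jobs sharing it. Illustrative only; not Condor's actual algorithm.
    """
    def signature(job):
        return tuple(job.get(attr) for attr in significant_attrs)
    return Counter(signature(job) for job in jobs)

# Hypothetical attribute names chosen for the example.
attrs = ["Requirements", "ImageSize"]

# 5000 jobs with identical significant attributes: one autocluster.
good = [{"Requirements": 'Arch == "X86_64"', "ImageSize": 1024}
        for _ in range(5000)]

# 5000 jobs whose Requirements embed a per-job value: 5000 autoclusters.
bad = [{"Requirements": 'Arch == "X86_64" && JobTag =!= %d' % i,
        "ImageSize": 1024}
       for i in range(5000)]

print(len(autocluster_sizes(good, attrs)))  # 1
print(len(autocluster_sizes(bad, attrs)))   # 5000
```

In the second case every job is its own cluster, which is the kind of situation that can stretch a negotiation cycle.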

--Dan

Pascal Jermini wrote:
Hello all,

We are currently running a Condor pool (version 7.2.4 on the central manager
and submit host), with ~1200 slots and ~5000 jobs in the queue.

All of a sudden (since last Sunday), we started seeing condor_q time out,
condor_reschedule hang for half a minute, and CPU usage on the central
manager/negotiator spike to 100% for the duration of the negotiation
(which can last up to 120 seconds).

Looking at the Negotiator log, we can see that messages about the
negotiation phase of the different jobs scroll by rather slowly; once the
negotiation is done, condor_q works again.
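One way to put numbers on this is to time each cycle from the Negotiator log. The Python sketch below assumes that each log line begins with a "month/day hour:minute:second" timestamp and that cycles are bracketed by lines containing "Started Negotiation Cycle" and "Finished Negotiation Cycle". Check both assumptions against your own log before relying on it. The sample lines are made up for illustration and are not real log output.

```python
import re
from datetime import datetime

START = "Started Negotiation Cycle"
FINISH = "Finished Negotiation Cycle"
# Assumed timestamp prefix, e.g. "11/17 10:00:00 "
STAMP = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) ")

def cycle_durations(lines):
    """Return the length in seconds of each completed negotiation cycle."""
    durations = []
    started = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue
        # The assumed format carries no year; a fixed leap year keeps
        # Feb 29 parseable. Cycles spanning New Year are not handled.
        when = datetime.strptime("2000/" + match.group(1),
                                 "%Y/%m/%d %H:%M:%S")
        if START in line:
            started = when
        elif FINISH in line and started is not None:
            durations.append((when - started).total_seconds())
            started = None
    return durations

# Made-up sample in the assumed format.
sample = [
    "11/17 10:00:00 ---------- Started Negotiation Cycle ----------",
    "11/17 10:02:00 ---------- Finished Negotiation Cycle ----------",
]
print(cycle_durations(sample))  # [120.0]
```

To run it on a real log, pass an open file object instead of the sample list; the log path depends on the local configuration.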

The central manager is quite a beefy machine (its hardware was upgraded less
than a month ago), and the Condor config has not been modified for at least
a year.

Any clue about what could be the problem?

Thanks,

Pascal

PS: I don't know if this is relevant or not, but when we attach strace to
the condor_negotiator process, we get line after line of the following
syscall:

rt_sigaction(SIGFPE, {0x818c89a, [], 0}, {0x811322c, ~[HUP INT ILL TRAP ABRT
BUS KILL SEGV USR2 ALRM CONT TTIN WINCH RT_1 RT_2 RT_3 RT_8 RT_9 RT_12 RT_14
RT_15 RT_16 RT_17 RT_18 RT_19 RT_20 RT_21 RT_22 RT_24 RT_25 RT_26 RT_27 RT_29
RT_30 RT_31], 0}, 8) = 0
rt_sigaction(SIGFPE, {0x811322c, ~[HUP INT ILL TRAP ABRT BUS KILL SEGV USR2
ALRM CONT TTIN WINCH RT_1 RT_2 RT_3 RT_8 RT_9 RT_12 RT_14 RT_15 RT_16 RT_17
RT_18 RT_19 RT_20 RT_21 RT_22 RT_24 RT_25 RT_26 RT_27 RT_29 RT_30 RT_31], 0},
{0x818c89a, ~[HUP INT ILL TRAP ABRT BUS KILL SEGV USR2 ALRM CONT TTIN WINCH
RT_1 RT_2 RT_3 RT_8 RT_9 RT_12 RT_14 RT_15 RT_16 RT_17 RT_18 RT_19 RT_20
RT_21 RT_22 RT_24 RT_25 RT_26 RT_27 RT_29 RT_30 RT_31], 0}, 8) = 0