
Re: [Condor-users] documentation for TOOL_TIMEOUT_MULTIPLIER

On Apr 16, 2010, at 10:14 , Greg Thain wrote:

On 04/15/2010 03:48 PM, Peter Doherty wrote:
While trying to debug a problem in condor I stumbled across these two config options:

Are these documented anywhere? A search of the Condor manual 7.4.2 shows nothing, and google barely gets any hits on them either.


They aren't documented anywhere. I'll write something up, and try to get it into the manual.

In the meantime, I'll try to give a quick explanation of what they do. In condor, there are all kinds of places where we time out a network communication. Many of those timeouts are 20 seconds, but some others have different, arbitrary values. Setting SUBSYS_TIMEOUT_MULTIPLIER to an integer value multiplies those timeouts by that value. SUBSYS is the name of the subsystem that a daemon is a part of. Usually this is the daemon's name, like SCHEDD, STARTD, etc. There is one subsystem, "TOOL", which covers most of the command-line tools.
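
As a rough illustration, a local configuration file might contain something like the following. The value 3 is an arbitrary assumption chosen for the example, not a recommendation; with it, a 20-second timeout in the schedd or in the command-line tools would become 60 seconds.

```
# Sketch only: the multiplier value 3 is an assumed example, not a recommended setting.

# Multiply network timeouts in the schedd subsystem by 3 (e.g. 20 s becomes 60 s).
SCHEDD_TIMEOUT_MULTIPLIER = 3

# Do the same for the "TOOL" subsystem, which covers most command-line tools.
TOOL_TIMEOUT_MULTIPLIER = 3
```

Each daemon only looks at the setting for its own subsystem, so the two lines above would leave the timeouts in, say, the startd unchanged.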

Ideally, you shouldn't ever need to set these, but sometimes it can be a useful hack to work around some other problem. When you raise one timeout like this, there is always the danger of causing a cascading timeout somewhere else, so that's something you need to be aware of.


Thanks Greg,

So this is a tangent, but here's what I'm wondering. I found out about these variables here:

I was trying to find out why our submit host hangs when we do a condor_rm on a lot of jobs. Actually, what I'm doing is a condor_rm on a DAGMAN that has about 5000-8000 jobs queued up. Condor typically becomes unresponsive, condor_q fails to fetch ads, and the system just spins its wheels; even after waiting about 2 hours, it often still doesn't work. I have to kill and restart condor a few times to get it unwedged, and then the jobs start clearing out of the queue.

Here's what I found on that blog that caught my eye. "If a condor_rm, condor_q, condor_submit, etc happens during a negotiation, there is a good chance it may timeout."

I've set our SCHEDD_INTERVAL, NEGOTIATOR_INTERVAL, and NEGOTIATOR_CYCLE_DELAY to 5 seconds. This was needed to get all those 5000 jobs queued and running quickly; otherwise it was taking hours and hours. But I'm wondering whether this means that when 5000 jobs are being condor_rm'ed, the rm gets interrupted by a negotiation cycle every 5 seconds, so the rm isn't completing reliably, and this constant feedback loop is making condor non-responsive.
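
For reference, the three settings described above would look roughly like this in the configuration file. This is a sketch of what is described in the text, not a copy of the actual file; all three values are in seconds.

```
# Sketch of the settings described above; values are in seconds.
SCHEDD_INTERVAL = 5
NEGOTIATOR_INTERVAL = 5
NEGOTIATOR_CYCLE_DELAY = 5
```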