Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] documentation for TOOL_TIMEOUT_MULTIPLIER

Date: Fri, 16 Apr 2010 10:48:42 -0400
From: Peter Doherty <doherty@xxxxxxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] documentation for TOOL_TIMEOUT_MULTIPLIER


On Apr 16, 2010, at 10:14 , Greg Thain wrote:

On 04/15/2010 03:48 PM, Peter Doherty wrote:
While trying to debug a problem in condor I stumbled across these two config options:
TOOL_TIMEOUT_MULTIPLIER
SUBMIT_TIMEOUT_MULTIPLIER
Are these documented anywhere? A search of the Condor manual 7.4.2 shows nothing, and Google barely gets any hits on them either.
Peter:
They aren't documented anywhere. I'll write something up, and try to get it into the manual.
In the meantime, I'll try to give a quick explanation of what they do. In condor, there are all kinds of places where we time out a network communication. Many of those timeouts are 20 seconds, but some others have different arbitrary timeouts. Setting SUBSYS_TIMEOUT_MULTIPLIER to some integral value multiplies those timeouts by the setting's value. SUBSYS is the name of the subsystem that a daemon is a part of. Usually this is the daemon's name, like SCHEDD, STARTD, etc. There is one subsys, "TOOL", which covers most of the command-line tools.
Ideally, you shouldn't ever need to set these, but sometimes it can be a useful hack to work around some other problem. When you up one timeout like this, there is always the danger of causing a cascading timeout somewhere else, so that's something you need to be aware of.
-Greg
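
As a rough illustration of the description above, a configuration fragment using these settings might look like the sketch below. The value 3 is an arbitrary assumption, not a recommendation, and SCHEDD_TIMEOUT_MULTIPLIER is simply the SUBSYS pattern applied to the SCHEDD subsystem named above. With a multiplier of 3, a 20-second timeout would become 60 seconds.

```
# Hypothetical values for illustration only; the setting takes an integral value.
# Multiplies the network timeouts used by most of the command-line tools.
TOOL_TIMEOUT_MULTIPLIER = 3

# The same SUBSYS_TIMEOUT_MULTIPLIER pattern applied to the schedd daemon.
SCHEDD_TIMEOUT_MULTIPLIER = 3
```

Because raising one timeout can cause a cascading timeout elsewhere, it is probably safest to change one subsystem at a time and observe the effect.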



Thanks Greg,

So this is a tangent, but here's what I'm wondering. I found out about these variables here:

http://spinningmatt.wordpress.com/2009/12/16/timeouts-from-condor_rm-and-condor_submit/

I was trying to find out why our submit host hangs when we do a condor_rm on a lot of jobs. Actually, what I'm doing is condor_rm on a DAGMAN that has about 5000-8000 jobs queued up. Condor typically becomes unresponsive, condor_q fails to fetch ads, the system just spins its wheels, and after waiting about 2 hours, it still often doesn't work. I have to kill and restart condor a few times to get it unwedged, and then the jobs start clearing out of the queue.

Here's what I found on that blog that caught my eye. "If a condor_rm, condor_q, condor_submit, etc happens during a negotiation, there is a good chance it may timeout."

I've set our SCHEDD_INTERVAL, NEGOTIATOR_INTERVAL, and NEGOTIATOR_CYCLE_DELAY to 5 seconds. This was needed to be able to get all those 5000 jobs queued and running quickly, otherwise it was taking hours and hours. But I'm wondering if this means that when 5000 jobs are being condor_rm'ed, that every 5 seconds the rm gets interrupted by a negotiator cycle, and so the rm isn't completing reliably, and this constant feedback loop is making condor non-responsive.
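
For reference, the settings described above would look roughly like this in a Condor configuration file. This is only a sketch of the three values mentioned; the rest of the configuration is not shown here and is not assumed.

```
# All three intervals are in seconds, set to 5 as described above.
SCHEDD_INTERVAL = 5
NEGOTIATOR_INTERVAL = 5
NEGOTIATOR_CYCLE_DELAY = 5
```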


Peter
