Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] blacklisted local host Collector

Date: Thu, 26 Mar 2015 10:41:15 -0500
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] blacklisted local host Collector

On 3/26/2015 9:29 AM, Richard Crozier wrote:

Hello,

I'm running a personal condor pool on a machine with 64 nodes. Sometimes when I run

condor_status -total -debug

I get

03/26/15 13:40:25 Collector 127.0.0.1 blacklisted; skipping

I gather from other mailing list posts this means the localhost will be
skipped for an hour?

Can anyone suggest how to prevent this, or why it's happening? Can I
shorten the blacklisting time, or reset the blacklisting (condor_restart
doesn't seem to do it)?

If an HTCondor tool or daemon is attempting to query a collector and a) that connection attempt failed, and b) it took an abnormally long period of time to fail, then that tool or daemon will not attempt to connect with that collector for a default of one hour. You can control the time via config knob DEAD_COLLECTOR_MAX_AVOIDANCE_TIME (cut-n-paste info from section 3.3 of the HTCondor Manual is at the bottom of this email).

As to why it is happening, that is a bigger mystery. Does it happen all the time or only on occasion? It would appear that the collector is failing to accept the incoming connection from condor_status fast enough. Maybe the CollectorLog can provide some clues? Random guesses: maybe the collector process is blocked on I/O for many seconds trying to write (perhaps to the CollectorLog) to a volume that is NFS mounted and currently down, or perhaps the collector is being hammered by many simultaneous instances of condor_status running in the background, or perhaps the collector process is CPU starved because 64 jobs are running on the same box (in which case I'd suggest setting JOB_RENICE_INCREMENT=10 in condor_config so that jobs run at a lower priority than the HTCondor system services themselves), ...
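
As a rough illustration of the two knobs mentioned above, a condor_config fragment might look like the following. The 300-second avoidance time is only an example value chosen for this sketch, not a recommendation; the renice increment of 10 is the value suggested above.

```
# Example value only: avoid a collector that failed slowly for at most
# 5 minutes instead of the default of one hour (value is in seconds).
DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 300

# Run jobs at a lower priority than the HTCondor daemons themselves.
JOB_RENICE_INCREMENT = 10
```

After editing the config, running condor_reconfig (or restarting HTCondor) should make the daemons pick up the change.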

I'm using the information returned by
condor_status -total in a program to determine whether I should launch
new jobs or not.

Why not just queue up thousands of jobs at once and be done with it? I.e., do a "queue 10000" in your submit file. Or if you have hundreds of thousands/millions of jobs, you could submit them as a simple DAGMan job and let DAGMan throttle the submissions. FWIW, DAGMan throttles submissions not by looking at condor_status, but instead by looking at how many jobs are idle. When too few jobs are idle, it submits new jobs... when too many jobs are idle, it stops submitting new jobs. This algorithm works under more situations and is simpler than looking at machine resources and trying to figure out how many more jobs to submit. Just food for thought.
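
For anyone who still wants to throttle from their own program, the idle-based idea described above can be sketched in a few lines of Python. This is not DAGMan's actual code, just an illustration of the decision rule; the function name, its parameters, and the default cap of 1000 idle jobs are all hypothetical.

```python
def jobs_to_submit(idle_jobs: int, remaining: int, max_idle: int = 1000) -> int:
    """Decide how many new jobs to submit right now (illustrative only).

    idle_jobs: jobs already in the queue that are idle (not yet running)
    remaining: jobs that have not been submitted yet
    max_idle:  hypothetical cap on how many idle jobs we want queued
    """
    if idle_jobs < 0 or remaining < 0 or max_idle < 0:
        raise ValueError("counts must be non-negative")
    # Top the queue back up to the cap; submit nothing if already at or over it.
    headroom = max(0, max_idle - idle_jobs)
    # Never submit more jobs than are actually left.
    return min(headroom, remaining)


if __name__ == "__main__":
    # Few idle jobs: submit enough to reach the cap.
    print(jobs_to_submit(idle_jobs=990, remaining=5000))
    # Too many idle jobs: hold off.
    print(jobs_to_submit(idle_jobs=1500, remaining=5000))
```

The idle count would have to come from the job queue (for example by parsing condor_q output) rather than from condor_status, so the collector is not involved in the decision at all. DAGMan has its own built-in limit for this (for example, the -maxidle option of condor_submit_dag), which is usually simpler than rolling your own.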


Hope the above helps,
Todd

From the HTCondor Manual ---

DEAD_COLLECTOR_MAX_AVOIDANCE_TIME

Defines the interval of time (in seconds) between checks for a failed primary condor_collector daemon. If connections to the dead primary condor_collector take very little time to fail, new attempts to query the primary condor_collector may be more frequent than the specified maximum avoidance time. The default value equals one hour. This variable has relevance to flocked jobs, as it defines the maximum time they may be reporting to the primary condor_collector without the condor_negotiator noticing.
