
Re: [HTCondor-users] blacklisted local host Collector

On 3/26/2015 9:29 AM, Richard Crozier wrote:

I'm running a personal condor pool on a machine with 64 nodes. Sometimes

condor_status -total -debug

reports

03/26/15 13:40:25 Collector blacklisted; skipping

I gather from other mailing list posts this means the localhost will be
skipped for an hour?

Can anyone suggest how to prevent this, or why it's happening? Can I
shorten the blacklisting time, or reset the blacklisting (condor_restart
doesn't seem to do it)?

If an HTCondor tool or daemon is attempting to query a collector and a) that connection attempt failed, and b) it took an abnormally long period of time to fail, then that tool or daemon will not attempt to connect with that collector for a default of one hour. You can control the time via the config knob DEAD_COLLECTOR_MAX_AVOIDANCE_TIME (cut-n-paste info from section 3.3 of the HTCondor Manual is at the bottom of this email).
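
For example, to shorten the avoidance period to five minutes, a line like the following could go in condor_config. The value 300 is only an illustration; per the manual text the knob is given in seconds. Running daemons would need a condor_reconfig to pick up the change.

```
# condor_config (or a local config file); value is in seconds
DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 300
```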

As to why it is happening, that is a bigger mystery. Does it happen all the time or only on occasion? It would appear that the collector is failing to accept the incoming connection from condor_status fast enough. Maybe the CollectorLog can provide some clues? Random guesses: maybe the collector process is blocked on I/O for many seconds trying to write (perhaps to the CollectorLog) to a volume that is NFS mounted and currently down, or perhaps the collector is being hammered by many simultaneous instances of condor_status running in the background, or perhaps the collector process is CPU starved because 64 jobs are running on the same box (in which case I'd suggest setting JOB_RENICE_INCREMENT=10 in condor_config so that jobs run at a lower priority than the HTCondor system services themselves), ....

I'm using the information returned by
condor_status -total in a program to determine whether I should launch
new jobs or not.

Why not just queue up thousands of jobs at once and be done with it? I.e., do a "queue 10000" in your submit file. Or if you have hundreds of thousands/millions of jobs, you could submit them as a simple DAGMan job and let DAGMan throttle the submissions. FWIW, DAGMan throttles submissions not by looking at condor_status, but instead by looking at how many jobs are idle. When too few jobs are idle, it submits new jobs... when too many jobs are idle, it stops submitting new jobs. This algorithm works under more situations and is simpler than looking at machine resources and trying to figure out how many more jobs to submit. Just food for thought.
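
If you do want to keep your own launcher program, the idle-based idea can be sketched in a few lines of Python. This is only an illustration of the throttling rule described above, not DAGMan's actual code; the function name jobs_to_submit and the parameters max_idle and batch_size are made up for the example, and how you obtain the current idle count (for instance from condor_q) is left to the caller.

```python
def jobs_to_submit(idle_now, pending, max_idle, batch_size):
    """Return how many not-yet-submitted jobs to submit this polling cycle.

    idle_now   -- jobs currently idle in the queue (supplied by the caller)
    pending    -- jobs the launcher still has left to submit
    max_idle   -- hypothetical ceiling on idle jobs; at or above it, submit nothing
    batch_size -- hypothetical cap on submissions per cycle
    """
    if idle_now >= max_idle:
        # Enough jobs are already waiting for a slot, so hold off.
        return 0
    # Top the idle pool back up, without exceeding the per-cycle cap
    # or the number of jobs that remain to be submitted.
    room = max_idle - idle_now
    return min(room, batch_size, pending)


if __name__ == "__main__":
    # 3 idle jobs, ceiling of 10, at most 5 per cycle -> submit 5
    print(jobs_to_submit(idle_now=3, pending=500, max_idle=10, batch_size=5))
```

The decision depends only on the queue, so it never needs to query the collector with condor_status.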

Hope the above helps,

From the HTCondor Manual ---

DEAD_COLLECTOR_MAX_AVOIDANCE_TIME

Defines the interval of time (in seconds) between checks for a failed primary condor_collector daemon. If connections to the dead primary condor_collector take very little time to fail, new attempts to query the primary condor_collector may be more frequent than the specified maximum avoidance time. The default value equals one hour. This variable has relevance to flocked jobs, as it defines the maximum time they may be reporting to the primary condor_collector without the condor_negotiator noticing.