Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] blacklisted local host Collector

Date: Fri, 27 Mar 2015 10:00:55 +0000
From: Richard Crozier <richard.crozier@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] blacklisted local host Collector

On 26/03/15 15:41, Todd Tannenbaum wrote:

On 3/26/2015 9:29 AM, Richard Crozier wrote:

Hello,

I'm running a personal condor pool on a machine with 64 nodes.
Sometimes, running

condor_status -total -debug

reports:

03/26/15 13:40:25 Collector 127.0.0.1 blacklisted; skipping

I gather from other mailing list posts this means the localhost will be
skipped for an hour?

Can anyone suggest how to prevent this, or why it's happening? Can I
shorten the blacklisting time, or reset the blacklisting (condor_restart
doesn't seem to do it)?


If an HTCondor tool or daemon is attempting to query a collector and a)
that connection attempt failed, and b) it took an abnormally long period
of time to fail, then that tool or daemon will not attempt to connect
with that collector for a default of one hour.  You can control the time
via config knob DEAD_COLLECTOR_MAX_AVOIDANCE_TIME ( cut-n-paste info
from section 3.3 of the HTCondor Manual is at the bottom of this email ).

Thanks, this is helpful; my google-fu was weak. I suppose actually
*looking in the manual* is a reasonable place to start. :-)
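The avoidance rule Todd describes can be pictured with a small Python model. This is only a sketch of the behaviour as stated in this thread, not HTCondor's actual code: the class name, the `slow_failure_s` threshold for "abnormally long", and the injected clock are all assumptions made for illustration. Only `max_avoidance_s` corresponds to a real knob (DEAD_COLLECTOR_MAX_AVOIDANCE_TIME, default one hour).

```python
import time


class CollectorAvoidance:
    """Toy model of the avoidance rule described above (not HTCondor's code)."""

    def __init__(self, max_avoidance_s=3600.0, slow_failure_s=10.0,
                 clock=time.monotonic):
        # Mirrors DEAD_COLLECTOR_MAX_AVOIDANCE_TIME (default: one hour).
        self.max_avoidance_s = max_avoidance_s
        # Hypothetical threshold for a failure that took "abnormally long".
        self.slow_failure_s = slow_failure_s
        self.clock = clock
        # Collector address -> time at which queries may resume.
        self.avoid_until = {}

    def record_failure(self, address, seconds_to_fail):
        """Start avoiding a collector only if the failure was slow."""
        if seconds_to_fail >= self.slow_failure_s:
            self.avoid_until[address] = self.clock() + self.max_avoidance_s

    def should_skip(self, address):
        """Return True while the collector is still being avoided."""
        deadline = self.avoid_until.get(address)
        if deadline is None:
            return False
        if self.clock() >= deadline:
            # Avoidance window has expired; allow a fresh attempt.
            del self.avoid_until[address]
            return False
        return True
```

In this model a quick failure never triggers avoidance, which matches Todd's two conditions: the attempt must fail and must be slow to fail.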

As to why it is happening, that is a bigger mystery. Does it happen all
the time or only on occasion? It would appear that the collector is
failing to accept the incoming connection from condor_status fast
enough. Maybe the CollectorLog can provide some clues?  Random guesses:
maybe the collector process is blocked on I/O for many seconds trying to
write (perhaps to the CollectorLog) to a volume that is NFS mounted and
currently down, or perhaps the collector is being hammered by many
simultaneous instances of condor_status running in the background, or
perhaps the collector process is CPU starved because 64 jobs are running
on the same box (in which case I'd suggest setting
JOB_RENICE_INCREMENT=10 in condor_config so that jobs run at a lower
priority than the HTCondor system services themselves), ....

It happens only occasionally. I was only using 32 out of 64 processors
for the jobs in this instance, although another user (there's only one
other heavy user) might have been using up more at the same time, not
via condor, but I doubt this really. I might RENICE if it happens again
to see if this makes any difference, thanks.

Condor status would not have been called by any process more than around
once per second, and only by one (master) process (see below). Condor
should only be writing to local hard drives on the local machine. The
whole setup is the default "personal" machine setup chosen when I
installed Condor from the Ubuntu package repositories (I'm using Mint
Linux 17).

I'm using the information returned by
condor_status -total in a program to determine whether I should launch
new jobs or not.


Why not just queue up thousands of jobs at once and be done with it?

Hope the above helps,
Todd

Well, what I'm actually doing is running a pool of Matlab (or
Octave) slaves controlled by a master Matlab process. These slaves are
used to run an embarrassingly parallel computation (a genetic algorithm
based optimisation). They communicate and are controlled via a shared
directory. I do it this way because the Matlab launch time is
non-trivial. The Matlab slaves never stop unless I tell them to; they just
keep listening for more jobs from the master. They sometimes fail, in
which case my master launches new slaves via condor, or kills all the
slaves depending on the circumstances, generally managing them to keep
their number to a predefined limit. Using condor also lets me check I’m
not blocking other users who might want to use the pool. Only the master
calls condor_status, and fairly infrequently (many seconds, or minutes
between calls normally).
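The launch decision described here (top the slaves up to a predefined limit without crowding out other users) might be sketched as follows. This is an illustration in Python rather than the Matlab master itself; the function name, the `reserve` parameter, and the idea of passing in a count of free slots taken from condor_status are assumptions, not details from this setup.

```python
def slaves_to_launch(running, limit, free_slots, reserve=0):
    """Return how many new slaves to submit right now.

    running    -- slaves currently alive (hypothetical count kept by the master)
    limit      -- predefined maximum number of slaves
    free_slots -- unclaimed slots, e.g. derived from condor_status -total
    reserve    -- slots deliberately left free for other users (assumption)
    """
    if min(running, limit, free_slots, reserve) < 0:
        raise ValueError("counts must be non-negative")
    wanted = max(limit - running, 0)             # shortfall against the limit
    available = max(free_slots - reserve, 0)     # what the pool can spare
    return min(wanted, available)
```

For example, with 28 slaves alive, a limit of 32 and 10 free slots, the function asks for 4 new slaves; if the pool only has 5 free slots and 2 are held back, it never asks for more than 3.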

The odd thing is, even when I am getting this message when calling
condor_status from matlab, I do not get the same response when doing it
manually on the same machine outside of matlab at the same time.
Generally I am running matlab via 'gnu screen' so I can log out. Matlab
calls the condor_status command via its 'system' command. The only
thing I can think of is that Matlab might set up or clear environment
variables in the shell in which it runs the commands which might affect
this. I know, for example, matlab sets its own LD_LIBRARY_PATH
variable, because I had to clear this specially set path or else I get
errors from condor which then can't find needed libraries.
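One way to rule the environment in or out is to run the tool with the suspect variables removed and compare the result. The sketch below shows the idea in Python rather than through Matlab's 'system' call; the helper names `env_without` and `run_clean` are made up for illustration, and LD_LIBRARY_PATH is simply the variable mentioned above.

```python
import os
import subprocess


def env_without(names, base=None):
    """Return a copy of the environment with the given variables removed."""
    env = dict(os.environ if base is None else base)
    for name in names:
        env.pop(name, None)  # ignore variables that are not set
    return env


def run_clean(cmd, drop=("LD_LIBRARY_PATH",)):
    """Run cmd with the listed variables stripped from its environment."""
    return subprocess.run(cmd, env=env_without(drop),
                          capture_output=True, text=True, check=False)
```

A call such as `run_clean(["condor_status", "-total", "-debug"])` made from the same session would then show whether the blacklisting message still appears once the inherited library path is gone.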

It's difficult to replicate the error. When/if it happens again I'll try
to send the Collector Log.


I tried to set DEAD_COLLECTOR_MAX_AVOIDANCE_TIME using:

$ condor_config_val -set "DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 60"

but got:

Attempt to set configuration "DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 60" on
master chameleon <127.0.0.1:44031> failed.


Instead I opened /etc/condor/condor_config

and added the line

DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 60

Is this right? I couldn't find DEAD_COLLECTOR_MAX_AVOIDANCE_TIME
existing anywhere in this file already.
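A quick way to check whether an edit to the config file has been picked up is to ask condor_config_val for the knob by name, which prints the value the tools currently see. The Python wrapper below is a minimal sketch: the function name and the injectable `run` argument are assumptions for illustration, and it treats any non-zero exit status as "not defined", which is also an assumption about how the tool reports an unset knob.

```python
import subprocess


def effective_value(name, run=subprocess.run):
    """Return the value condor_config_val reports for a knob, or None.

    Raises FileNotFoundError if condor_config_val is not on the PATH.
    """
    result = run(["condor_config_val", name],
                 capture_output=True, text=True)
    if result.returncode != 0:
        # Assumed to mean the knob is not defined in the configuration.
        return None
    return result.stdout.strip()
```

Calling `effective_value("DEAD_COLLECTOR_MAX_AVOIDANCE_TIME")` after editing the file should then return "60". Running daemons generally need condor_reconfig before they reread the configuration files.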


Finally, in case it's relevant:

$ condor_version

$CondorVersion: 8.0.5 Jan 14 2014 BuildID: Debian-8.0.5~dfsg.1-1ubuntu1Debian-8.0.5~dfsg.1-1ubuntu1 $

$CondorPlatform: X86_64-Ubuntu_ $


Thanks for the help.

Richard

Follow-Ups:
- Re: [HTCondor-users] blacklisted local host Collector
  - From: Jaime Frey

References:
- [HTCondor-users] blacklisted local host Collector
  - From: Richard Crozier
- Re: [HTCondor-users] blacklisted local host Collector
  - From: Todd Tannenbaum
