
Re: [HTCondor-users] blacklisted local host Collector



On Mar 27, 2015, at 5:00 AM, Richard Crozier <richard.crozier@xxxxxxxxxxx> wrote:

On 26/03/15 15:41, Todd Tannenbaum wrote:
On 3/26/2015 9:29 AM, Richard Crozier wrote:
Hello,

I'm running a personal condor pool on a machine with 64 nodes. Sometimes, when I run

condor_status -total -debug

I get:

03/26/15 13:40:25 Collector 127.0.0.1 blacklisted; skipping

I gather from other mailing list posts this means the localhost will be
skipped for an hour?

Can anyone suggest how to prevent this, or why it's happening? Can I
shorten the blacklisting time, or reset the blacklisting (condor_restart
doesn't seem to do it)?

If an HTCondor tool or daemon is attempting to query a collector and a)
that connection attempt failed, and b) it took an abnormally long period
of time to fail, then that tool or daemon will not attempt to connect
with that collector for a default of one hour. You can control the time
via the config knob DEAD_COLLECTOR_MAX_AVOIDANCE_TIME (cut-and-paste info
from section 3.3 of the HTCondor Manual is at the bottom of this email).


Thanks, this is helpful; my google-fu was weak. I suppose actually *looking in the manual* is a reasonable place to start. :-)

As to why it is happening, that is a bigger mystery. Does it happen all
the time or only on occasion? It would appear that the collector is
failing to accept the incoming connection from condor_status fast
enough. Maybe the CollectorLog can provide some clues. Random guesses:
maybe the collector process is blocked on I/O for many seconds trying to
write (perhaps to the CollectorLog) to a volume that is NFS mounted and
currently down; or perhaps the collector is being hammered by many
simultaneous instances of condor_status running in the background; or
perhaps the collector process is CPU starved because 64 jobs are running
on the same box (in which case I'd suggest setting
JOB_RENICE_INCREMENT = 10 in condor_config so that jobs run at a lower
priority than the HTCondor system services themselves).
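
(For reference, a minimal sketch of that last suggestion; exactly which file to edit depends on the install, but the main condor_config or any local config file it pulls in will do:

  # run user jobs at a lower priority than the HTCondor daemons themselves
  JOB_RENICE_INCREMENT = 10

followed by a condor_reconfig so the running daemons re-read their configuration.)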


It happens only occasionally. I was only using 32 of the 64 processors for jobs in this instance, although another user (there's only one other heavy user) might have been using more at the same time, not via condor, but I doubt that really. I might try the renice setting if it happens again to see whether it makes any difference, thanks.

condor_status would not have been called by any process more than around once per second, and only by one (master) process (see below). Condor should only be writing to local hard drives on the local machine. The whole setup is the default "personal" machine setup chosen when I installed Condor from the Ubuntu package repositories (I'm using Linux Mint 17).

I'm using the information returned by
condor_status -total in a program to determine whether I should launch
new jobs or not.


Why not just queue up thousands of jobs at once and be done with it?

Hope the above helps,
Todd


Well, what I'm actually doing is running a pool of Matlab (or Octave) slaves controlled by a master Matlab process. These slaves are used to run an embarrassingly parallel computation (a genetic algorithm based optimisation). They communicate with the master, and are controlled by it, via a shared directory. I do it this way because the Matlab launch time is non-trivial. The Matlab slaves never stop unless I tell them to; they just keep listening for more jobs from the master. They sometimes fail, in which case my master launches new slaves via condor, or kills all the slaves, depending on the circumstances, generally managing them to keep their number to a predefined limit. Using condor also lets me check I'm not blocking other users who might want to use the pool. Only the master calls condor_status, and fairly infrequently (many seconds, or minutes, between calls normally).
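
(For illustration only, a rough sketch of the kind of availability check such a master could run before launching another slave; this is not the actual code from this setup, and it uses condor_status -autoformat rather than parsing the -total table. The submit file name and threshold are made up:

  # count unclaimed slots; only ask condor to start another slave if some appear free
  unclaimed=$(condor_status -autoformat State | grep -c Unclaimed)
  if [ "$unclaimed" -gt 0 ]; then
      condor_submit slave.sub   # hypothetical submit file for one slave process
  fi

The real master applies its own limit on the total number of slaves, as described above.)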

The odd thing is, even when I am getting this message when calling condor_status from Matlab, I do not get the same response when running it manually on the same machine, outside of Matlab, at the same time. Generally I am running Matlab via GNU screen so I can log out. Matlab calls the condor_status command via its 'system' command. The only thing I can think of is that Matlab might set up or clear environment variables in the shell in which it runs the commands, which might affect this. I know, for example, that Matlab sets its own LD_LIBRARY_PATH, because I had to clear this specially set path or else I got errors from condor, which then couldn't find the libraries it needs.
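
(If Matlab's LD_LIBRARY_PATH does turn out to be the culprit, one simple workaround, sketched here under the assumption that the command string is handed straight to a shell, is to strip the variable for just that one command:

  # run condor_status with Matlab's LD_LIBRARY_PATH removed from the environment
  env -u LD_LIBRARY_PATH condor_status -total

The same string could be passed to Matlab's 'system' call.)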

It's difficult to replicate the error. When/if it happens again I'll try to send the Collector Log.

I tried to set DEAD_COLLECTOR_MAX_AVOIDANCE_TIME using:

$ condor_config_val -set "DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 60"

but got:

Attempt to set configuration "DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 60" on master chameleon <127.0.0.1:44031> failed.

Instead I opened /etc/condor/condor_config

and added the line

DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 60

Is this right? I couldn't find DEAD_COLLECTOR_MAX_AVOIDANCE_TIME anywhere in this file already.
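
(A quick way to sanity-check this, sketched under the assumption that the default config layout is in use: query the value the tools will actually see, and nudge the running daemons to re-read their config:

  # show the value HTCondor tools and daemons will use for this knob
  condor_config_val DEAD_COLLECTOR_MAX_AVOIDANCE_TIME

  # ask the running daemons to re-read their configuration files
  condor_reconfig

)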

Finally, in case it's relevant:

$ condor_version
$CondorVersion: 8.0.5 Jan 14 2014 BuildID: Debian-8.0.5~dfsg.1-1ubuntu1 Debian-8.0.5~dfsg.1-1ubuntu1 $
$CondorPlatform: X86_64-Ubuntu_ $

The dead collector avoidance feature is intended for pools that have multiple collectors set up for fault tolerance. Daemons send their information to all of the collectors and queriers pick a random collector to contact. If one collector goes down, queriers will avoid it after failing to contact it for the first time. This feature isn’t useful when you only have one collector.
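
(For context, a pool that does use this feature typically lists more than one collector in its configuration, along the lines of the sketch below; the host names here are purely illustrative:

  # two collectors for fault tolerance; daemons report to both,
  # and queriers pick one at random
  COLLECTOR_HOST = collector1.example.com, collector2.example.com

With a single collector, as in a personal pool, there is no alternate collector to fall back to.)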

We’ve had another user report the same problem that you’re experiencing, occurring occasionally when they do thousands of queries. We haven’t been able to track down the cause yet.

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project