Re: [HTCondor-users] blacklisted local host Collector
- Date: Fri, 27 Mar 2015 10:00:55 +0000
- From: Richard Crozier <richard.crozier@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] blacklisted local host Collector
On 26/03/15 15:41, Todd Tannenbaum wrote:
On 3/26/2015 9:29 AM, Richard Crozier wrote:
Hello,
I'm running a personal condor pool on a machine with 64 nodes.
Sometimes I get the following:
condor_status -total -debug
03/26/15 13:40:25 Collector 127.0.0.1 blacklisted; skipping
I gather from other mailing list posts this means the localhost will be
skipped for an hour?
Can anyone suggest how to prevent this, or why it's happening? Can I
shorten the blacklisting time, or reset the blacklisting (condor_restart
doesn't seem to do it)?
If an HTCondor tool or daemon is attempting to query a collector and a)
that connection attempt failed, and b) it took an abnormally long period
of time to fail, then that tool or daemon will not attempt to connect
with that collector for a default of one hour. You can control the time
via config knob DEAD_COLLECTOR_MAX_AVOIDANCE_TIME ( cut-n-paste info
from section 3.3 of the HTCondor Manual is at the bottom of this email ).
Thanks, this is helpful; my google-fu was weak. I suppose actually
*looking in the manual* is a reasonable place to start. :-)
As to why it is happening, that is a bigger mystery. Does it happen all
the time or only on occasion? It would appear that the collector is
failing to accept the incoming connection from condor_status fast
enough. Maybe the CollectorLog can provide some clues? Random guesses:
maybe the collector process is blocked on I/O for many seconds trying to
write (perhaps to the CollectorLog) to a volume that is NFS mounted and
currently down, or perhaps the collector is being hammered by many
simultaneous instances of condor_status running in the background, or
perhaps the collector process is CPU starved because 64 jobs are running
on the same box (in which case I'd suggest setting
JOB_RENICE_INCREMENT=10 in condor_config so that jobs run at a lower
priority than the HTCondor system services themselves), ....
It happens only occasionally. I was only using 32 out of 64 processors
for the jobs in this instance, although another user (there's only one
other heavy user) might have been using up more at the same time, not
via condor, but I doubt this really. I might RENICE if it happens again
to see if this makes any difference, thanks.
condor_status would not have been called by any process more than around
once per second, and only by one (master) process (see below). Condor
should only be writing to local hard drives on the local machine. The
whole setup is the default "personal" machine setup chosen when I
installed Condor from the Ubuntu package repositories (I'm using Mint
Linux 17).
I'm using the information returned by
condor_status -total in a program to determine whether I should launch
new jobs or not.
Why not just queue up thousands of jobs at once and be done with it?
Hope the above helps,
Todd
Well, what I'm actually doing is running a pool of Matlab (or
Octave) slaves controlled by a master Matlab process. These slaves are
used to run an embarrassingly parallel computation (a genetic algorithm
based optimisation). They communicate and are controlled via a shared
directory. I do it this way because the Matlab launch time is
non-trivial. The Matlab slaves never stop unless I tell them to, just
keep listening for more jobs from the master. They sometimes fail, in
which case my master launches new slaves via condor, or kills all the
slaves depending on the circumstances, generally managing them to keep
their number to a predefined limit. Using condor also lets me check I’m
not blocking other users who might want to use the pool. Only the master
calls condor_status, and fairly infrequently (many seconds, or minutes
between calls normally).
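To make the check concrete, here is a rough shell sketch of the kind of
test the master performs on the condor_status -total output. It is only
an illustration, not the actual Matlab code: the function names
unclaimed_slots and should_launch and the launch policy are made up, and
it assumes the output has a header row naming the columns (including
"Unclaimed") followed by a final row whose first field is "Total". That
layout may differ between HTCondor versions.

```shell
#!/bin/sh
# Hypothetical helper: read `condor_status -total` output on stdin and
# print the Unclaimed count from the final "Total" row.
# Assumption: a header row lists the column names, and a later row whose
# first field is "Total" holds the pool-wide counts.
unclaimed_slots() {
    awk '
        # Until the header row is found, look for the Unclaimed column.
        col == 0 {
            for (i = 1; i <= NF; i++) if ($i == "Unclaimed") col = i
            # Data rows carry one extra leading field (the row label).
            if (col) col = col + 1
            next
        }
        $1 == "Total" { n = $col }
        # Exit non-zero if no Total row was seen (e.g. collector skipped).
        END { if (n == "") exit 1; print n }
    '
}

# Hypothetical policy: succeed when fewer than $1 slaves are running
# ($2 is the current slave count) and at least one slot is unclaimed.
# Returns 2 when the output could not be parsed at all.
should_launch() {
    limit=$1
    running=$2
    free=$(unclaimed_slots) || return 2
    [ "$running" -lt "$limit" ] && [ "$free" -gt 0 ]
}
```

It would be used along the lines of
condor_status -total | should_launch 32 "$n_running"
where n_running is whatever count of live slaves the master keeps. The
separate return code 2 lets the caller tell "pool is full" apart from
"no totals came back", which is what happens when the collector is being
skipped.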
The odd thing is, even when I am getting this message when calling
condor_status from matlab, I do not get the same response when doing it
manually on the same machine outside of matlab at the same time.
Generally I am running matlab via 'gnu screen' so I can log out. Matlab
calls the condor_status command via its 'system' command. The only
thing I can think of is that Matlab might set up or clear environment
variables in the shell in which it runs the commands which might affect
this. I know, for example, matlab sets its own LD_LIBRARY_PATH
variable, because I had to clear this specially set path or else I get
errors from condor which then can't find needed libraries.
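The library path workaround amounts to something like the following
shell sketch. The wrapper name run_without_ld_path is hypothetical, and
it only removes LD_LIBRARY_PATH; any other variable Matlab may set or
clear is left as it is.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with LD_LIBRARY_PATH removed from
# its environment. The subshell keeps the caller's environment untouched.
run_without_ld_path() {
    ( unset LD_LIBRARY_PATH; "$@" )
}
```

For example: run_without_ld_path condor_status -total -debug
Because the variable is only unset inside the subshell, the Matlab
session that issued the command keeps its own library path.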
It's difficult to replicate the error. When/if it happens again I'll try
to send the Collector Log.
I tried to set DEAD_COLLECTOR_MAX_AVOIDANCE_TIME using:
$ condor_config_val -set "DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 60"
but got:
Attempt to set configuration "DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 60" on
master chameleon <127.0.0.1:44031> failed.
Instead I opened /etc/condor/condor_config
and added the line
DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 60
Is this right? I couldn't find DEAD_COLLECTOR_MAX_AVOIDANCE_TIME
existing anywhere in this file already.
Finally, in case it's relevant:
$ condor_version
$CondorVersion: 8.0.5 Jan 14 2014 BuildID: Debian-8.0.5~dfsg.1-1ubuntu1
Debian-8.0.5~dfsg.1-1ubuntu1 $
$CondorPlatform: X86_64-Ubuntu_ $
Thanks for the help.
Richard