Re: [HTCondor-users] misconfigured node



On Sat, May 24, 2014 at 7:45 AM, Rita <rmorgan466@xxxxxxxxx> wrote:
> I know a user can set up a "black hole" policy, but I was wondering if
> there is something I can do on the startd side to avoid black holes.
> Would it be possible to run a test to see if the black hole problem is
> occurring?
>
It's certainly possible, but I don't know how practical it would be.
Assuming you know what the root cause of the black hole state is (for
example, I've seen it happen when NFS mounts hang on the execute
node), you could write a test that runs as a startd cron. The START
expression on the execute node could then take that into account.

For example, if you know it's because of bad NFS mounts, you can have
your test run the mount command and, if it doesn't return within N
seconds, publish NODE_CHECK_MOUNTS = False. Startd cron output lands in
the machine ClassAd, so a basic START expression for a dedicated
execute node would be START = (NODE_CHECK_MOUNTS =?= True).
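
As a rough sketch of the kind of test described above (not a vetted
production check), the script below runs the mount command with a time
limit and prints one ClassAd attribute line on stdout, which is how a
startd cron job reports its result. The 10-second limit, the function
names, and the decision to treat a non-zero exit status or a missing
command as a failure are all assumptions made for illustration.

```python
#!/usr/bin/env python3
"""Sketch of a startd cron test: does `mount` return within N seconds?"""
import subprocess
import sys

ATTR_NAME = "NODE_CHECK_MOUNTS"   # attribute name from the example above
TIMEOUT_SECONDS = 10              # the "N seconds"; assumed value


def command_returns_in_time(cmd, timeout_s):
    """Return True if cmd exits with status 0 within timeout_s seconds."""
    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        # Command missing or not executable: treated as a failed check
        # (an assumption, adjust to taste).
        return False
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best-effort kill. We deliberately do not wait for the child
        # afterwards: a process stuck on a hung NFS mount may never exit.
        proc.kill()
        return False


def classad_line(name, ok):
    """Format a boolean result as an 'Attr = Value' ClassAd line."""
    return f"{name} = {'True' if ok else 'False'}"


def main():
    ok = command_returns_in_time(["mount"], TIMEOUT_SECONDS)
    print(classad_line(ATTR_NAME, ok))


if __name__ == "__main__":
    main()
```

To have the startd run it periodically, the job would be registered
with the startd cron settings (STARTD_CRON_JOBLIST plus the per-job
STARTD_CRON_<JobName>_EXECUTABLE and STARTD_CRON_<JobName>_PERIOD
knobs); check the manual for your HTCondor version for the exact
options. The published attribute then appears in the machine ClassAd,
where the START expression can refer to it.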

I'd be interested in hearing what leads to a black hole state. I
started a wiki page
(https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=BlackHoleConditions)
for people to document these conditions as they find them.

For those unaware of the black hole policy referred to above, see:
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=AvoidingBlackHoles


Thanks,
BC

-- 
Ben Cotton
main: 888.292.5320

Cycle Computing
Leader in Utility HPC Software

http://www.cyclecomputing.com
twitter: @cyclecomputing