Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] misconfigured node

Date: Tue, 27 May 2014 11:12:25 -0400
From: Ben Cotton <ben.cotton@xxxxxxxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] misconfigured node

On Sat, May 24, 2014 at 7:45 AM, Rita <rmorgan466@xxxxxxxxx> wrote:
> i know a user can setup a "Blackhole" policy but I was wondering if there is
> something I can do on the startd side to avoid black holes. Would it be
> possible to run a test to see if the blackhole problem is occurring?
>
It's certainly possible, but I don't know how practical it would be.
Assuming you know what the root cause of the black hole state is (for
example, I've seen it happen when NFS mounts hang on the execute
node), you could write a test that runs as a startd cron. The START
expression on the execute node could then take that into account.

For example, if you know it's because of bad NFS mounts, you can have
your test run the mount command and, if it doesn't return within N
seconds, publish NODE_CHECK_MOUNTS = False. A basic START expression
for a dedicated execute node would then be
START = NODE_CHECK_MOUNTS
(the bare attribute name, because the startd cron publishes it into the
machine ClassAd rather than into the configuration).
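
A rough sketch of such a test, in Python, might look like the following.
It is only an illustration of the idea above, not a tested production
check: the 10-second value standing in for N, the helper names
command_returns_in_time and classad_line, and the choice to treat a
non-zero exit status as a failure are all assumptions made for the
example. It relies on a startd cron job publishing attributes by
printing "Attribute = Value" lines on standard output.

```python
#!/usr/bin/env python3
"""Hypothetical startd cron check: does the mount command return in time?"""
import subprocess

# Stand-in for the "N seconds" above; the value is an assumption.
TIMEOUT_SECONDS = 10


def command_returns_in_time(argv, timeout):
    """Return True if argv exits with status 0 within `timeout` seconds."""
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the command could not be started at all.
        return False
    return result.returncode == 0


def classad_line(healthy):
    """Format the attribute line that the startd cron job prints."""
    return "NODE_CHECK_MOUNTS = {}".format("True" if healthy else "False")


if __name__ == "__main__":
    # Run the check once and publish the result on standard output.
    print(classad_line(command_returns_in_time(["mount"], TIMEOUT_SECONDS)))
```

The script would be registered on the execute node through the startd
cron configuration knobs (STARTD_CRON_JOBLIST plus the per-job
STARTD_CRON_<JobName>_EXECUTABLE and STARTD_CRON_<JobName>_PERIOD). One
caveat: a child process stuck in uninterruptible I/O on a hung NFS mount
may not die when killed after the timeout, so the script itself can
still hang in that case.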

I'd be interested in hearing what leads to a black hole state. I
started a wiki page
(https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=BlackHoleConditions)
for people to document these conditions as they find them.

For those unaware of the blackhole policy referred to above, see:
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=AvoidingBlackHoles

Thanks,
BC

-- 
Ben Cotton
main: 888.292.5320

Cycle Computing
Leader in Utility HPC Software

http://www.cyclecomputing.com
twitter: @cyclecomputing


