
Re: [HTCondor-users] misconfigured node



On Tue, May 27, 2014 at 11:12 AM, Ben Cotton
<ben.cotton@xxxxxxxxxxxxxxxxxx> wrote:
> On Sat, May 24, 2014 at 7:45 AM, Rita <rmorgan466@xxxxxxxxx> wrote:
>> I know a user can set up a "Blackhole" policy but I was wondering if there is
>> something I can do on the startd side to avoid black holes. Would it be
>> possible to run a test to see if the blackhole problem is occurring?
>>
> It's certainly possible, but I don't know how practical it would be.
> Assuming you know what the root cause of the black hole state is (for
> example, I've seen it happen when NFS mounts hang on the execute
> node), you could write a test that runs as a startd cron. The START
> expression on the execute node could then take that into account.
>
> For example, if you know it's because of bad NFS mounts, you can have
> your test run the mount command and if it doesn't return within N
> seconds, it would publish NODE_CHECK_MOUNTS = False. A basic start
> expression for a dedicated execute node would basically be START =
> $(NODE_CHECK_MOUNTS)

I'm looking at tackling this issue currently and would be interested
in any scripts or thoughts on how best to do the tests.  I'm a little
leery, as I understand that many filesystem tests at the Linux level
will effectively hang on the I/O and never return.  If condor cron
were set up to cycle every 1s and check the mount, I could see a stack
of processes backing up.
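
One way to keep the check itself from hanging is to run the filesystem
probe in a child process and give up after a timeout.  The following is
only a rough sketch, not a tested production script: the helper names
(check_command, classad_line), the use of "stat" as the probe, the
5 second timeout and the /home default path are all assumptions made
for illustration.  It prints the NODE_CHECK_MOUNTS attribute mentioned
above so that a startd cron job could publish it.

```python
#!/usr/bin/env python3
"""Sketch of a startd cron probe that reports whether mounts respond."""
import subprocess
import sys


def check_command(cmd, timeout_s=5.0):
    """Return True if cmd exits with status 0 within timeout_s seconds."""
    try:
        # Run the probe in a child so a hung I/O call blocks the child, not us.
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        # The probe command could not be started at all.
        return False
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in uninterruptible I/O may linger.
        # We deliberately do not wait for it, so this script still returns.
        proc.kill()
        return False


def classad_line(ok):
    """Format the result as a ClassAd attribute assignment."""
    return "NODE_CHECK_MOUNTS = %s" % ("True" if ok else "False")


if __name__ == "__main__":
    # Paths to probe come from the command line; /home is a placeholder.
    paths = sys.argv[1:] or ["/home"]
    print(classad_line(all(check_command(["stat", p]) for p in paths)))
```

This does not fully remove the pile-up concern: a child blocked on a
dead NFS server may not go away when killed, so one stuck process can
remain per run.  Using a cron period well above the timeout (rather
than 1s) should at least limit how fast they accumulate.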

> I'd be interested in hearing what leads to a black hole state. I
> started a wiki page
> (https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=BlackHoleConditions)
> for people to document these conditions as they find them.

GPUs failing is a problem for us, usually from some underlying
hardware issue.  Sometimes the job will not start immediately on the
GPUs; more commonly, the GPU has a bad piece of memory, which means
the job will run for some time (possibly less than 1 min) and then
quit, which effectively creates a black hole for us.

> For those unaware of the blackhole policy referred to above, see:
> https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=AvoidingBlackHoles