
Re: [Condor-users] Black hole node



On 1/23/06, Terrence Martin <tmartin@xxxxxxxxxxxxxxxx> wrote:
> Is there a way in condor to tell the system to not send any more jobs to
> a node if that node is acting as a blackhole for jobs? For example a
> node is allowing the jobs to start, then some problem with the node
> immediately kills the job and the node goes back to saying it can take
> more.

In a controlled farm, I've found that external monitoring combined
with decent hardware-checking software works very well. I know that is
probably not the best advice for people who are cycle stealing.

An external monitor which spots hard disk/memory failures and switches
the node into a state where it won't kill the existing job but does
prevent new ones from starting catches most nasties.

It is possible to spot a machine which is running a higher than
usual proportion of jobs (for example, it will spend a lot more time
in Claimed/Idle than in Claimed/Busy), but applying such heuristics
to take automatic action can be dangerous. Of course, a simple report
on its own isn't likely to help much either.
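
As a rough illustration of the kind of heuristic I mean, here is a
minimal sketch that only reports suspects and takes no action. It
assumes you can already extract (machine, runtime in seconds) pairs
for completed jobs from your own logs; the function name, the record
format and all the thresholds are made up for the example and would
need tuning for a real pool.

```python
from collections import defaultdict

def suspect_machines(job_records, short_secs=60, min_jobs=20,
                     max_short_fraction=0.8):
    """Return {machine: fraction} for machines whose jobs mostly end very fast.

    job_records: iterable of (machine_name, runtime_seconds) pairs.
    All thresholds are illustrative assumptions, not recommended values.
    """
    totals = defaultdict(int)   # completed jobs seen per machine
    shorts = defaultdict(int)   # jobs that ended in under short_secs
    for machine, runtime in job_records:
        totals[machine] += 1
        if runtime < short_secs:
            shorts[machine] += 1

    flagged = {}
    for machine, total in totals.items():
        if total < min_jobs:
            continue  # too few jobs to say anything about this machine
        fraction = shorts[machine] / total
        if fraction >= max_short_fraction:
            flagged[machine] = fraction
    return flagged
```

The min_jobs cut-off is there so that a machine is not flagged on the
strength of two or three short jobs, which is exactly the sort of
false positive that makes automatic action risky.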

The most likely case is one where a single user's claim executes
many jobs unproductively. If the failure corresponds to a real exit
code, you can try having your users spot it and raise an alert in
some way.

None of these are terribly useful on their own.

If you have some spare capacity or can handle the throughput loss, you
could submit 'canary' jobs whose only purpose is to fail on machines
in a bad state (say, by rapidly trying to read/write all memory) or
to execute some required installed app/framework.
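
A canary along those lines could be as small as the sketch below. It
is only an illustration: the buffer size, the bit patterns, the exit
codes and the function names are all my own assumptions, and it checks
a fixed-size buffer rather than all memory. What you do with a failed
canary (alerting, draining the node) is left to whatever monitoring
you already have.

```python
import shutil

def memory_ok(size_bytes=16 * 1024 * 1024, patterns=(0xA5, 0x5A)):
    """Fill a buffer with each byte pattern and verify it reads back intact."""
    for pattern in patterns:
        buf = bytearray([pattern]) * size_bytes   # write the pattern
        if buf.count(pattern) != size_bytes:      # read it all back
            return False
    return True

def app_ok(name):
    """True if the required executable can be found on PATH."""
    return shutil.which(name) is not None

def main(required_apps=()):
    """Return 0 if the node looks healthy, non-zero otherwise."""
    if not memory_ok():
        return 1          # hypothetical exit code: memory check failed
    for app in required_apps:
        if not app_ok(app):
            return 2      # hypothetical exit code: required app missing
    return 0

# As a job script, the entry point would be something like:
#   import sys; sys.exit(main(sys.argv[1:]))
```

The distinct non-zero exit codes are there so that whoever reads the
results can tell a memory problem from a missing install without
logging in to the node.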

Matt