
Re: [Condor-users] Black hole node



Hmm, the preferable solution would be for the central manager to flag a node that has cycled through, say, 10 jobs in the last 120 seconds and mark it as bad. I was hoping that Condor had some functionality to deal with this situation; it seems to me that the central manager is the natural place to put such a component. As for submitting superfluous jobs, that might be a workaround, since a black hole would suck up a test job just as it does good jobs. That still leaves you vulnerable if your test jobs run only after a hundred or more good jobs have been sucked past the event horizon, never to return. I suppose it is better than nothing.
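
To make the "10 jobs in the last 120 seconds" idea concrete, here is a minimal sketch of a sliding-window check. It is not Condor functionality: the class name BlackHoleDetector and the method record_job_start are hypothetical, and how job-start events would be collected from the pool is left as an assumption.

```python
from collections import defaultdict, deque


class BlackHoleDetector:
    """Flags a node that starts too many jobs within a short window.

    Hypothetical helper for illustration only; not part of Condor.
    """

    def __init__(self, max_jobs=10, window_seconds=120.0):
        self.max_jobs = max_jobs
        self.window_seconds = window_seconds
        self._starts = defaultdict(deque)  # node name -> recent start times
        self.flagged = set()

    def record_job_start(self, node, timestamp):
        """Record one job start; return True if the node is flagged as bad."""
        starts = self._starts[node]
        starts.append(timestamp)
        # Forget job starts that fall outside the window.
        while starts and timestamp - starts[0] > self.window_seconds:
            starts.popleft()
        if len(starts) >= self.max_jobs:
            self.flagged.add(node)
        return node in self.flagged
```

Once flagged, a node stays flagged in this sketch until someone clears it by hand, which seems safer than letting a bad node drift back into service on its own.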

Terrence


Matt Hope wrote:
On 1/23/06, Terrence Martin <tmartin@xxxxxxxxxxxxxxxx> wrote:
Is there a way in Condor to tell the system not to send any more jobs to
a node if that node is acting as a black hole for jobs? For example, a
node allows jobs to start, then some problem with the node
immediately kills the job, and the node goes back to saying it can take
more.

I've found that external monitoring combined with decent hardware
checking software in a controlled farm works very well. Perhaps not
the best advice for people doing cycle stealing, I know.

An external monitor which spots hard disk/memory failures and switches
the node into a state where it won't kill the existing job but does
prevent new ones from starting catches most nasties.

It is possible to spot a machine that is running a higher than
usual proportion of jobs (it will spend a lot more time in
Claimed/Idle than Claimed/Busy, for example), but applying such
heuristics to take automatic action can be dangerous. Of course, a
simple report isn't likely to help.
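
A minimal sketch of the Claimed/Idle versus Claimed/Busy heuristic, for reporting only. It assumes the per-node time spent in each activity has already been gathered somehow; the function names and the 0.5 threshold are hypothetical and would need tuning for a real pool.

```python
def claimed_idle_fraction(idle_seconds, busy_seconds):
    """Fraction of claimed time that a node spent idle rather than busy."""
    total = idle_seconds + busy_seconds
    if total <= 0:
        return 0.0  # no claimed time observed, so nothing to judge
    return idle_seconds / total


def suspicious_nodes(samples, threshold=0.5):
    """Return the sorted names of nodes whose idle fraction exceeds threshold.

    samples maps node name -> (claimed_idle_seconds, claimed_busy_seconds).
    The threshold is an arbitrary illustrative value.
    """
    return sorted(
        node
        for node, (idle, busy) in samples.items()
        if claimed_idle_fraction(idle, busy) > threshold
    )
```

This only produces a list for a human to look at; it deliberately takes no automatic action.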

The most likely case is one where a single user's claim then executes
many jobs (unproductively). If this corresponds to a real exit code,
you can try having your users spot this and alert you in some way.

None of these are terribly useful on their own.

If you have some spare capacity or can handle the throughput loss, you
could submit 'canary' jobs whose only purpose is to fail on machines
in a bad state (say, by rapidly trying to read/write all memory or by
executing some required installed app/framework).
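
A minimal sketch of what such a canary job might do, assuming it is written in Python. The function names, buffer size and exit codes are hypothetical, and a user-level write/read pass like this is only a crude check, not a substitute for real hardware diagnostics.

```python
import os


def memory_check(size_bytes=64 * 1024 * 1024, chunk=1024 * 1024):
    """Write two bit patterns across a buffer and verify they read back."""
    buf = bytearray(size_bytes)
    for pattern in (0x55, 0xAA):
        block = bytes([pattern]) * chunk
        # Write pass: fill the whole buffer with the pattern.
        for off in range(0, size_bytes, chunk):
            n = min(chunk, size_bytes - off)
            buf[off:off + n] = block[:n]
        # Read pass: confirm every chunk still holds the pattern.
        for off in range(0, size_bytes, chunk):
            n = min(chunk, size_bytes - off)
            if buf[off:off + n] != block[:n]:
                return False
    return True


def required_app_present(path):
    """True if path is an existing, executable regular file."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def run_canary(required_paths=(), size_bytes=64 * 1024 * 1024):
    """Return 0 if the node looks healthy, nonzero otherwise."""
    if not memory_check(size_bytes):
        return 1  # memory pattern mismatch
    for path in required_paths:
        if not required_app_present(path):
            return 2  # a required application is missing
    return 0
```

A wrapper script would pass the result of run_canary to sys.exit so that the job's exit code shows up as a failure on a bad machine.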

Matt

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users