Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [Condor-users] Black hole node

Date: Mon, 23 Jan 2006 13:24:14 -0800
From: Terrence Martin <tmartin@xxxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] Black hole node

Hmm, the preferable solution would be if the central manager could flag nodes that have cycled through, say, 10 jobs in the last 120 seconds and mark those nodes as bad. I was hoping that Condor perhaps had some functionality to deal with this situation. It seems to me that the central manager is the natural place to put such a component. As for submitting superfluous jobs, that might be a workaround, since a black hole would suck up a test job as well as good jobs. That still leaves you vulnerable if your test jobs run only after a hundred or more good jobs have been sucked past the event horizon, never to return. I suppose it is better than nothing.
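
A minimal sketch of the rate-based check described above, written as standalone Python rather than anything Condor itself provides. The class name `BlackHoleDetector`, its methods, and the way job-start events reach it are all assumptions for illustration; only the thresholds (10 jobs, 120 seconds) come from the message.

```python
from collections import defaultdict, deque


class BlackHoleDetector:
    """Flags a node that starts too many jobs within a short time window.

    Hypothetical helper: feeding it job-start events (for example from
    log parsing) is left to the caller.
    """

    def __init__(self, max_starts=10, window_seconds=120.0):
        self.max_starts = max_starts
        self.window_seconds = window_seconds
        self._starts = defaultdict(deque)  # node name -> recent start times
        self.flagged = set()

    def record_start(self, node, timestamp):
        """Record a job start on `node`; return True if the node is flagged."""
        starts = self._starts[node]
        starts.append(timestamp)
        # Discard starts that have fallen out of the sliding window.
        while starts and timestamp - starts[0] > self.window_seconds:
            starts.popleft()
        if len(starts) >= self.max_starts:
            self.flagged.add(node)
        return node in self.flagged
```

A node stays flagged once it trips the threshold, so clearing it would be a deliberate manual step; that choice is also an assumption, made because a black hole node usually needs someone to look at it.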


Terrence


Matt Hope wrote:

On 1/23/06, Terrence Martin <tmartin@xxxxxxxxxxxxxxxx> wrote:

Is there a way in condor to tell the system to not send any more jobs to
a node if that node is acting as a blackhole for jobs? For example a
node is allowing the jobs to start, then some problem with the node
immediately kills the job and the node goes back to saying it can take
more.


I've found that external monitoring combined with decent hardware
checking software in a controlled farm works very well. Perhaps not
the best advice for people cycle stealing I know.

An external monitor which spots hard disk/memory failures and switches
the node into a state where it won't kill the existing job but does
prevent new ones from starting catches most nasties.

It is possible to spot a machine which is running a higher than
usual proportion of jobs (it will spend a lot more time in
Claimed/Idle than Claimed/Busy, for example), but applying such
heuristics to take automatic action can be dangerous. Of course a
simple report isn't likely to help.
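
As a rough illustration of that heuristic, the sketch below computes the share of claimed time each machine spent idle and only reports the outliers. The input format (a dict of `(activity, seconds)` samples per node), the function names, and the 0.5 threshold are assumptions; collecting the samples is not shown.

```python
def claimed_idle_fraction(samples):
    """Return the fraction of claimed time spent idle.

    `samples` is an iterable of (activity, seconds) pairs, where activity
    is "Idle" or "Busy" while the machine is claimed (hypothetical format).
    """
    idle = sum(seconds for activity, seconds in samples if activity == "Idle")
    busy = sum(seconds for activity, seconds in samples if activity == "Busy")
    total = idle + busy
    return idle / total if total > 0 else 0.0


def suspicious_nodes(per_node_samples, threshold=0.5):
    """List nodes whose claimed-idle fraction exceeds `threshold`."""
    return sorted(
        node
        for node, samples in per_node_samples.items()
        if claimed_idle_fraction(samples) > threshold
    )
```

The function returns names rather than taking action, in keeping with the caution above about acting automatically on such heuristics.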

The most likely case is where a single user's claim then executes
many jobs (unproductively). If this corresponds to a real exit code
you can try having your users spot this and alert you in some way.

None of these are terribly useful on their own.

If you have some spare capacity or can handle the throughput loss you
could submit 'canary' jobs whose only purpose is to fail on machines
in a bad state (say by rapidly trying to read/write all memory) or
to execute some required installed app/framework.
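
A canary job along those lines might look like the following sketch. It is deliberately simple: it writes and re-reads one buffer (not all memory, and not a substitute for a real memory tester) and checks that required programs are on the PATH. The function names, buffer size, and exit codes are assumptions for illustration.

```python
import shutil


def memory_check(size_bytes=16 * 1024 * 1024, pattern=0xA5):
    """Fill a buffer with a byte pattern and verify it reads back intact."""
    buf = bytearray([pattern]) * size_bytes
    return buf.count(pattern) == size_bytes


def missing_tools(tools):
    """Return the required programs that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def run_canary(tools=(), size_bytes=16 * 1024 * 1024):
    """Return 0 if the machine looks usable, non-zero otherwise."""
    if not memory_check(size_bytes):
        return 1  # memory read/write mismatch
    if missing_tools(tools):
        return 2  # a required application is not installed
    return 0
```

Wrapped in a script that passes the result of `run_canary(...)` to `sys.exit`, the job's exit code would tell you which machines failed the check.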

Matt

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

Follow-Ups:
- Re: [Condor-users] Black hole node
  - From: Alain Roy

References:
- [Condor-users] Black hole node
  - From: Terrence Martin
- Re: [Condor-users] Black hole node
  - From: Matt Hope
