
Re: [Condor-users] Avoid failing nodes? (automatically?)



Steffen--there are a couple of ways to accomplish what you want.
The crudest is simply to put the startd's Condor logs on the same
partition where the execute directory lives.  That way, if the disk
the jobs run on fails or fills up, the startd itself will crash and
the node will drop out of the pool.
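
A minimal sketch of that layout, assuming the execute node's
condor_config and an example scratch mount point (adjust the paths
to wherever your execute directory really is):

    # Keep the startd's log on the same scratch partition as EXECUTE,
    # so a failed or full disk takes the startd down with it.
    EXECUTE = /scratch/condor/execute
    LOG     = /scratch/condor/log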

There are also Hawkeye scripts with which you can add disk health
and free disk space to each machine's ClassAd, and then put
requirements on your jobs that both look good.
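
Something along these lines should do it, using the startd cron
(Hawkeye) mechanism -- the exact knob names differ between Condor
versions, and check_disk here is a hypothetical script you would
have to write yourself:

    # condor_config: run a small probe every 5 minutes and merge its
    # output into the machine ClassAd
    STARTD_CRON_JOBLIST = DISKCHECK
    STARTD_CRON_DISKCHECK_EXECUTABLE = /usr/local/libexec/check_disk
    STARTD_CRON_DISKCHECK_PERIOD = 300

    # check_disk prints ClassAd attributes on stdout, e.g.
    #   DiskHealthy = True
    #   ScratchFreeKB = 12345678

and in the submit file:

    # only match machines that advertise a healthy scratch disk
    requirements = (DiskHealthy =?= True)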

Finally, there is a field you can check in the machine ClassAd that
shows how many jobs the node has started recently, but I forget
what it is called.
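
If you want to hunt for it, something like this would show the
candidates; RecentJobStarts below is only my guess at the name, not
necessarily the real attribute:

    # dump a machine's ClassAd and look for job-start counters
    condor_status -long somenode.example.com | grep -i start

    # once you know the attribute, you can select on it, e.g.
    condor_status -constraint 'RecentJobStarts > 20'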

Steve


On Fri, 30 Nov 2007, Steffen Grunewald wrote:

Good morning,

every now and then, in a pool that's quite old, I see disk problems
that result in filesystems being remounted read-only.
Such a node will happily accept Condor jobs, fail to run them, and
be re-negotiated for the next one (from the same user, due to the
still-active claim).
It acts like a black hole, eating all jobs in no time.
Is there a way to avoid this situation (other than monitoring all
the nodes continuously, which may be impossible locally - a monitor
script may not be able to run any more because of the disk failure -
and would impose extra network load if done remotely)? Could the
rate at which jobs are negotiated to an individual node be limited?
Or could a "learning" process on the negotiator side notice that
this node no longer produces successful job terminations?

Cheers,
Steffen



--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.