
[Condor-users] Avoid failing nodes? (automatically?)



Good morning,

Every now and then, in a pool that is quite old, I see disk problems
resulting in filesystems being remounted read-only.
Such a node will happily accept Condor jobs, fail to run them, and
immediately be matched with the next one (from the same user, because
the claim is still active).
It acts like a black hole, eating all queued jobs in no time.
Is there a way to avoid this situation, short of monitoring all the nodes
continuously? Local monitoring may be impossible (a monitor script may no
longer run once the disk has failed), and remote monitoring would impose
extra network load. Could the rate of jobs negotiated to an individual
node be limited? Or is there a "learning" mechanism on the negotiator
side that notices a node no longer produces successful job terminations?
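What I have in mind is something like a local health probe published via
the startd, so that matching can exclude the node. A rough sketch (the
script path and the HealthOK attribute name are made up; the STARTD_CRON
and CLAIM_WORKLIFE knobs are the ones documented in the Condor manual):

```
## condor_config on the execute node: run a periodic health probe
STARTD_CRON_JOBLIST = HEALTH
STARTD_CRON_HEALTH_EXECUTABLE = /usr/local/sbin/condor_health_probe
STARTD_CRON_HEALTH_PERIOD = 5m

## Only accept jobs while the probe reports a healthy disk
START = $(START) && (HealthOK =?= True)

## Bound how long one claim keeps pulling jobs from the same user
CLAIM_WORKLIFE = 3600
```

and the probe itself, roughly:

```
#!/bin/sh
# Try to write to the execute directory; a read-only remount makes this fail.
probe=/var/condor/execute/.health_probe
if touch "$probe" 2>/dev/null; then
    rm -f "$probe"
    echo "HealthOK = True"
else
    echo "HealthOK = False"
fi
```

Of course this only works as long as the startd can still fork the probe
at all, which is exactly the failure mode I am worried about.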

Cheers,
 Steffen

-- 
Steffen Grunewald * MPI Grav.Phys.(AEI) * Am Mühlenberg 1, D-14476 Potsdam
Cluster Admin * http://pandora.aei.mpg.de/merlin/ * http://www.aei.mpg.de/
* e-mail: steffen.grunewald(*)aei.mpg.de * +49-331-567-{fon:7233,fax:7298}
No Word/PPT mails - http://www.gnu.org/philosophy/no-word-attachments.html