
[Condor-users] Avoid failing nodes? (automatically?)



Good morning,

Every now and then, in a pool that is quite old, I see disk problems
resulting in filesystems being remounted read-only.
Such a node will happily accept Condor jobs, fail to run them, and
immediately be matched with the next one (from the same user, because
the claim is still active).
It acts like a black hole, eating all queued jobs in no time.
Is there a way to avoid this situation, short of monitoring all the nodes
continuously? Local monitoring may be impossible (a monitor script may no
longer run once the disk has failed), and remote monitoring would impose
extra network load. Could the rate of jobs negotiated to an individual
node be limited? Or is there a "learning" mechanism on the negotiator
side that notices a node no longer produces successful job terminations?
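What I have in mind is something like a local health probe published via
the startd, so that matching can exclude the node. A rough sketch (the
script path and the HealthOK attribute name are made up; the STARTD_CRON
and CLAIM_WORKLIFE knobs are the ones documented in the Condor manual):

```
## condor_config on the execute node: run a periodic health probe
STARTD_CRON_JOBLIST = HEALTH
STARTD_CRON_HEALTH_EXECUTABLE = /usr/local/sbin/condor_health_probe
STARTD_CRON_HEALTH_PERIOD = 5m

## Only accept jobs while the probe reports a healthy disk
START = $(START) && (HealthOK =?= True)

## Bound how long one claim keeps pulling jobs from the same user
CLAIM_WORKLIFE = 3600
```

and the probe itself, roughly:

```
#!/bin/sh
# Try to write to the execute directory; a read-only remount makes this fail.
probe=/var/condor/execute/.health_probe
if touch "$probe" 2>/dev/null; then
    rm -f "$probe"
    echo "HealthOK = True"
else
    echo "HealthOK = False"
fi
```

Of course this only works as long as the startd can still fork the probe
at all, which is exactly the failure mode I am worried about.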

Cheers,
 Steffen

-- 
Steffen Grunewald * MPI Grav.Phys.(AEI) * Am Mühlenberg 1, D-14476 Potsdam
Cluster Admin * http://pandora.aei.mpg.de/merlin/ * http://www.aei.mpg.de/
* e-mail: steffen.grunewald(*)aei.mpg.de * +49-331-567-{fon:7233,fax:7298}
No Word/PPT mails - http://www.gnu.org/philosophy/no-word-attachments.html