
Re: [Condor-users] Avoid failing nodes? (automatically?)



condor-users-bounces@xxxxxxxxxxx wrote on 11/30/2007 08:16:50 AM:

> Good morning,
> 
> every now and then, in a pool that's quite old, I see disk problems 
> resulting in filesystems remounted read-only. 
> Such a node will happily accept Condor jobs, fail running them, and
> be re-negotiated for another one (from the same user, due to still
> active claims).
> This is like a black hole, eating all jobs in no time.
> Is there a way to avoid such a situation (except monitoring all the nodes
> continuously, which may be impossible locally - when a monitor script
> cannot run anymore because of the disk failure - and would impose extra
> network load if done remotely)? Limit the rate of jobs being negotiated
> to an individual node? A "learning" process on the negotiator side which
> "sees" that this node doesn't produce successful job terminations
> anymore?

Maybe match_list_length and LastMatchName0 in the job requirements are what 
you need (see the condor_submit documentation). There is also an example in 
section 5.3.7.3 of the manual (that section is about Grid match-making, but 
the same mechanism should work for normal jobs, if I understand correctly).
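
A minimal sketch of what such a submit description file might look like. 
This is untested against your pool; the executable name is only a 
placeholder, and the use of "=!=" (so that the expression is still true 
while LastMatchName0 is undefined, i.e. before the first match) is my 
assumption about how to write the requirement safely:

```
# Hypothetical job; replace executable with your own program.
universe          = vanilla
executable        = my_job.sh

# Ask Condor to record the name of the most recently matched machine
# in the job ad as LastMatchName0.
match_list_length = 1

# Do not match the machine this job was matched to last time.
# "=!=" is used so the test is TRUE while LastMatchName0 is undefined.
requirements      = (TARGET.Name =!= LastMatchName0)

queue
```

As far as I can tell, the match history is kept per job, so this would 
only stop the same job from going straight back to the broken node; other 
jobs could still land there. A larger match_list_length, with 
LastMatchName1 and so on added to the expression, would exclude more of 
the recently tried machines.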

Regards,
Jan Ploski