Re: [Condor-users] Avoid failing nodes? (automatically?)



On Nov 30, 2007 7:16 AM, Steffen Grunewald <steffen.grunewald@xxxxxxxxxx> wrote:
> Good morning,

> This is like a black hole, eating all jobs in no time.

Can you trigger a script to run at remount? That would be the cleanest
solution, since it can check the file system and then 1) stop Condor
and complain, or 2) alter the config so that jobs relying on that
file system are no longer accepted.
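
A minimal sketch of such a remount check, covering option 1 only. The
mount point, the sentinel file name, and the use of a bare condor_off
as the stop command are assumptions for illustration, not anything
from Steffen's setup; how the script gets triggered at remount is left
to the local init or automount tooling. Note that a hung NFS mount
could make the open() call block, so a real hook may need a timeout.

```python
import os
import subprocess
import sys

# Hypothetical defaults; adjust to the local setup.
MOUNT_POINT = "/data"
SENTINEL = ".mounted"          # a file known to exist on the healthy file system
STOP_COMMAND = ["condor_off"]  # assumption: stops the Condor daemons on this host


def filesystem_healthy(mount_point, sentinel, require_mount=True):
    """Return True if mount_point is mounted and the sentinel file is readable."""
    if require_mount and not os.path.ismount(mount_point):
        return False
    try:
        with open(os.path.join(mount_point, sentinel), "rb") as handle:
            handle.read(1)
    except OSError:
        return False
    return True


def check_and_react(mount_point, sentinel, require_mount=True, run=subprocess.call):
    """Stop Condor and complain if the check fails. Returns True if healthy."""
    if filesystem_healthy(mount_point, sentinel, require_mount):
        return True
    print(f"{mount_point} failed its health check; stopping Condor", file=sys.stderr)
    run(STOP_COMMAND)
    return False


if __name__ == "__main__":
    sys.exit(0 if check_and_react(MOUNT_POINT, SENTINEL) else 1)
```

The run parameter is only there so the reaction can be swapped out,
for example for something that rewrites the local config and
reconfigures Condor instead of stopping it (option 2).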

In most of the cases I've seen, targeting the specific problem is best,
since the general 'trap' case may well suffer from too many false
positives.

That said, SMP boxes where one job manages to screw the other nodes on
the machine are a nightmare to deal with, and something that lets you
easily say "no more of me on this machine for a while" would be lovely.

Matt