
Re: [Condor-users] Black hole node



Having counters for specific error codes could help with this. If you could define error codes that Condor watches and logs at
each job's termination, simple expressions could then trigger actions, for example: "if 80% of the last 10 jobs terminated with
code 1, go offline / run a self-test-and-fix-problems script / send mail to the admin".
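
A rough sketch of the kind of counter I have in mind, in Python. This is not an existing Condor feature: the class name
ExitCodeMonitor, its parameters, and the default values (window of 10, threshold of 80%, code 1) are all made up for
illustration.

```python
from collections import deque


class ExitCodeMonitor:
    """Track the most recent job exit codes seen on one node.

    Hypothetical helper for illustration only; not part of Condor.
    """

    def __init__(self, window=10, threshold=0.8, bad_code=1):
        # deque with maxlen drops the oldest entry automatically
        self.codes = deque(maxlen=window)
        self.threshold = threshold
        self.bad_code = bad_code

    def record(self, exit_code):
        """Store the exit code of a job that just terminated."""
        self.codes.append(exit_code)

    def should_act(self):
        """True once the window is full and enough jobs ended with bad_code."""
        if len(self.codes) < self.codes.maxlen:
            return False  # not enough history yet to judge the node
        bad = sum(1 for code in self.codes if code == self.bad_code)
        return bad / len(self.codes) >= self.threshold
```

Whatever calls should_act() would then decide what to do: take the node offline, run the self-test script, or mail the
admin. Waiting for a full window avoids flagging a node after a single failed job.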

If we had this option, wrapper scripts could certainly pass the required information by using standard and custom error codes.
Just a quick idea, though.
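
To make the wrapper part concrete, here is a minimal Python sketch, under the assumption that the job is started through a
wrapper that can check the machine first. The function name run_wrapped, the required_tools parameter, and the exit code
values 101 and 102 are my own inventions, not Condor conventions.

```python
import shutil
import subprocess

# Hypothetical custom exit codes; the values are assumptions, not Condor conventions.
EXIT_NODE_PROBLEM = 101  # the machine itself looks broken
EXIT_JOB_SIGNALLED = 102  # the job was killed by a signal (e.g. a segfault)


def run_wrapped(cmd, required_tools=()):
    """Run cmd and return an exit code that separates node faults from job faults."""
    # Pre-flight check: a missing tool points at the node, not at the user's job.
    for tool in required_tools:
        if shutil.which(tool) is None:
            return EXIT_NODE_PROBLEM

    result = subprocess.run(cmd)

    # On POSIX, a negative returncode means the child was terminated by a signal.
    if result.returncode < 0:
        return EXIT_JOB_SIGNALLED

    # Otherwise pass the job's own exit code through unchanged.
    return result.returncode
```

A wrapper script would end with something like sys.exit(run_wrapped(sys.argv[1:])), so that whatever watches the exit
codes sees 101 only when the node failed the pre-flight check.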

Cheers,
Szabolcs





*********** REPLY SEPARATOR  ***********

On 1/23/2006 at 4:08 PM Alain Roy wrote:

>>Hmm, the preferable solution would be if the central manager could flag
>>nodes that have cycled through, say, 10 jobs in the last 120 seconds and
>>mark that node as bad. I was hoping that condor perhaps had some
>>functionality to deal with this situation.
>
>The problem is that it's very hard to do this in general. For instance:
>
>   * Although Condor isn't optimized for short-running jobs,
>     it's not unusual for users to submit them.
>
>   * Negotiation cycles are often long enough that a scheme like
>     you describe won't happen even if there is a black hole.
>
>   * There are lots of kinds of black holes: machines that cause segfaults (how
>     do you distinguish from a user job that just segfaults?),
>     machines that cause jobs to run slowly (how do you distinguish
>     from slow jobs?), and machines that cause jobs to exit quickly.
>
>I agree that it's nice to have such a black hole system, but it's 
>definitely a challenge.
>
>-alain