Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [Condor-users] Black hole node

Date: Tue, 24 Jan 2006 11:11:50 +0100
From: "Horvatth Szabolcs" <szabolcs@xxxxxxxxxxxxx>
Subject: Re: [Condor-users] Black hole node

Having counters for specific error codes could help with this. If you could define error codes that Condor watches and logs at
each job's termination, simple expressions would become possible, such as "if 80% of the last 10 jobs terminated with code 1, go offline / run a
self-test-and-fix-problems script / send mail to the admin".

If we had this option, wrapper scripts could certainly pass the required information by using standard and custom error codes.
Just a quick idea, though.
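
To make the "80% of the last 10 jobs" rule concrete, here is a rough Python sketch of such a counter. It is only an illustration of the idea, not anything Condor provides: the class name ExitCodeWindow, its parameters, and the sample exit codes are all made up, and hooking the result up to "go offline" or "send mail" is left out.

```python
from collections import deque


class ExitCodeWindow:
    """Hypothetical helper: remembers the exit codes of the most recent jobs on one node."""

    def __init__(self, size=10, bad_code=1, threshold=0.8):
        self.size = size            # how many recent jobs to look at
        self.bad_code = bad_code    # exit code that counts as a failure
        self.threshold = threshold  # fraction of failures that triggers action
        self.codes = deque(maxlen=size)  # oldest entries drop off automatically

    def record(self, exit_code):
        """Store the exit code of a job that just terminated."""
        self.codes.append(exit_code)

    def should_act(self):
        """Return True once the window is full and enough jobs failed."""
        if len(self.codes) < self.size:
            return False  # too little history to judge the node
        bad = sum(1 for code in self.codes if code == self.bad_code)
        return bad / self.size >= self.threshold


# Made-up history for one node: 8 of the last 10 jobs exited with code 1.
window = ExitCodeWindow()
for code in [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]:
    window.record(code)

if window.should_act():
    print("node looks bad: go offline / run self-test / mail the admin")
```

Waiting for a full window before deciding is a deliberate choice in this sketch, so that a single early failure on a freshly started node does not trigger anything.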

Cheers,
Szabolcs

*********** REPLY SEPARATOR  ***********

On 1/23/2006 at 4:08 PM Alain Roy wrote:

>>Hmm, the preferable solution would be if the central manager could flag
>>nodes that have cycled through, say, 10 jobs in the last 120 seconds and
>>mark that node as bad. I was hoping that condor perhaps had some
>>functionality to deal with this situation.
>
>The problem is that it's very hard to do this in general. For instance:
>
>   * Although Condor isn't optimized for short-running jobs,
>     it's not unusual for users to submit them.
>
>   * Negotiation cycles are often long enough that a scheme like
>     you describe won't happen even if there is a black hole.
>
>   * There are lots of black holes: machines that cause segfaults (how
>     do you distinguish from a user job that just segfaults?),
>     machines that cause jobs to run slowly (how do you distinguish
>     from slow jobs?), and machines that cause jobs to exit quickly.
>
>I agree that it's nice to have such a black hole system, but it's 
>definitely a challenge.
>
>-alain
