
Re: [Condor-users] Black hole node



Hi,

When I had a similar problem, I wrote a wrapper around each program; to
submit to Condor, a job had to implement the wrapper's structure. All the
wrapper did was run the program inside a try/catch block. A job rarely
just disappears; it usually fails with some kind of exception. When an
exception occurred, the wrapper wrote it to the console (for example)
together with the computer's name, and Condor returned that output to the
submitter. There I ran a script that counted the failures on each machine
and automatically added "Machine != X" to every submit file whenever
machine X had more than Y failures of a certain type during a certain
period.
That is how I found the actual problem that turned those nodes into
"black holes": I could see what the failing machines had in common.

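A minimal sketch of that idea in Python, not the original wrapper. The
WRAPPER_FAILURE tag, the function names and the output format are all
hypothetical, and the caller is assumed to pass in only the output lines
from the period of interest. The generated line uses the submit file's
requirements expression to exclude machines by name.

```python
import socket
import subprocess
import sys
from collections import Counter

FAILURE_TAG = "WRAPPER_FAILURE"  # hypothetical marker, not a Condor feature


def run_wrapped(argv, out=sys.stdout):
    """Run argv; on failure, report the error type and this machine's name."""
    host = socket.gethostname()
    try:
        subprocess.run(argv, check=True)
        return 0
    except subprocess.CalledProcessError as exc:
        kind, code = f"exit_{exc.returncode}", exc.returncode
    except OSError as exc:  # e.g. the executable is missing on this node
        kind, code = type(exc).__name__, 127
    print(f"{FAILURE_TAG} host={host} type={kind}", file=out)
    # A negative return code means the child was killed by a signal.
    return code if code > 0 else 1


def count_failures(lines):
    """Count tagged failure lines per (host, type) pair."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or parts[0] != FAILURE_TAG:
            continue
        host = parts[1].partition("=")[2]
        kind = parts[2].partition("=")[2]
        counts[(host, kind)] += 1
    return counts


def excluded_hosts(counts, threshold):
    """Hosts with more than `threshold` failures of any single type."""
    return sorted({host for (host, _kind), n in counts.items() if n > threshold})


def requirements_line(hosts):
    """Build a submit-file line that keeps jobs off the given hosts."""
    if not hosts:
        return ""
    clauses = " && ".join(f'(Machine != "{h}")' for h in hosts)
    return f"requirements = {clauses}"
```

On the execute side the job would call run_wrapped() with the real
command line; on the submit side the returned output files are fed to
count_failures(), and the result of requirements_line() is appended to
each submit file.
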
Regards,

Anton Kucherov

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Terrence Martin
Sent: Tuesday, January 24, 2006 1:50 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Black hole node

Alain Roy wrote:
>    * There are lots of black holes: machines that cause segfaults (how
>      do you distinguish from a user job that just segfaults?),
>      machines that cause jobs to run slowly (how do you distinguish
>      from slow jobs?), and machines that cause jobs to exit quickly.
>
> I agree that it's nice to have such a black hole system, but it's 
> definitely a challenge.
>
>   
I am wondering whether collecting information from my cluster might be a
good place to start, to see if black holes exhibit a pattern that differs
from, say, an ordinary failing job. For example, a black hole would be
user independent. A single user's jobs all disappearing in, say, 120s or
less would indicate a problem specific to that user, whereas a node that
gobbles up jobs irrespective of the user would flag much more strongly as
a black hole. If there is a distinctive pattern, then it might be easier
to devise a countermeasure.
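
The distinction above could be sketched roughly as follows in Python.
This is only an illustration: the record layout (user, machine, runtime
in seconds), the function name and the thresholds other than the 120s
figure are assumptions, and real data would have to be extracted from
the job history first.

```python
from collections import defaultdict


def classify_short_jobs(records, max_runtime=120, min_users=2, min_machines=2):
    """Split short-lived jobs into suspect machines and suspect users.

    records: iterable of (user, machine, runtime_seconds) tuples.
    A machine is suspect when short jobs from at least `min_users`
    different users ended there; a user is suspect when their short
    jobs are spread over at least `min_machines` different machines.
    """
    users_by_machine = defaultdict(set)
    machines_by_user = defaultdict(set)
    for user, machine, runtime in records:
        if runtime <= max_runtime:
            users_by_machine[machine].add(user)
            machines_by_user[user].add(machine)
    suspect_machines = sorted(
        m for m, users in users_by_machine.items() if len(users) >= min_users
    )
    suspect_users = sorted(
        u for u, machines in machines_by_user.items() if len(machines) >= min_machines
    )
    return suspect_machines, suspect_users
```

A node that shows up in the first list is eating jobs from several
users and is a black hole candidate; a user in the second list more
likely has a problem with their own job.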

Terrence


> -alain
>
>
>
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
>   
