
Re: [Condor-users] Black hole node



Alain Roy wrote:
   * There are lots of black holes: machines that cause segfaults (how
     do you distinguish from a user job that just segfaults?),
     machines that cause jobs to run slowly (how do you distinguish
     from slow jobs?), and machines that cause jobs to exit quickly.


I agree that a system for dealing with black holes would be nice to have, but it's definitely a challenge.

I am wondering whether collecting information from my cluster might be a good place to start, to see if black holes exhibit a pattern that differs from, say, an ordinary failing job. For example, a black hole would be user independent: a single user's jobs all disappearing in, say, 120s or less would indicate a problem specific to that user, whereas a node that gobbles up jobs irrespective of user would flag much more strongly as a black hole. If there is a distinctive pattern, then it might be easier to devise a countermeasure.
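
A rough sketch of that idea in Python, assuming the job history can be exported as (node, user, runtime in seconds) records. The record layout, the function name, and the min_users and min_fraction thresholds are made up for illustration and are not part of any Condor interface; only the 120s cutoff comes from the paragraph above.

```python
from collections import defaultdict

SHORT_RUNTIME_S = 120  # cutoff from the discussion above; tune for your workload


def suspect_black_holes(records, short_s=SHORT_RUNTIME_S,
                        min_users=2, min_fraction=0.8):
    """Return the set of nodes that look like black holes.

    records: iterable of (node, user, runtime_seconds) tuples (assumed layout).
    A node is flagged when at least min_fraction of its jobs ended within
    short_s seconds AND those short jobs came from at least min_users
    distinct users. Short jobs from a single user point at that user's
    jobs rather than at the node.
    """
    total = defaultdict(int)        # jobs seen per node
    short = defaultdict(int)        # short-lived jobs per node
    short_users = defaultdict(set)  # users whose jobs were short on that node

    for node, user, runtime in records:
        total[node] += 1
        if runtime <= short_s:
            short[node] += 1
            short_users[node].add(user)

    suspects = set()
    for node, count in total.items():
        mostly_short = short[node] / count >= min_fraction
        many_users = len(short_users[node]) >= min_users
        if mostly_short and many_users:
            suspects.add(node)
    return suspects
```

Requiring several distinct users is what separates the two cases: a node where only one user's jobs vanish quickly is left alone, while a node that eats everybody's jobs is flagged. It does nothing for the slow-job or segfault cases Alain mentions.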

Terrence


-alain


