
Re: [Condor-users] Black hole node




Hmm, the preferable solution would be if the central manager could flag
nodes that have cycled through, say, 10 jobs in the last 120 seconds and
mark those nodes as bad. I was hoping that Condor perhaps had some
functionality to deal with this situation.

The problem is that it's very hard to do this in general. For instance:

  * Although Condor isn't optimized for short-running jobs,
    it's not unusual for users to submit them.

  * Negotiation cycles are often long enough that a node never
    receives that many jobs in so short a window, so a scheme like
    the one you describe won't trigger even if there is a black hole.

  * There are lots of kinds of black holes: machines that cause
    segfaults (how do you distinguish them from a user job that just
    segfaults?), machines that cause jobs to run slowly (how do you
    distinguish them from slow jobs?), and machines that cause jobs
    to exit quickly.
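
To make the difficulty concrete, here is a minimal sketch of the kind of sliding-window check suggested above. It is not anything Condor provides: the class name, the method, and the thresholds (10 jobs, 120 seconds) are hypothetical and are taken only from the numbers in the suggestion.

```python
from collections import defaultdict, deque

# Hypothetical thresholds from the suggestion above; these are not Condor settings.
MAX_JOBS = 10
WINDOW_SECONDS = 120


class BlackHoleDetector:
    """Flags a node once max_jobs jobs have finished on it within the window."""

    def __init__(self, max_jobs=MAX_JOBS, window=WINDOW_SECONDS):
        self.max_jobs = max_jobs
        self.window = window
        # Node name -> timestamps (in seconds) of recent job completions.
        self.completions = defaultdict(deque)

    def record_completion(self, node, timestamp):
        """Record a job finishing on `node`; return True if the node looks suspect."""
        times = self.completions[node]
        times.append(timestamp)
        # Drop completions that have fallen out of the sliding window.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        return len(times) >= self.max_jobs


# A healthy node running legitimate 5-second jobs is flagged on its tenth job.
detector = BlackHoleDetector()
short_jobs = [detector.record_completion("node1", t) for t in range(5, 55, 5)]

# A node that finishes one job per minute is never flagged, healthy or not.
slow_jobs = [detector.record_completion("node2", t) for t in range(60, 1260, 60)]
```

The first case is a false positive caused by short-running jobs, and the second shows that a node receiving jobs only once per long negotiation cycle never trips the threshold. Neither case tells you anything about segfaulting or slow machines.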

I agree that it would be nice to have such a black hole detection system, but it's definitely a challenge.

-alain