
Re: [Condor-users] condor fault tolerance



Paul Marshall wrote:
Hello,

I haven't been able to find any more up-to-date information on this issue:

https://www-auth.cs.wisc.edu/lists/condor-users/2007-March/msg00026.shtml

Could someone point me in the right direction? What is the best way to
decrease the time that it takes Condor to recognize a node has failed
and drop it from the system?

There's work going on to reverse the keepalive message direction:

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=671

You can experiment with that on your own by setting the following in the configuration of both your startds and your schedds:

STARTD_SENDS_ALIVES=true
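
For reference, here is a rough sketch of how that might look in a local config file. Only STARTD_SENDS_ALIVES comes from the ticket above. ALIVE_INTERVAL and MAX_CLAIM_ALIVES_MISSED are existing keepalive knobs that I am assuming you would want to tune alongside it, and the values shown are purely illustrative, not tested recommendations. Check the manual for your version's defaults and exact semantics before lowering them.

```
# Experimental: reverse the keepalive direction so the startd sends
# claim keepalives to the schedd (set on both startd and schedd hosts).
STARTD_SENDS_ALIVES = True

# Assumption: seconds between claim keepalive messages.
# Illustrative value only; smaller means faster detection but more traffic.
ALIVE_INTERVAL = 60

# Assumption: how many consecutive keepalives may be missed before
# the claim is considered dead. Illustrative value only.
MAX_CLAIM_ALIVES_MISSED = 3
```

After changing these, run condor_reconfig (or restart the daemons) on the affected machines so the new values are picked up.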

-- Lans Carstensen