[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] Two hour delay in detecting execute host failure

Hello Todd et al,

This problem is still affecting us, even with 7.2.0. We tested today
also and found that shadow takes up to 2 hours to detect an execute
node falling off the network. Is there a way to detect execute node
failure faster so that the job gets retried immediately on a different
execute node instead of after 2 hours?

This is the old thread on the same problem --