Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

[Condor-users] Two hour delay in detecting execute host failure

Date: Thu, 12 Feb 2009 23:58:39 +0530
From: Sateesh Potturu <sateeshpnv@xxxxxxxxx>
Subject: [Condor-users] Two hour delay in detecting execute host failure

Hello Todd et al,

This problem is still affecting us, even with 7.2.0. We tested again
today and found that the shadow takes up to two hours to detect an
execute node falling off the network. Is there a way to detect execute
node failure sooner, so that the job is retried promptly on a different
execute node instead of after two hours?

This is the old thread on the same problem:
https://lists.cs.wisc.edu/archive/condor-users/2007-March/msg00026.shtml

-- 
Regards,
Sateesh
