Re: [Condor-users] Antwort: Re: Fault Behaviour of Condor
- Date: Wed, 9 Aug 2006 16:31:35 +0100
- From: "Matt Hope" <matthew.hope@xxxxxxxxx>
- Subject: Re: [Condor-users] Antwort: Re: Fault Behaviour of Condor
On 8/8/06, Nomura Kohei <kh-nomura@xxxxxxxxx> wrote:
>> 3.) Shutting down the NIC on the executor
> I have done same as 3) on my condor pool.
> My condor pool consists of 3 windows machine with v6.8.0.
> The job has been successfully re-scheduled and run.
> See attached log file of the job,
> I have set JobLeaseDuration to 60 second.
> But it took 2 hours from shutting down the NIC to rescheduling.
> (After the job had been executed, I cut the NIC immediately.)
> Does JobLeaseDuration work effectively??
Assuming you mean
001 (290.000.000) 08/03 13:41:31 Job executing on host: <192.168.0.2:3817>
(about 60 secs later, pull the NIC)
022 (290.000.000) 08/03 15:41:36 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to vm1@xxxxxxxxxxxxxxxxxxx <192.168.0.2:3817>
024 (290.000.000) 08/03 15:41:37 Job reconnection failed
Then that looks a bit bad. The following is all off the top of my head, but I
think it is still valid for the latest versions...
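As a quick sanity check on the "2 hours" figure, the gap between the two log events quoted above can be computed directly. This is only a sketch: the job log omits the year, so 2006 is assumed from the date of this mail, and the helper name parse_event_time is made up for illustration.

```python
from datetime import datetime

# Format of the timestamps in the job log excerpt, with a year prepended.
# The log itself carries no year; 2006 is an assumption taken from the mail date.
FMT = "%Y %m/%d %H:%M:%S"


def parse_event_time(stamp, year=2006):
    """Parse a 'MM/DD HH:MM:SS' job log timestamp (hypothetical helper)."""
    return datetime.strptime(f"{year} {stamp}", FMT)


executing = parse_event_time("08/03 13:41:31")     # event 001, job executing
disconnected = parse_event_time("08/03 15:41:36")  # event 022, job disconnected

gap = disconnected - executing
print(gap)  # 2:00:05
```

With the NIC pulled roughly 60 seconds after the job started, the disconnect was noticed a little under two hours after the cable came out.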
POLLING_INTERVAL is for the startd, but ALIVE_INTERVAL is how often the
schedd sends keep-alives. I think the default is 5 mins. The setting in
MAX_SHADOW_EXCEPTIONS then determines how often an error can occur
before it gives up, so in theory MAX_SHADOW_EXCEPTIONS * ALIVE_INTERVAL
should determine how long a bad claim is perceived as being kept by the
schedd (if there is no lease logic allowing an extension).
Have you set these to numbers which would give you a 2 hour delay?
If not, this suggests it might be an issue.
What do the schedd and shadow logs indicate during this time?
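
To put rough numbers on the MAX_SHADOW_EXCEPTIONS * ALIVE_INTERVAL estimate above, here is a small arithmetic sketch. It only models the back-of-envelope product from this mail, not what the schedd actually does, and the 300 second ALIVE_INTERVAL is the 5 minute value recalled above, not a value checked against any configuration.

```python
import math


def estimated_claim_timeout(alive_interval_s, max_shadow_exceptions):
    """Back-of-envelope estimate from this mail: interval times retry count."""
    return alive_interval_s * max_shadow_exceptions


def exceptions_needed(observed_delay_s, alive_interval_s):
    """Smallest retry count whose estimate reaches the observed delay."""
    return math.ceil(observed_delay_s / alive_interval_s)


observed_delay_s = 2 * 60 * 60  # the roughly 2 hour delay reported
alive_interval_s = 5 * 60       # assumed: the 5 min figure recalled above

needed = exceptions_needed(observed_delay_s, alive_interval_s)
print(needed)                                             # 24
print(estimated_claim_timeout(alive_interval_s, needed))  # 7200
```

So with a 5 minute ALIVE_INTERVAL, MAX_SHADOW_EXCEPTIONS would have to be around 24 for this estimate alone to explain a 2 hour delay.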