Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Antwort: Re: Fault Behaviour of Condor

Date: Wed, 9 Aug 2006 16:31:35 +0100
From: "Matt Hope" <matthew.hope@xxxxxxxxx>
Subject: Re: [Condor-users] Antwort: Re: Fault Behaviour of Condor

On 8/8/06, Nomura Kohei <kh-nomura@xxxxxxxxx> wrote:

>> 3.) Shutting down the NIC on the executor

> I have done the same as 3) on my Condor pool.
> My Condor pool consists of 3 Windows machines running v6.8.0.
> The job was successfully rescheduled and run.
>
> See the attached log file of the job.
> I have set JobLeaseDuration to 60 seconds,
> but it took 2 hours from shutting down the NIC to rescheduling.
> (Once the job had started executing, I cut the NIC immediately.)
>
> Does JobLeaseDuration work effectively?


Assuming you mean

001 (290.000.000) 08/03 13:41:31 Job executing on host: <192.168.0.2:3817>


about 60 secs later pull the NIC

022 (290.000.000) 08/03 15:41:36 Job disconnected, attempting to reconnect
  Socket between submit and execute hosts closed unexpectedly
  Trying to reconnect to vm1@xxxxxxxxxxxxxxxxxxx <192.168.0.2:3817>
024 (290.000.000) 08/03 15:41:37 Job reconnection failed


Then that looks a bit bad. The following is all off the top of my head,
but I think it is still valid for the latest versions...

POLLING_INTERVAL is for the startd, but ALIVE_INTERVAL is how often the
schedd sends a keep-alive. I think the default is 5 minutes (the setting
is in seconds).
MAX_SHADOW_EXCEPTIONS then determines how many times an error can occur
before it gives up, so in theory MAX_SHADOW_EXCEPTIONS * ALIVE_INTERVAL
should determine how long the schedd goes on treating a bad claim as
held (assuming there is no lease logic allowing an extension).
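
To put rough numbers on that, here is a small sketch of the arithmetic. The two timestamps are copied from the log excerpt above. The 300 second ALIVE_INTERVAL is only the default I believe applies, and the MAX_SHADOW_EXCEPTIONS value of 5 is purely a placeholder, so substitute whatever your own configuration reports.

```python
from datetime import datetime

# Timestamps copied from the quoted job log. Both events are on 08/03,
# so only the time of day is compared.
FMT = "%H:%M:%S"
executing = datetime.strptime("13:41:31", FMT)      # event 001, job executing
disconnected = datetime.strptime("15:41:36", FMT)   # event 022, job disconnected
observed_s = int((disconnected - executing).total_seconds())


def expected_delay_s(alive_interval_s, max_shadow_exceptions):
    """Rule of thumb from this message: MAX_SHADOW_EXCEPTIONS * ALIVE_INTERVAL."""
    return alive_interval_s * max_shadow_exceptions


# ASSUMPTIONS: 300 s is the ALIVE_INTERVAL default as I remember it;
# 5 is a placeholder for MAX_SHADOW_EXCEPTIONS, not a value from your pool.
alive_interval_s = 300
max_shadow_exceptions = 5
estimate_s = expected_delay_s(alive_interval_s, max_shadow_exceptions)

print(f"observed gap in the log: {observed_s} s")
print(f"estimated timeout:       {estimate_s} s")
print(f"exceptions needed at this interval: {observed_s / alive_interval_s:.1f}")
```

With those placeholder values the estimate falls far short of the gap in the log, so the product would have to be set much higher than that to account for a 2 hour delay.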

Have you set these to numbers which would give you a 2 hour delay?

If not, this suggests it might be an issue.

What are the schedd and shadow logs indicating during this time?

Matt
