
Re: [Condor-users] Antwort: Re: Fault Behaviour of Condor



On 8/8/06, Nomura Kohei <kh-nomura@xxxxxxxxx> wrote:
>> 3.) Shutting down the NIC on the executor

> I have done the same as 3) on my Condor pool.
> My Condor pool consists of 3 Windows machines running v6.8.0.
> The job was successfully rescheduled and run.
>
> See the attached log file of the job.
> I have set JobLeaseDuration to 60 seconds,
> but it took 2 hours from shutting down the NIC to rescheduling.
> (I cut the NIC immediately after the job started executing.)
>
> Does JobLeaseDuration work effectively?

Assuming you mean this sequence:
001 (290.000.000) 08/03 13:41:31 Job executing on host: <192.168.0.2:3817>

then, about 60 seconds later, you pull the NIC, and the log shows:

022 (290.000.000) 08/03 15:41:36 Job disconnected, attempting to reconnect
  Socket between submit and execute hosts closed unexpectedly
  Trying to reconnect to vm1@xxxxxxxxxxxxxxxxxxx <192.168.0.2:3817>
024 (290.000.000) 08/03 15:41:37 Job reconnection failed
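
To put a number on that gap, here is a minimal Python sketch that pulls the timestamps out of the event lines quoted above and subtracts them. The regular expression and the year 2006 are my assumptions (the job log only records month and day); this is not a Condor tool, just a quick check.

```python
import re
from datetime import datetime

# Event lines copied from the job log quoted above.
LOG_LINES = [
    "001 (290.000.000) 08/03 13:41:31 Job executing on host: <192.168.0.2:3817>",
    "022 (290.000.000) 08/03 15:41:36 Job disconnected, attempting to reconnect",
    "024 (290.000.000) 08/03 15:41:37 Job reconnection failed",
]

# Assumed line layout: 3-digit event code, (job id), MM/DD HH:MM:SS, message.
EVENT_RE = re.compile(r"^(\d{3}) \(\S+\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")


def parse_events(lines, year=2006):
    """Map event code -> timestamp. The year is an assumption; the log omits it."""
    events = {}
    for line in lines:
        match = EVENT_RE.match(line)
        if match:
            code, stamp = match.groups()
            events[code] = datetime.strptime(f"{year} {stamp}", "%Y %m/%d %H:%M:%S")
    return events


events = parse_events(LOG_LINES)
# Time from "Job executing" (001) to "Job disconnected" (022).
gap_seconds = int((events["022"] - events["001"]).total_seconds())
print(f"executing -> disconnected: {gap_seconds} s")
```

So the shadow only noticed the dead socket roughly two hours after the job started, far longer than a 60 second lease would suggest.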

Then that looks a bit bad. The following is all off the top of my head, but I
think it is still valid for the latest versions...

POLLING_INTERVAL is for the startd, but ALIVE_INTERVAL is how often the
schedd sends a keep-alive. I think the default is 5 minutes (the setting is
in seconds).
MAX_SHADOW_EXCEPTIONS then determines how many times an error can occur
before it gives up, so in theory MAX_SHADOW_EXCEPTIONS * ALIVE_INTERVAL
should determine how long the schedd goes on treating a bad claim as still
held (if there is no lease logic allowing an extension).
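
As a rough sketch of that arithmetic in Python: ALIVE_INTERVAL = 300 is the 5 minute figure I mentioned above, while MAX_SHADOW_EXCEPTIONS = 5 is purely an assumed value for illustration, so substitute whatever your own config actually says.

```python
# Both values are assumptions for illustration; check your own config.
ALIVE_INTERVAL = 300        # seconds between schedd keep-alives (5 minutes)
MAX_SHADOW_EXCEPTIONS = 5   # hypothetical value, not read from any real pool


def worst_case_delay(alive_interval, max_shadow_exceptions):
    """Rule of thumb from this mail: delay = exceptions * keep-alive interval."""
    return alive_interval * max_shadow_exceptions


delay = worst_case_delay(ALIVE_INTERVAL, MAX_SHADOW_EXCEPTIONS)
print(f"expected delay with these settings: {delay} s ({delay / 60:.0f} min)")

# How large would MAX_SHADOW_EXCEPTIONS need to be to explain a 2 hour gap?
observed_delay = 2 * 60 * 60  # 7200 seconds
exceptions_needed = observed_delay / ALIVE_INTERVAL
print(f"exceptions needed for a 2 hour delay: {exceptions_needed:.0f}")
```

With those assumed numbers the rule of thumb gives 25 minutes, and you would need the product of the two settings to reach 7200 seconds to account for the 2 hours you saw.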

Have you set these to numbers which would give you a 2 hour delay?

If not, this suggests it might be an issue.

What do the schedd and shadow logs indicate during this time?

Matt