
Re: [Condor-users] Antwort: Re: Fault Behaviour of Condor



Hi Matt,

3.) Shutting down the NIC on the executor

I ran the same test as 3.) on my Condor pool,
which consists of three Windows machines running v6.8.0.
The job was successfully rescheduled and ran to completion.

See the job's log file below.
I set JobLeaseDuration to 60 seconds,
but it took 2 hours from shutting down the NIC until the job was rescheduled.
(I cut the NIC immediately after the job started executing.)

Is JobLeaseDuration actually taking effect?

Thanks,
Kohei

---- log file ----
000 (290.000.000) 08/03 13:41:25 Job submitted from host: <192.168.0.1:4822>
...
001 (290.000.000) 08/03 13:41:31 Job executing on host: <192.168.0.2:3817>
...
006 (290.000.000) 08/03 13:41:39 Image size of job updated: 24200
...
022 (290.000.000) 08/03 15:41:36 Job disconnected, attempting to reconnect
   Socket between submit and execute hosts closed unexpectedly
   Trying to reconnect to vm1@xxxxxxxxxxxxxxxxxxx <192.168.0.2:3817>
...
024 (290.000.000) 08/03 15:41:37 Job reconnection failed
   Job disconnected too long: JobLeaseDuration (60 seconds) expired
   Can not reconnect to vm1@xxxxxxxxxxxxxxxxxxx, rescheduling job
...
001 (290.000.000) 08/03 15:50:20 Job executing on host: <192.168.0.3:1044>
...
005 (290.000.000) 08/03 15:52:37 Job terminated.
(1) Normal termination (return value 0)
 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
35317016  -  Run Bytes Sent By Job
135168  -  Run Bytes Received By Job
35317016  -  Total Bytes Sent By Job
135168  -  Total Bytes Received By Job
...
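
The gaps between the events above can be checked with a short script. This is only a sketch: the regular expression is an assumption based on the event header lines shown in this log, the helper names parse_events and seconds_between are made up for illustration, and the year 2006 is assumed because the log lines carry no year.

```python
import re
from datetime import datetime

# Matches event header lines such as:
# 022 (290.000.000) 08/03 15:41:36 Job disconnected, attempting to reconnect
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)


def parse_events(log_text, year=2006):
    """Return (timestamp, event_code, text) for each event header line."""
    events = []
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            code, _job_id, date, time, text = m.groups()
            stamp = datetime.strptime(f"{year}/{date} {time}", "%Y/%m/%d %H:%M:%S")
            events.append((stamp, code, text))
    return events


def seconds_between(events, first_code, second_code):
    """Seconds from the first first_code event to the next second_code event."""
    start = next(s for s, c, _ in events if c == first_code)
    end = next(s for s, c, _ in events if c == second_code and s >= start)
    return int((end - start).total_seconds())


# Event header lines copied from the log above (detail lines omitted).
LOG = """\
000 (290.000.000) 08/03 13:41:25 Job submitted from host: <192.168.0.1:4822>
001 (290.000.000) 08/03 13:41:31 Job executing on host: <192.168.0.2:3817>
022 (290.000.000) 08/03 15:41:36 Job disconnected, attempting to reconnect
024 (290.000.000) 08/03 15:41:37 Job reconnection failed
001 (290.000.000) 08/03 15:50:20 Job executing on host: <192.168.0.3:1044>
"""

events = parse_events(LOG)
detect_delay = seconds_between(events, "001", "022")   # execute -> disconnect noticed
lease_delay = seconds_between(events, "022", "024")    # disconnect -> lease expired
restart_delay = seconds_between(events, "024", "001")  # lease expired -> running again

print("disconnect noticed after", detect_delay, "s")
print("lease reported expired after", lease_delay, "s")
print("job running again after", restart_delay, "s")
```

Read this way, almost all of the delay lies between the job starting and the submit side logging the disconnect. Once the disconnect was logged, the lease was reported as expired within a second.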



----- Original Message ----- From: "Matt Hope" <matthew.hope@xxxxxxxxx>
To: "Condor-Users Mail List" <condor-users@xxxxxxxxxxx>
Sent: Thursday, August 03, 2006 5:32 PM
Subject: Re: [Condor-users] Antwort: Re: Fault Behaviour of Condor


On 8/3/06, thomas.t.hoppe@xxxxxxxxxxxxxxxxxxx
<thomas.t.hoppe@xxxxxxxxxxxxxxxxxxx> wrote:


Hi Matt,

For 1.) and 2.) the behaviour is just fine! -- I've also followed the discussion
regarding disk failure.
Maybe the documentation should state more clearly that
Condor's default behaviour is to restart a job in case of a fault
(I might have overlooked that).

I guess that is generally perceived as the 'proper' default behaviour
for a job queue system.
Note that by using the periodic_* and on_exit_* expressions on
submission you can change this.

Regarding 3.)
I gave it over an hour I think.

What is your job lease duration (if you are using it)?

I've updated my Executors to 6.8 but the behaviour persists.
Do you think moving the central manager to 6.8 can resolve this?

Shadows failing to die when their starter is not talking to them
anymore is not something an upgrade to the collector/negotiator can
solve.

If your executors are on 6.8 you probably want your submitters to be
6.8 as well...

Matt