
[Condor-users] Job disconnected, reconnect failed, Condor schedules another execution, now I have two copies of the same job running at once



Hi.  

I am submitting about 1000 jobs to a Windows pool; in total they should take about 4 hours.

A small cluster (~20 jobs) completes successfully, but on larger clusters that run longer I am seeing problems, with failure rates from 1% to 10%.  According to the log, it periodically loses the connection:

Grep'd:
000 (13272.000.000) 05/15 01:06:05 Job submitted from host: <10.44.7.143:49187>
001 (13272.000.000) 05/15 04:10:31 Job executing on host: <10.44.7.24:1050>
006 (13272.000.000) 05/15 04:10:39 Image size of job updated: 12540
006 (13272.000.000) 05/15 04:15:38 Image size of job updated: 136788
022 (13272.000.000) 05/15 06:15:39 Job disconnected, attempting to reconnect
024 (13272.000.000) 05/15 06:15:39 Job reconnection failed
001 (13272.000.000) 05/15 06:15:41 Job executing on host: <10.44.7.24:1052>
005 (13272.000.000) 05/15 06:15:50 Job terminated.


Detail:

...
022 (13272.000.000) 05/15 06:15:39 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to condor-05.lggm.llc <10.44.7.24:1050>
...
024 (13272.000.000) 05/15 06:15:39 Job reconnection failed
    Job disconnected too long: JobLeaseDuration (1200 seconds) expired
    Can not reconnect to condor-05.lggm.llc, rescheduling job

Then it gets rerun:

...
001 (13272.000.000) 05/15 06:15:41 Job executing on host: <10.44.7.24:1052>
...
005 (13272.000.000) 05/15 06:15:50 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
884  -  Run Bytes Sent By Job
1136  -  Run Bytes Received By Job
884  -  Total Bytes Sent By Job
1136  -  Total Bytes Received By Job
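
(Side note: the 1200 seconds quoted in the failure message should be the JobLeaseDuration attribute from the job's ad.  For a job still in the queue I believe it can be inspected with something like

    condor_q -long 13272.0

and then looking for the JobLeaseDuration line, though I haven't double-checked that on my pool.)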


My problem is that the job writes its output to a file on a shared filesystem.  From my own application's log, I can tell that two instances of the job are running at the same time.  Both of them try to access the output file and one of them fails.

This can produce noisy results, because one instance will keep failing (I allow up to 5 retries via on_exit_remove, using JobRunCount and ExitCode == 0).  If the job can't write to the output location, that is generally treated as an error.  I never expected Condor would end up running the same job twice in parallel.

Should the first job be evicted when Condor loses communication with it?  I have my NEGOTIATOR_INTERVAL set to 30 seconds; is that conflicting with some other timer that is still at its default?
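
For reference, here are the two settings I believe are in play.  The NEGOTIATOR_INTERVAL line is what I have now in condor_config; the job_lease_duration line is only my untested guess at how I would lengthen the reconnect window beyond the 1200 seconds shown in the log, not something I have actually tried:

    # condor_config on the central manager (current setting)
    NEGOTIATOR_INTERVAL = 30

    # possible submit file addition, to give the shadow longer to reconnect
    # before the job is considered lost and rescheduled (seconds, untested)
    job_lease_duration = 3600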

My submit file looks like this:

Universe           = vanilla
Log                = CONDOR.log
run_as_owner       = true
requirements       = substr(OpSys,0,5) == "WINNT"
concurrency_limits = ERDASENGINE
on_exit_remove     = ExitCode == 0 || (ExitCode != 0 && JobRunCount >= 5)
+pid               = "96_11"
Executable         = 96_11_resampleprocess_24011_img_0.bat
Output             = 96_11_resampleprocess_24011_img_0.out
Error              = 96_11_resampleprocess_24011_img_0.err
+eid               = "resampleprocess_24011_img_0"
Queue
...
(the same block repeats for each of the ~1000 jobs)



--Derrick