
[Condor-users] Job disconnected, reconnect failed, Condor schedules another execution, now I have two copies of the same job running at once



Hi.  

I am submitting about 1000 jobs to a Windows pool; in total they should take about 4 hours.

A small cluster (~20 jobs) completes successfully, but on larger clusters that run longer I am seeing problems, with failure rates from 1% to 10%.  According to the log, it periodically loses the connection:

Grep'd:
000 (13272.000.000) 05/15 01:06:05 Job submitted from host: <10.44.7.143:49187>
001 (13272.000.000) 05/15 04:10:31 Job executing on host: <10.44.7.24:1050>
006 (13272.000.000) 05/15 04:10:39 Image size of job updated: 12540
006 (13272.000.000) 05/15 04:15:38 Image size of job updated: 136788
022 (13272.000.000) 05/15 06:15:39 Job disconnected, attempting to reconnect
024 (13272.000.000) 05/15 06:15:39 Job reconnection failed
001 (13272.000.000) 05/15 06:15:41 Job executing on host: <10.44.7.24:1052>
005 (13272.000.000) 05/15 06:15:50 Job terminated.


Detail:

...
022 (13272.000.000) 05/15 06:15:39 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to condor-05.lggm.llc <10.44.7.24:1050>
...
024 (13272.000.000) 05/15 06:15:39 Job reconnection failed
    Job disconnected too long: JobLeaseDuration (1200 seconds) expired
    Can not reconnect to condor-05.lggm.llc, rescheduling job

Then it gets rerun:

...
001 (13272.000.000) 05/15 06:15:41 Job executing on host: <10.44.7.24:1052>
...
005 (13272.000.000) 05/15 06:15:50 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
884  -  Run Bytes Sent By Job
1136  -  Run Bytes Received By Job
884  -  Total Bytes Sent By Job
1136  -  Total Bytes Received By Job
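
(Side note: the 1200 seconds quoted in the failure message should be the JobLeaseDuration attribute from the job's ad.  For a job still in the queue I believe it can be inspected with something like

    condor_q -long 13272.0

and then looking for the JobLeaseDuration line, though I haven't double-checked that on my pool.)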


My problem is that the job writes its output to a file on a shared filesystem.  From my own application's log, I can tell that two instances of the job are running at the same time.  Both of them try to access the output file and one of them fails.

This can produce noisy results, because one instance will keep failing (I allow up to 5 retries via on_exit_remove, using JobRunCount and ExitCode == 0).  If the job can't write to the output location, that is generally treated as an error.  I never expected Condor would end up running the same job twice in parallel.

Should the first job be evicted when Condor loses communication with it?  I have my NEGOTIATOR_INTERVAL set to 30 seconds; is that conflicting with some other timer that is still at its default?
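
For reference, here are the two settings I believe are in play.  The NEGOTIATOR_INTERVAL line is what I have now in condor_config; the job_lease_duration line is only my untested guess at how I would lengthen the reconnect window beyond the 1200 seconds shown in the log, not something I have actually tried:

    # condor_config on the central manager (current setting)
    NEGOTIATOR_INTERVAL = 30

    # possible submit file addition, to give the shadow longer to reconnect
    # before the job is considered lost and rescheduled (seconds, untested)
    job_lease_duration = 3600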

My submit file looks like this:

Universe           = vanilla
Log                = CONDOR.log
run_as_owner       = true
requirements       = substr(OpSys,0,5) == "WINNT"
concurrency_limits = ERDASENGINE
on_exit_remove     = ExitCode == 0 || (ExitCode != 0 && JobRunCount >= 5)
+pid               = "96_11"
Executable         = 96_11_resampleprocess_24011_img_0.bat
Output             = 96_11_resampleprocess_24011_img_0.out
Error              = 96_11_resampleprocess_24011_img_0.err
+eid               = "resampleprocess_24011_img_0"
Queue
...
(the same block repeats for each of the ~1000 jobs)



--Derrick