
[Condor-users] Jobs don't run on execute machines



$CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
$CondorPlatform: INTEL-WINNT50 $

The pool is about a dozen Windows XP computers, most of them 4-core,
with a few 2-core machines. I am submitting from a 4-core machine that
can also execute jobs on all 4 cores, as can all the other machines
except the condor master machine, which we limit to running jobs on just
3 cores so as not to overload it.

The program run is a numerical model which is both cpu- and
disk-intensive, so using even 3 cores noticeably impacts the interactive
use of a given computer.

The problem is that after some time--perhaps a few hours--the jobs fail
to run on any execute machine, instead generating these errors:

024 (5193.000.000) 02/03 07:15:44 Job reconnection failed
    Job disconnected too long: JobLeaseDuration (300 seconds) expired
    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
...
022 (5193.000.000) 02/03 08:11:22 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot2@xxxxxxxxxxxxxxxxxxxxx <136.200.32.179:4314>
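
To see when these events cluster, the job log can be tallied by hour.
The following is only a rough sketch in Python, not a Condor tool: it
assumes each event starts with a header line shaped like the two above
(a three-digit event code, the job id in parentheses, then date and
time), and the function name tally_by_hour is made up for illustration.

```python
import re
from collections import Counter

# Header line of a job log event, as seen in the excerpt above, e.g.
# "022 (5193.000.000) 02/03 08:11:22 Job disconnected, ..."
# Assumption: every event header in the log follows this shape.
EVENT_HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}):\d{2}:\d{2} "
)


def tally_by_hour(log_text, codes=("022", "024")):
    """Count events with the given codes, grouped by (code, 'MM/DD HH').

    022 and 024 are the codes shown in the excerpt for "Job disconnected"
    and "Job reconnection failed"; other codes are ignored by default.
    """
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_HEADER.match(line)
        if match and match.group(1) in codes:
            code = match.group(1)
            hour = f"{match.group(5)} {match.group(6)}"
            counts[(code, hour)] += 1
    return counts
```

Reading the job log file into a string and printing the sorted items of
the returned counter gives one line per event code and hour, which can
then be compared against the times of interactive logins.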

This unwanted behavior *may* be triggered by my interactive use of the
submitting machine. I say this because, for instance, today hundreds of
jobs ran successfully overnight, only to start disconnecting when I
remotely logged in to the submitting machine to check progress. It might
be a coincidence.

I wonder if I can prevent the disconnecting by running fewer or no jobs
on the submitting machine? Even though it has 4 cores, it is also
running 30-40 condor_shadows and exchanging a few hundred MB per job
with the remote jobs. Having read about job leases in the manual, it
seems the problem lies with the submitting machine. Or could it be the
condor master machine?

Ralph Finch, P.E.
Senior Engineer, W.R.
California Dept. of Water Resources
Bay-Delta Office, Delta Modeling Section
Room 215-13
1416 9th Street
Sacramento, CA 95814

916-653-7552
rfinch@xxxxxxxxxxxx