Re: [Condor-users] Jobs don't run on execute machines



I should have mentioned that everything is set right again by issuing a
condor_restart -all.  But of course I'd rather find the root of the
problem.

RF

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-
> bounces@xxxxxxxxxxx] On Behalf Of Finch, Ralph
> Sent: Wednesday, February 03, 2010 9:25 AM
> To: condor-users@xxxxxxxxxxx
> Subject: [Condor-users] Jobs don't run on execute machines
> 
> $CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
> $CondorPlatform: INTEL-WINNT50 $
> 
> The pool is about a dozen Windows XP computers, most are 4-core with a
> few 2-core machines. I am submitting from a 4-core machine which
> potentially can also execute on all 4 cores, as can all the other
> machines except the condor master machine; that one we limit to running
> on just 3 cores in an attempt to not overload it.
> 
> The program run is a numerical model which is both cpu- and
> disk-intensive, so using even 3 cores noticeably impacts the interactive
> use of a given computer.
> 
> The problem is that after some time--perhaps a few hours--the jobs fail
> to run on any execute machine, instead generating these errors:
> 
> 024 (5193.000.000) 02/03 07:15:44 Job reconnection failed
>     Job disconnected too long: JobLeaseDuration (300 seconds) expired
>     Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
> ...
> 022 (5193.000.000) 02/03 08:11:22 Job disconnected, attempting to reconnect
>     Socket between submit and execute hosts closed unexpectedly
>     Trying to reconnect to slot2@xxxxxxxxxxxxxxxxxxxxx <136.200.32.179:4314>
> 
> This unwanted behavior *may* be triggered by my interactive use of the
> submitting machine. I say this because, for instance, today hundreds of
> jobs ran successfully overnight, only to start disconnecting when I
> remotely logged in to the submitting machine to check progress.  Might
> be a coincidence.
> 
> I wonder if I can prevent the disconnecting by running fewer or no jobs
> on the submitting machine? Even though it has 4 cores, it is also
> running 30-40 condor_shadows and receiving and sending a few 100MB per
> job from and to the remote jobs. Having read about job leases in the
> manual, it seems the problem lies with the submitting machine.  Or could
> it be the condor master machine?
> 
> Ralph Finch, P.E.
> Senior Engineer, W.R.
> California Dept. of Water Resources
> Bay-Delta Office, Delta Modeling Section
> Room 215-13
> 1416 9th Street
> Sacramento, CA 95814
> 
> 916-653-7552
> rfinch@xxxxxxxxxxxx
> 
> 
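
One way to check whether the disconnects really coincide with interactive
use of the submitting machine might be to tally the job log events by hour
and compare the result against login times. The following is a minimal
Python sketch, not part of the original post. The regular expression is
inferred only from the two event header lines quoted above, the name
tally_by_hour is hypothetical, and the sample text is copied from that
excerpt (with the wrapped lines rejoined).

```python
import re
from collections import Counter

# Event header as it appears in the quoted excerpt (an assumption based on
# those two lines only):
#   <3-digit code> (<cluster.proc.subproc>) <MM/DD> <HH:MM:SS> <description>
EVENT_HEADER = re.compile(
    r"^(?P<code>\d{3}) \((?P<job>\d+\.\d+\.\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<hour>\d{2}):\d{2}:\d{2} (?P<text>.*)$"
)


def tally_by_hour(lines):
    """Count event headers by (code, date, hour); detail lines are skipped."""
    counts = Counter()
    for line in lines:
        match = EVENT_HEADER.match(line.strip())
        if match:
            key = (match.group("code"), match.group("date"), match.group("hour"))
            counts[key] += 1
    return counts


# Sample input taken from the excerpt in the quoted message.
SAMPLE = """\
024 (5193.000.000) 02/03 07:15:44 Job reconnection failed
    Job disconnected too long: JobLeaseDuration (300 seconds) expired
...
022 (5193.000.000) 02/03 08:11:22 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
""".splitlines()

if __name__ == "__main__":
    for (code, date, hour), n in sorted(tally_by_hour(SAMPLE).items()):
        print(f"event {code} on {date} during hour {hour}: {n}")
```

Run against the full job log instead of SAMPLE, a cluster of 022 and 024
events in the hours right after a remote login would support the
interactive-use suspicion; an even spread would point elsewhere.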