Re: [Condor-users] Job run time limit ?
- Date: Thu, 16 Sep 2004 09:42:35 -0500
- From: David Kotz <dkotz@xxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Job run time limit ?
I don't know what's killing your jobs, but I can tell you that in my
cluster I routinely have jobs which run for months.
I would check the obvious first --
* do you have space in the spool directory?
* has your execute machine logged any errors?
* is the job consuming all available memory and swap?
* will your program run to completion outside of Condor?
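
The disk and memory checks in the list above can be scripted. What follows is a minimal Python sketch, not part of Condor. It assumes a Linux machine (it parses text in the /proc/meminfo format) and that you already know the spool path, for example from running condor_config_val SPOOL. The function names are hypothetical, chosen only for illustration.

```python
import shutil


def free_fraction(path):
    """Return the free fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only "Name:  <number> [kB]" lines; skip anything else.
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def memory_report(meminfo):
    """Return (free RAM fraction, free swap fraction or None if no swap)."""
    ram = meminfo["MemFree"] / meminfo["MemTotal"]
    swap_total = meminfo.get("SwapTotal", 0)
    swap = meminfo.get("SwapFree", 0) / swap_total if swap_total else None
    return ram, swap


def report(spool_dir, meminfo_path="/proc/meminfo"):
    """Collect the spool and memory numbers into printable lines (Linux only)."""
    with open(meminfo_path) as f:
        ram, swap = memory_report(parse_meminfo(f.read()))
    lines = [f"spool {spool_dir}: {free_fraction(spool_dir):.1%} free",
             f"RAM: {ram:.1%} free"]
    lines.append("swap: none configured" if swap is None
                 else f"swap: {swap:.1%} free")
    return lines
```

Calling report() with the spool directory on the submit machine, and again on an execute machine while the job runs, gives a rough picture. Note that MemFree does not count memory the kernel can reclaim from caches, so a low RAM figure alone is not proof of a problem; swap filling up is the stronger signal.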
- dave
On Thu, 2004-09-16 at 02:07, Manuel Bardiès wrote:
> No reply to our question it seems...
> Still I don't know how to solve that problem.
> Trouble is that we have jobs that last far more than 30 hours, even
> on a cluster...
>
> Any hints or ideas, anyone?
>
> Thanks,
>
> M Bardies
>
> On 13 Sept. 04, at 15:29, Jérôme Jaglale wrote:
>
> Hi,
>
> We're using Condor to execute jobs which take a lot of time.
> We easily executed some which took 27 hours. Is there a max run
> time limit? We ask because we launched a longer job, and it stopped
> after approximately 65 hours (we tried again two times):
>
> 000 (044.009.000) 09/09 15:44:56 Job submitted from host: <172.18.45.80:51293>
> 001 (044.009.000) 09/09 15:50:18 Job executing on host: <192.168.1.15:49234>
> ......
> 007 (044.009.000) 09/12 09:18:00 Shadow exception!
> Can no longer talk to condor_starter on execute machine
> (192.168.1.15)
> 0 - Run Bytes Sent By Job
> 2176017 - Run Bytes Received By Job
>
>
> In the Shadow log:
>
> 9/12 09:12:10 (44.7) (10025): ERROR "Can no longer talk to
> condor_starter on execute machine (192.168.1.23)" at line 63 in
> file NTreceivers.C
> 9/12 09:12:57 (44.4) (10013): ERROR "Can no longer talk to
> condor_starter on execute machine (192.168.1.22)" at line 63 in
> file NTreceivers.C
> 9/12 09:13:04 (44.6) (10023): ERROR "Can no longer talk to
> condor_starter on execute machine (192.168.1.23)" at line 63 in
> file NTreceivers.C
> 9/12 09:14:06 (44.8) (10026): ERROR "Can no longer talk to
> condor_starter on execute machine (192.168.1.15)" at line 63 in
> file NTreceivers.C
> 9/12 09:14:14 (44.1) (10010): ERROR "Can no longer talk to
> condor_starter on execute machine (192.168.1.20)" at line 63 in
> file NTreceivers.C
> 9/12 09:14:18 (44.0) (10009): ERROR "Can no longer talk to
> condor_starter on execute machine (192.168.1.20)" at line 63 in
> file NTreceivers.C
> 9/12 09:15:00 (44.3) (10012): ERROR "Can no longer talk to
> condor_starter on execute machine (192.168.1.21)" at line 63 in
> file NTreceivers.C
> 9/12 09:15:06 (44.2) (10011): ERROR "Can no longer talk to
> condor_starter on execute machine (192.168.1.21)" at line 63 in
> file NTreceivers.C
> 9/12 09:15:14 (44.5) (10014): ERROR "Can no longer talk to
> condor_starter on execute machine (192.168.1.22)" at line 63 in
> file NTreceivers.C
> 9/12 09:18:00 (44.9) (10151): ERROR "Can no longer talk to
> condor_starter on execute machine (192.168.1.15)" at line 63 in
> file NTreceivers.C
>
> Thanks for your help,
> Jérôme Jaglale
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> http://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> Manuel Bardiès
> INSERM UMR 601
> 9 Quai Moncousu
> 44093 Nantes cedex
> -----------------------------
> Tel: 02 40 41 28 21
> Fax: 02 40 35 66 97
> Sec: 02 40 08 47 47