Re: [Condor-users] Job run time limit ?



I don't know what's killing your jobs, but I can tell you that in my
cluster I routinely have jobs which run for months.

I would check the obvious first -- 

* do you have space in the spool directory?
* has your execute machine logged any errors?
* is the job consuming all available memory and swap?
* will your program run to completion outside of Condor?
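
The first two checks are easy to script. Below is a minimal sketch in
Python, not a Condor tool: the two paths are hypothetical placeholders
(ask your own pool with `condor_config_val SPOOL` and
`condor_config_val LOG`), and the helper names free_fraction and
error_lines are made up for illustration. It does not cover the memory
and swap check.

```python
import shutil
from pathlib import Path


def free_fraction(path):
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def error_lines(log_path, marker="ERROR"):
    """Return the lines of a log file that contain `marker`."""
    with open(log_path, errors="replace") as log:
        return [line.rstrip("\n") for line in log if marker in line]


if __name__ == "__main__":
    # Hypothetical locations; substitute the values reported by
    # `condor_config_val SPOOL` and `condor_config_val LOG` on your machines.
    spool = Path("/var/lib/condor/spool")
    starter_log = Path("/var/log/condor/StarterLog")

    if spool.exists():
        print(f"spool free space: {free_fraction(spool):.1%}")
    if starter_log.exists():
        for line in error_lines(starter_log):
            print(line)
```

Run it on the submit machine for the spool check and on each execute
machine for the log check.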

- dave


On Thu, 2004-09-16 at 02:07, Manuel Bardiès wrote:
> No reply to our question it seems...
> Still I don't know how to solve that problem.
> Trouble is that we have jobs that last far more than 30 hours, even
> on a cluster...
> 
> Any hint/idea someone?
> 
> Thanks,
> 
> M Bardies
> 
> On 13 Sept. 04, at 15:29, Jérôme Jaglale wrote:
> 
>         Hi,
>         
>         we're using Condor to execute jobs which take a lot of time.
>         We easily executed some which took 27 hours. Is there a max run
>         time limit? Because we launched a longer job, and it stopped
>         after approximately 65 hours (we tried again two times):
>         
>         000 (044.009.000) 09/09 15:44:56 Job submitted from host:
>         <172.18.45.80:51293>
>         001 (044.009.000) 09/09 15:50:18 Job executing on
>         host: <192.168.1.15:49234>
>         ......
>         007 (044.009.000) 09/12 09:18:00 Shadow exception!
>         Can no longer talk to condor_starter on execute machine
>         (192.168.1.15)
>         0  -  Run Bytes Sent By Job
>         2176017  -  Run Bytes Received By Job
>         
>         
>         In the Shadow log : 
>         
>         9/12 09:12:10 (44.7) (10025): ERROR "Can no longer talk to
>         condor_starter on execute machine (192.168.1.23)" at line 63 in
>         file NTreceivers.C
>         9/12 09:12:57 (44.4) (10013): ERROR "Can no longer talk
>         to condor_starter on execute machine (192.168.1.22)" at line 63
>         in file NTreceivers.C
>         9/12 09:13:04 (44.6) (10023): ERROR "Can no longer talk
>         to condor_starter on execute machine (192.168.1.23)" at line 63
>         in file NTreceivers.C
>         9/12 09:14:06 (44.8) (10026): ERROR "Can no longer talk
>         to condor_starter on execute machine (192.168.1.15)" at line 63
>         in file NTreceivers.C
>         9/12 09:14:14 (44.1) (10010): ERROR "Can no longer talk
>         to condor_starter on execute machine (192.168.1.20)" at line 63
>         in file NTreceivers.C
>         9/12 09:14:18 (44.0) (10009): ERROR "Can no longer talk
>         to condor_starter on execute machine (192.168.1.20)" at line 63
>         in file NTreceivers.C
>         9/12 09:15:00 (44.3) (10012): ERROR "Can no longer talk
>         to condor_starter on execute machine (192.168.1.21)" at line 63
>         in file NTreceivers.C
>         9/12 09:15:06 (44.2) (10011): ERROR "Can no longer talk
>         to condor_starter on execute machine (192.168.1.21)" at line 63
>         in file NTreceivers.C
>         9/12 09:15:14 (44.5) (10014): ERROR "Can no longer talk
>         to condor_starter on execute machine (192.168.1.22)" at line 63
>         in file NTreceivers.C
>         9/12 09:18:00 (44.9) (10151): ERROR "Can no longer talk
>         to condor_starter on execute machine (192.168.1.15)" at line 63
>         in file NTreceivers.C
>         
>         Thanks for your help,
>         Jérôme Jaglale
>         _______________________________________________
>         Condor-users mailing list
>         Condor-users@xxxxxxxxxxx
>         http://lists.cs.wisc.edu/mailman/listinfo/condor-users
>         
> Manuel Bardiès
> INSERM UMR 601
> 9 Quai Moncousu
> 44093 Nantes cedex
> -----------------------------
> Tel:   02 40 41 28 21
> Fax:  02 40 35 66 97
> Sec:  02 40 08 47 47
> 
> 
> 