
Re: [Condor-users] Job run time limit ?




we're using Condor to execute jobs which take a lot of time. We easily executed some which took 27 hours. Is there a max run time limit?

No, there is no maximum run time. Users can impose one on their own jobs with the periodic_remove expression, but this is rarely done, and it is not your problem.
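
For reference, here is a minimal sketch of what such a self-imposed limit could look like in a submit description file. The 48-hour cap is only an illustrative number, not something from your setup, and the expression measures time since the job last entered the running state rather than total accumulated run time:

```
# Hypothetical example: remove the job if it has been running
# (JobStatus == 2) for more than 48 hours since it last started.
periodic_remove = (JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > (48 * 3600))
```

Without a line like this in the submit file, Condor will not remove a job for running too long.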


007 (044.009.000) 09/12 09:18:00 Shadow exception!
Can no longer talk to condor_starter on execute machine (192.168.1.15)
0 - Run Bytes Sent By Job
2176017 - Run Bytes Received By Job

What is in the StarterLog on the computer that the job was executing on (192.168.1.15) at that time? That information would be very helpful.


In the Shadow log :

9/12 09:12:10 (44.7) (10025): ERROR "Can no longer talk to condor_starter on execute machine (192.168.1.23)" at line 63 in file NTreceivers.C
9/12 09:12:57 (44.4) (10013): ERROR "Can no longer talk to condor_starter on execute machine (192.168.1.22)" at line 63 in file NTreceivers.C
9/12 09:13:04 (44.6) (10023): ERROR "Can no longer talk to condor_starter on execute machine (192.168.1.23)" at line 63 in file NTreceivers.C
9/12 09:14:06 (44.8) (10026): ERROR "Can no longer talk to condor_starter on execute machine (192.168.1.15)" at line 63 in file NTreceivers.C
9/12 09:14:14 (44.1) (10010): ERROR "Can no longer talk to condor_starter on execute machine (192.168.1.20)" at line 63 in file NTreceivers.C
9/12 09:14:18 (44.0) (10009): ERROR "Can no longer talk to condor_starter on execute machine (192.168.1.20)" at line 63 in file NTreceivers.C
9/12 09:15:00 (44.3) (10012): ERROR "Can no longer talk to condor_starter on execute machine (192.168.1.21)" at line 63 in file NTreceivers.C
9/12 09:15:06 (44.2) (10011): ERROR "Can no longer talk to condor_starter on execute machine (192.168.1.21)" at line 63 in file NTreceivers.C
9/12 09:15:14 (44.5) (10014): ERROR "Can no longer talk to condor_starter on execute machine (192.168.1.22)" at line 63 in file NTreceivers.C
9/12 09:18:00 (44.9) (10151): ERROR "Can no longer talk to condor_starter on execute machine (192.168.1.15)" at line 63 in file NTreceivers.C

It looks like the shadows (which watch over your jobs) lost contact with the starters (which start the jobs and monitor them on the execution computer). All ten of them lost contact within about six minutes of each other, between 09:12 and 09:18.


It sounds to me like you lost network connectivity, or a shared disk system became unavailable, or something like that. Is that a possibility?

Condor 6.7 can deal with this much better than Condor 6.6 if your job is in the vanilla or Java universes: you can set it up so that a temporary network outage doesn't cause the job to stop, but will only cause a failure if the outage lasts longer than a certain time that you specify.
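
As a rough sketch of what that setup might look like: I believe the relevant submit-file command is job_lease_duration, given in seconds, but treat both the name and the value below as assumptions and check them against the manual for the exact 6.7 release you install:

```
# Assumed command name and illustrative value: let the job survive
# losing contact with the submit side for up to 20 minutes.
universe           = vanilla
job_lease_duration = 1200
```

The idea is to pick a duration comfortably longer than the outages you expect, so a brief network hiccup no longer ends the job.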

-alain