
Re: [Condor-users] Job run time limit ?



On Fri, 04 Feb 2005 07:35:18 +0000, Mark Calleja
<M.Calleja@xxxxxxxxxxxxxxx> wrote:
> Alain Roy wrote:
> 
> > Condor 6.7 can deal with this much better than Condor 6.6 if your job
> > is in the vanilla or Java universes: you can set it up so that a
> > temporary network outage doesn't cause the job to stop, but will only
> > cause a failure if the outage lasts longer than a certain time that
> > you specify.
> 
> How exactly does one set this? A quick trawl through the 6.7 manual
> didn't prove very fruitful, so a pointer to the relevant entry would be
> much appreciated.

Errrm about 4 days ago the following conversation happened right here:

<quote>
> routine in the schedd? As the service is brought down does it reach 
> out to startds to tell it to terminate running jobs? Can I prevent 
> this so reboots are tolerated? Reboots are a necessary evil in our
> Windows development environment, unfortunately.

The job lease duration controls whether jobs survive a schedd reboot.

http://www.cs.wisc.edu/condor/manual/v6.7/2_13Special_Environment.html#sec:Job-Lease

you must
1) make sure your execute machines will allow leasing
2) make sure your submitters include "job_lease_duration" in their
submit scripts

Are you sure both of the above are happening?

(Also note that if you are using streaming output, another piece of 6.7
series functionality, it will prevent leasing from working.)

</quote>

and

<quote>
>Excellent. This setting should definitely be in the help portion of the
>manual for condor_submit -- I would never have thought to look for
>this information in section 2.

It's there, it's just buried amidst the list of options.

http://www.cs.wisc.edu/condor/manual/v6.7/condor_submit.html#39510

Search for "job_lease_duration". It's right under "stream_error".
</quote>
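
To put the two quotes together, a minimal submit description file might look something like the sketch below. The executable name, the file names and the 1200 second lease are placeholders I have made up for illustration, not recommended values, and this only covers the submit side; the execute machines still have to allow leasing as noted above.

```
# Sketch only: names and values below are placeholders.
universe           = vanilla
executable         = my_job
output             = my_job.out
error              = my_job.err
log                = my_job.log

# Lease length in seconds (placeholder value).
job_lease_duration = 1200

# Streaming output is reported above to prevent leasing from working,
# so it is switched off explicitly here.
stream_output      = false
stream_error       = false

queue
```

Check the condor_submit page for your exact version before relying on any of these settings.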

It should be noted that the Condor manual is NOT something you can do a
quick read of and get going. This is largely due to the way it is laid
out (a series of discrete blobs of info rather than a whole bunch of
quick start guides) and the inherent complexity of the system itself.
The former might be changed but I see no chance of the latter changing
anytime soon.

I would advise anyone reading this list thinking of using condor to
read the ENTIRE manual as it applies to them (you can skip condor-g if
you aren't using globus as well as any os specifics that do not apply)
before trying anything.

The time you invest in doing this will be rewarded....

After that, searching for

site:http://www.cs.wisc.edu manual/v6.7 reboot

(for example)

gives as its fifth entry:

http://www.cs.wisc.edu/condor/manual/v6.7.3/8_2Development_Release.html
<quote>
8.2.4 Version 6.7.0 
Release Notes: 


Version 6.7.0 contains all of the features, ports, and bug fixes from
the previous stable series, up to and including version 6.6.4. In
addition, a number of new features and some bug fixes have been made,
which are described below in more detail.

New Features: 


Added support for vanilla and Java jobs to reconnect when the
connection between the submitting and execution nodes is lost for any
reason. Possible reasons for this disconnect include: network outages,
rebooting the submit machine, restarting the Condor daemons on the
submit machine, etc. If the execution machine is rebooted or the
Condor daemons are restarted, reconnection is not possible. To take
advantage of this reconnect feature, jobs must be submitted with a
JobLeaseDuration. There are new events in the UserLog related to
disconnect and reconnect.
</quote>

It is a notable aspect of the docs that specific features (especially
in the dev series) seem to get the main 'overall' explanation in the
release notes, and then each part of the system (submission / startd
/ schedd) explains in its own section how it is affected.

I think this could be improved on for the average new user (though
RTFM still applies!)

Matt