
Re: [Condor-users] counting licenses



On 5/6/05, Joshua Kolden <joshua@xxxxxxxxxxxxxxxxx> wrote:
> 
> >Condor is very much about the individual users having a reasonable
> >awareness of the impact of their jobs on the wider world and
> >throttling as they see fit.
> >
> >
<snip>
>  Alfred from Pixar, although not the
> best queue software, has a 'ping' system which allows one to run any
> command before a job is started, very easy to implement, and very
> effective for global management.

Do you mean from the submitting machine or the execution machine?

If it's the execution machine, you can submit a wrapper script as the
job, transfer the exe separately, and have the script invoke the exe.
Obviously this means you cannot guarantee that any post-execute
commands will run.
The alternative is DAGMan's PRE and POST scripts, which run on the
submitting machine (just have a DAG with one job; a bit of a hammer to
crack a nut, but perfectly doable).
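
As a rough sketch of the wrapper-script approach, a submit description
file might look like the one below. The names check_and_run.sh and
render_app are hypothetical; the wrapper is assumed to run whatever
check you need and then invoke the real program from the job's scratch
directory.

```
# Sketch only: check_and_run.sh and render_app are made-up names.
universe                = vanilla
# The wrapper script is what Condor actually starts on the execute machine.
executable              = check_and_run.sh
# Ship the real program alongside the wrapper so the wrapper can invoke it.
transfer_input_files    = render_app
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
# Assumption: the wrapper treats its first argument as the program to run.
arguments               = render_app
output                  = job.out
error                   = job.err
log                     = job.log
queue
```

Depending on the setup, the wrapper may need to make the transferred
program executable before invoking it.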

> Some systems that don't offer global resource monitoring do allow you to
> return a failure from a job that is understood to mean try again in a
> little bit.  Such as a license failure return code.  Such a failure
> causes the job to not try to submit a new task for a set amount of time,
> or until an existing task finishes.  Unlike a normal failure, a
> resource failure return code never causes the task to be marked failed,
> it just keeps trying until it gets the resource.  If there is not such a
> system in Condor I would strongly encourage its addition.  It's the
> state of the art for visual effects queues circa 1995.  It doesn't solve
> the submission logic quite like we need, but it's better than nothing.

Condor has this:

http://www.cs.wisc.edu/condor/manual/v6.7/condor_submit.html
<quote>

on_exit_remove = <ClassAd Boolean Expression> 
This expression is checked when the job exits and if true, then it
allows the job to leave the queue normally. If false, then the job is
placed back into the Idle state. If the user job runs under the
vanilla universe, then the job restarts from the beginning. If the
user job runs under the standard universe, then it continues from
where it left off, using the last checkpoint.
For example, suppose you have a job that occasionally segfaults, but
you know if you run the job again with the same data, chances are that
the job will finish successfully. This is how you would represent that
with on_exit_remove (assuming the signal identifier for segmentation
fault is 11 on the platform where your job will be running):


on_exit_remove = (ExitBySignal == False) || (ExitSignal != 11)

This expression will only let the job leave the queue if the job was
not killed by a signal (it exited normally on its own) or if it was
killed by a signal other than 11 (representing segmentation fault).
So, if it was killed by signal 11, it will stay in the job queue. In
any other case of the job exiting, the job will leave the queue as it
normally would have done.

As another example, if your job should only leave the queue if it
exited on its own with status 0, you would use this on_exit_remove
expression:


on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

If the job was killed by a signal or exited with a non-zero exit
status, Condor would leave the job in the queue to run again.

If left unspecified, the on_exit_remove expression will default to True. 

periodic_* expressions take precedence over on_exit_* expressions, and
*_hold expressions take precedence over *_remove expressions.

This expression is available for the vanilla and java universes. It is
additionally available, when submitted from a Unix machine, for the
standard universe. Note that the condor_schedd daemon, by default,
only checks these periodic expressions once every 300 seconds. The
period of these evaluations can be adjusted by setting the
PERIODIC_EXPR_INTERVAL configuration macro.


on_exit_hold = <ClassAd Boolean Expression> 
This expression is checked when the job exits and if true, places the
job on hold. If false then nothing happens and the on_exit_remove
expression is checked to determine if that needs to be applied.
For example: Suppose a job is known to run for a minimum of an hour.
If the job exits after less than an hour, the job should be placed on
hold and an e-mail notification sent, instead of being allowed to
leave the queue.


on_exit_hold = (CurrentTime - JobStartDate) < (60 * $(MINUTE))

This expression places the job on hold if it exits for any reason
before running for an hour. An e-mail will be sent to the user
explaining that the job was placed on hold because this expression
became True.

periodic_* expressions take precedence over on_exit_* expressions, and
*_hold expressions take precedence over *_remove expressions.

If left unspecified, this will default to False. 

This expression is available for the vanilla and java universes. It is
additionally available, when submitted from a Unix machine, for the
standard universe.

</quote>
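
Applied to the license case, a sketch could look like this, assuming
(hypothetically) that the job or its wrapper exits with code 99 when it
cannot get a license:

```
# Sketch only: exit code 99 is a made-up "no license available" convention.
# Leave the queue unless the job exited on its own with code 99;
# in that case it goes back to Idle and will be run again.
on_exit_remove = (ExitBySignal == True) || (ExitCode != 99)
```

As far as I can tell, this only puts the job back in the queue; on its
own it does not add any delay before the job is matched and tried again.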

Hope that helps,
Matt