
Re: [Condor-users] Retrying a task



> Is there a simple way (e.g. via condor_submit) to ask Condor to retry a
> task (ideally up to N times) if it fails (exits with a non-zero code)?


In the submit description file you can use the on_exit_remove setting
(http://www.cs.wisc.edu/condor/manual/v6.6/condor_submit.html#37964) to
prevent Condor from marking a job as "completed" and letting it leave
the queue. If this expression evaluates to True, the job is deemed to
have run to completion and is allowed to leave the queue. If it evaluates
to False, the job returns to the queue to be run again.
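To make the True/False behaviour concrete, here is a toy Python model of the requeue loop. It is a sketch under stated assumptions, not Condor's scheduler: `run_job` is a hypothetical stand-in for one execution that returns an exit code, and the `on_exit_remove` callable plays the role of the submit-file expression, receiving the exit code and the number of runs so far. The `retry_policy` predicate (retry on a non-zero exit code, at most 6 runs) is an illustrative choice.

```python
from typing import Callable, Tuple


def simulate_queue(
    run_job: Callable[[], int],
    on_exit_remove: Callable[[int, int], bool],
) -> Tuple[int, int]:
    """Toy model of the requeue loop driven by on_exit_remove.

    run_job is a hypothetical stand-in for one execution and returns an
    exit code. on_exit_remove receives (exit_code, run_count). Returns
    (last exit code, number of runs). Loops forever if the predicate
    never becomes True.
    """
    run_count = 0
    while True:
        run_count += 1                   # count every start
        exit_code = run_job()
        if on_exit_remove(exit_code, run_count):
            return exit_code, run_count  # True: the job leaves the queue
        # False: the job goes back to the queue and runs again


# Hypothetical policy: retry on a non-zero exit code, at most 6 runs.
def retry_policy(code: int, runs: int) -> bool:
    return code == 0 or runs > 5


outcomes = iter([1, 1, 0])               # fails twice, then succeeds
print(simulate_queue(lambda: next(outcomes), retry_policy))  # (0, 3)
print(simulate_queue(lambda: 1, retry_policy))               # (1, 6)
```

Note that in this model a limit written as "runs > 5" allows six runs in total, because the count already includes the run that just finished.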

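You can use the job's ExitCode and JobRunCount attributes to build
your expression. ExitCode holds the exit status of a job that exited
normally (rather than being killed by a signal).

For example, to let the job leave the queue once it exits with code 0,
or once it has been run more than 5 times, you could say:

on_exit_remove = ExitCode == 0 || JobRunCount > 5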

This approach isn't perfect. JobRunCount is incremented every time a job
starts running, so if you have a lot of evictions going on in your system,
JobRunCount can grow large before your job ever gets a chance to run
to completion. Tune the limit to suit your needs and your system's behaviour.
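The eviction problem can be illustrated with a small replay of job starts. This Python sketch rests on a modelling assumption: the run counter rises on every start, as JobRunCount does, but the removal test is applied only when a run actually exits, not when the job is evicted. The event list, the `replay` function and its `max_runs` parameter are hypothetical, and the exit test mirrors the retry-on-non-zero-exit policy from the question.

```python
from typing import List, Optional, Tuple

# One entry per start of the job: None = evicted before it finished,
# an int = exit code of a run that completed (hypothetical data).
Event = Optional[int]


def replay(events: List[Event], max_runs: int = 5) -> Tuple[Optional[int], int, int]:
    """Return (exit code at removal, run count, completed runs).

    Assumed model: the run counter rises on every start, but the removal
    test is applied only when a run exits, not on eviction. The exit code
    is None if the job is still queued after all events.
    """
    run_count = 0
    completed = 0
    for event in events:
        run_count += 1
        if event is None:
            continue                     # evicted: requeued, no exit test
        completed += 1
        if event == 0 or run_count > max_runs:
            return event, run_count, completed  # job leaves the queue
    return None, run_count, completed


# Five evictions, then one failure: removed after a single real attempt.
print(replay([None] * 5 + [1]))  # (1, 6, 1)
# No evictions: the same limit allows six real attempts.
print(replay([1] * 6))           # (1, 6, 6)
```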

- Ian