
Re: [Condor-users] Getting Failed Jobs to Restart

On Aug 25, 2005, at 6:11 PM, Avi Flamholz wrote:

I am running a simple Python script to test my Condor configuration -
obviously in the vanilla universe. It simply computes the value of pi
for a while, times itself, and prints what machine it's on. I made it
run for a while so that I would have a chance to monitor it on the
remote machines.

The desired functionality is this - if a job fails (dies due to some
exception or failure), Condor should restart it from scratch. I have
noticed that in the standard universe, Condor can do this. Can it also
be done in the vanilla universe? What are the limitations? For the end
task it is unlikely that I will be able to relink the code, as it is
legacy material and there are few people around who know enough Pascal
to know what it's doing. So I would like to be able to support this
functionality in the vanilla universe.

Condor automatically restarts any job that fails to complete because of a Condor failure or because it was kicked off a machine, and that includes vanilla universe jobs. The standard universe goes one better by resuming the job from its last checkpoint instead of from the beginning.

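You can also tell Condor to restart jobs that exit on their own by setting the on_exit_remove expression in the submit description file. If the expression evaluates to false when the job exits, the job stays in the queue and is run again from scratch.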
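
As a rough sketch of what that could look like for the pi test, here is a submit description file that keeps the job in the queue unless it exits normally with status 0. The executable and file names are placeholders I made up for illustration, and the exact condition is an assumption about what counts as "failed" for your job; adjust it if your program uses other exit codes.

```
# Hypothetical submit description file; executable and file names are placeholders.
universe       = vanilla
executable     = compute_pi.py
output         = compute_pi.out
error          = compute_pi.err
log            = compute_pi.log

# Remove the job from the queue only if it exited on its own (not killed
# by a signal) with exit code 0. Otherwise it stays queued and is rerun.
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

queue
```

One caveat with this condition: a job that fails the same way every time will be rerun indefinitely, so it is worth watching the log file during testing.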

|            Jaime Frey            |  Public Split on Whether        |
|        jfrey@xxxxxxxxxxx         |  Bush Is a Divider              |
|  http://www.cs.wisc.edu/~jfrey/  |         -- CNN Scrolling Banner |