
Re: [Condor-users] Getting Failed Jobs to Restart



SIGQUIT and SIGKILL are treated specially in the standard universe. Condor interprets them as meaning the job was kicked off a machine with or without a checkpoint, respectively. Any other signal results in the job leaving the queue. In the vanilla universe, every signal results in the job leaving the queue.

With on_exit_remove, you can control when Condor lets the job leave the queue. For example, you can tell Condor to leave the job in the queue if it terminates with exit code 3. The expression is limited to information available in the job ad.
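
As a rough sketch of the exit-code-3 example above, a vanilla universe submit file might look something like the following. The file names (compute_pi.py, pi.out, pi.err, pi.log) are made up for illustration. ExitBySignal and ExitCode are job ad attributes describing how the job terminated; the exact expression is an assumption about what you want, so adjust it to taste.

```
# Hypothetical submit description file; all file names are placeholders.
universe   = vanilla
executable = compute_pi.py
output     = pi.out
error      = pi.err
log        = pi.log

# Let the job leave the queue unless it exited normally with code 3.
# A job that exits with code 3 stays in the queue and is run again.
on_exit_remove = (ExitBySignal == True) || (ExitCode != 3)

# Alternative (assumed use case): keep the job in the queue until it
# exits normally with code 0, so a crash from a signal triggers a rerun.
# on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

queue
```

Note that a job whose failure is permanent would keep being rerun under the alternative expression, so it is worth watching the queue for jobs that never finish.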

 -- Jaime

On Aug 26, 2005, at 10:30 AM, Avi Flamholz wrote:

Yes, but what if the job fails due to some internal exception, i.e. an
uncaught exception in C++? Can Condor do anything there? I noticed
that in the standard universe, if I sent a running process SIGKILL,
Condor would register it as idle and then restart it in the next
negotiation cycle.

Can this be done in the standard universe?

Thanks for the response, Jaime.
-Avi

On 8/26/05, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

On Aug 25, 2005, at 6:11 PM, Avi Flamholz wrote:


I am running a simple Python script to test my Condor configuration,
obviously in the vanilla universe. It simply computes the value of pi
for a while, times itself, and prints what machine it's on. I made it
run for a while so that I would have a chance to monitor it on the
remote machines.


The desired functionality is this: if a job fails (dies due to some
exception or failure), Condor should restart it from scratch. I have
noticed that in the standard universe, Condor can do this. Can it also
be done in the vanilla universe? What are the limitations? For the end
task it is unlikely that I will be able to relink the code, as it is
legacy material and there are few people around who know enough Pascal
to know what it's doing. So I would like to be able to support this
functionality in the vanilla universe.



Condor automatically restarts all jobs that don't complete due to a Condor failure or being kicked off a machine. The standard universe does one better by restarting the jobs where they left off, instead of at the beginning.

You can also tell Condor to restart jobs that complete on their own
with the on_exit_remove expression.

+----------------------------------+---------------------------------+
| Jaime Frey                       | Public Split on Whether         |
| jfrey@xxxxxxxxxxx                | Bush Is a Divider               |
| http://www.cs.wisc.edu/~jfrey/   | -- CNN Scrolling Banner         |
+----------------------------------+---------------------------------+



_______________________________________________ Condor-users mailing list Condor-users@xxxxxxxxxxx https://lists.cs.wisc.edu/mailman/listinfo/condor-users




