Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Getting Failed Jobs to Restart

Date: Fri, 26 Aug 2005 11:30:23 -0500
From: Jaime Frey <jfrey@xxxxxxxxxxx>
Subject: Re: [Condor-users] Getting Failed Jobs to Restart

SIGQUIT and SIGKILL are treated specially in the standard universe. Condor interprets them as meaning the job was kicked off a machine with or without a checkpoint, respectively. Any other signal results in the job leaving the queue. In the vanilla universe, every signal results in the job leaving the queue.

With on_exit_remove, you can control when Condor lets the job leave the queue. For example, you can tell Condor to leave the job in the queue if it terminates with exit code 3. The expression is limited to information available in the job ad.
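
As a rough sketch of the exit-code-3 example, a vanilla universe submit description file might look like the one below. The executable and file names are placeholders, and the expression assumes the ExitBySignal and ExitCode job ad attributes behave as described in the condor_submit manual page for your version, so treat it as a starting point rather than a tested recipe.

```
# Sketch only: my_job and the file names below are hypothetical.
universe       = vanilla
executable     = my_job
output         = my_job.out
error          = my_job.err
log            = my_job.log

# The job leaves the queue when this expression is True.
# Here it stays in the queue (and is run again) only when it
# exited on its own with exit code 3.
on_exit_remove = (ExitBySignal == True) || (ExitCode != 3)

queue
```

To also keep a job that was killed by a signal (for example, one that aborted on an uncaught C++ exception), an expression along the lines of (ExitBySignal == False) && (ExitCode == 0) should remove the job only after a clean exit.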

 -- Jaime

On Aug 26, 2005, at 10:30 AM, Avi Flamholz wrote:

Yes, but what if the job fails due to some internal exception, i.e. an
uncaught exception in C++? Can Condor do anything there? I noticed
that in the standard universe, if I sent a running process SIGKILL,
Condor would register it as idle and then restart it in the next
negotiation cycle.
Can this be done in the standard universe?
Thanks for the response, Jaime.
-Avi
On 8/26/05, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
On Aug 25, 2005, at 6:11 PM, Avi Flamholz wrote:
I am running a simple Python script to test my Condor configuration, obviously in the vanilla universe. It simply computes the value of pi for a while, times itself, and prints what machine it's on. I made it run for a while so that I would have a chance to monitor it on the remote machines.

The desired functionality is this: if a job fails (dies due to some exception or failure), Condor should restart it from scratch. I have noticed that in the standard universe, Condor can do this. Can it also be done in the vanilla universe? What are the limitations? For the end task it is unlikely that I will be able to relink the code, as it is legacy material and there are few people around who know enough Pascal to know what it's doing. So I would like to be able to support this functionality in the vanilla universe.
Condor automatically restarts all jobs that don't complete due to a
Condor failure or being kicked off a machine. The standard universe
does one better by restarting the jobs where they left off, instead
of at the beginning.
You can also tell Condor to restart jobs that complete on their own
with the on_exit_remove expression.
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users


+----------------------------------+---------------------------------+
|            Jaime Frey            |  Public Split on Whether        |
|        jfrey@xxxxxxxxxxx         |  Bush Is a Divider              |
|  http://www.cs.wisc.edu/~jfrey/  |         -- CNN Scrolling Banner |
+----------------------------------+---------------------------------+

