Re: [Condor-users] Job Resubmission
- Date: Fri, 05 Oct 2012 07:47:05 -0400
- From: Matthew Farrellee <matt@xxxxxxxxxx>
- Subject: Re: [Condor-users] Job Resubmission
On 10/05/2012 07:22 AM, Giles Wright wrote:
Hi - Just getting started with Condor. We're looking at running jobs on
a local network of Linux machines with the potential of adding in a
connection to an Amazon VPC in the future. Have set up a couple of
virtual condor pools at 2 separate locations and have configured
flocking between them.
My question right now is how Condor deals with resubmitting failed jobs.
For example, should a machine die during execution of a job, it seems that
the job terminates as you would expect, however we would like any failed
jobs to be resubmitted to another available machine in the pool. Should
we be looking at Condor-G (not started there yet) and condor_resubmit?
Condor provides more functionality around the resubmission use case than
most other schedulers. And the default policy is set up in such a way
that most Condor folks don't ever think about "resubmission."
Condor will keep your job in the queue (condor_schedd managed) until the
policy attached to the job says otherwise.
The default policy says a job will be run as many times as necessary for
the job to terminate. So if the machine a job is running on crashes
(generally, becomes unavailable), the condor_schedd will automatically
try to run the job on another machine.
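As a rough illustration of that default, here is a minimal sketch of a submit description file. The executable and file names are placeholders, not anything from your setup. Nothing in it asks for resubmission after a machine failure, since that is the default behaviour. The commented-out on_exit_remove line is an optional extra, assuming you would also want a rerun when the job itself exits with a non-zero code.

```
# Hypothetical submit description file; my_job.sh and the output
# file names are placeholders.
universe   = vanilla
executable = my_job.sh
output     = job.$(Cluster).$(Process).out
error      = job.$(Cluster).$(Process).err
log        = job.$(Cluster).log

# No setting is needed for a rerun after the execute machine goes away:
# the job stays in the condor_schedd queue and is matched again.

# Optional (an assumption about what you might want): leave the queue
# only after a clean exit with code 0; otherwise the job runs again.
# on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

queue
```

Submitting this with condor_submit and watching the log file should show the job being evicted and started again if its machine disappears mid-run.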
When you start changing the default policy you can control things such
as: if a job should be removed after a period of time, even if it is
running or only if it hasn't started running; if a job should run
multiple times even if it terminated cleanly; if a termination with an
error should make the job run again, be held in the queue for
inspection, or be removed from the queue; if a job held for inspection
should be held forever or for a specific amount of time; if a job should
only start running at a specific time in the future, or be run at