
Re: [Condor-users] Job Resubmission

On 10/05/2012 07:22 AM, Giles Wright wrote:
Hi - Just getting started with Condor. We're looking at running jobs on
a local network of Linux machines with the potential of adding in a
connection to an Amazon VPC in the future. Have set up a couple of
virtual condor pools at 2 separate locations and have configured
flocking between them.

My question right now is how Condor deals with resubmitting failed jobs.
For example, should a machine die during execution of a job, it seems that
the job terminates as you would expect, however we would like any failed
jobs to be resubmitted to another available machine in the pool. Should
we be looking at Condor-G (not started there yet) and condor_resubmit?



Condor provides more functionality around the resubmission use case than most other schedulers, and the default policy is set up in such a way that most Condor folks don't ever think about "resubmission."

Condor will keep your job in the queue (managed by the condor_schedd) until the policy attached to the job says otherwise.

The default policy says a job will be run as many times as necessary for the job to terminate. So if the machine a job is running on crashes (or, more generally, becomes unavailable), the condor_schedd will automatically try to run the job on another machine.

When you start changing the default policy you can control things such as:

- if a job should be removed after a period of time, even if it is running or only if it hasn't started running
- if a job should run multiple times even if it terminated cleanly
- if a termination with an error should make the job run again, be held in the queue for inspection, or be removed from the queue
- if a job held for inspection should be held forever or for a specific amount of time
- if a job should only start running at a specific time in the future, or be run at repeated intervals
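
As a rough sketch of what a few of these controls can look like in a submit description file: the executable and file names below are placeholders, the thresholds are arbitrary, and the expressions are illustrations rather than a recommended policy. Check the condor_submit manual page for your Condor version before relying on any of them.

```
# Hypothetical job; the executable and file names are placeholders.
universe   = vanilla
executable = my_job.sh
output     = my_job.out
error      = my_job.err
log        = my_job.log

# Leave the queue only on a clean exit (no signal, exit code 0).
# Any other exit sends the job back to the queue to run again.
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

# Alternative: hold failed jobs for inspection instead of rerunning.
# on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)

# Release a held job (JobStatus 5) after it has been held for an hour.
periodic_release = (JobStatus == 5) && ((time() - EnteredCurrentStatus) > 3600)

# Remove the job if it is still idle (JobStatus 1) a day after submission.
periodic_remove = (JobStatus == 1) && ((time() - QDate) > 86400)

# Time-based starts (commented out; values are arbitrary examples):
# deferral_time = 1349704800    # start at this Unix timestamp
# cron_minute   = 0             # or: run at minute 0 ...
# cron_hour     = 2             # ... of hour 2, every day

queue
```

None of this is needed for the machine-crash case in the original question; a job submitted without any of these policy commands already goes back to the queue and is matched to another machine.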