
Re: [Condor-users] Job Resubmission



Thanks for the quick response Matt.

I think I was just being impatient. After I simulated a machine crash and left the jobs alone for a while, they "flocked" over to an idle machine in our second pool. It took perhaps 10-15 minutes for this to happen, but at least it's doing what it should.

Thanks,
Giles


-----Original Message-----
From: Matthew Farrellee [mailto:matt@xxxxxxxxxx]
Sent: 05 October 2012 12:47
To: Condor-Users Mail List
Cc: Giles Wright
Subject: Re: [Condor-users] Job Resubmission

On 10/05/2012 07:22 AM, Giles Wright wrote:
> Hi - Just getting started with Condor. We're looking at running jobs
> on a local network of Linux machines with the potential of adding in a
> connection to an Amazon VPC in the future. Have set up a couple of
> virtual condor pools at 2 separate locations and have configured
> flocking between them.
>
> My question right now is how Condor deals with resubmitting failed jobs.
> For example, should a machine die during execution of a job, it seems
> that the job terminates as you would expect, however we would like any
> failed jobs to be resubmitted to another available machine in the
> pool. Should we be looking at Condor-G (not started there yet) and condor_resubmit?
>
> Thanks,
> Giles

Giles,

Condor provides more functionality around the resubmission use case than most other schedulers, and the default policy is set up in such a way that most Condor folks never have to think about "resubmission."

Condor will keep your job in the queue (condor_schedd managed) until the policy attached to the job says otherwise.

The default policy says a job will be run as many times as necessary for the job to terminate. So if the machine a job is running on crashes (or, more generally, becomes unavailable), the condor_schedd will automatically try to run the job on another machine.

When you start changing the default policy you can control things such as:

- if a job should be removed after a period of time, even if it is running or only if it hasn't started running;
- if a job should run multiple times even if it terminated cleanly;
- if a termination with an error should make the job run again, be held in the queue for inspection, or be removed from the queue;
- if a job held for inspection should be held forever or for a specific amount of time;
- if a job should only start running at a specific time in the future, or be run at repeated intervals.
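
As a rough illustration of the policy controls listed above, here is a minimal sketch of a submit description file. The executable name, file names, and time thresholds are hypothetical and chosen only for illustration; the expressions show one possible policy, not a recommended one, so check them against the manual for your Condor version before relying on them.

```
# Hypothetical job; my_job.sh and the log/output names are placeholders.
universe   = vanilla
executable = my_job.sh
log        = my_job.log
output     = my_job.out
error      = my_job.err

# Hold the job for inspection if it was killed by a signal.
on_exit_hold = (ExitBySignal == True)

# Otherwise leave the queue only after a clean exit with code 0;
# any other exit puts the job back in the queue to run again.
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

# Release a held job after 30 minutes (assumed threshold).
# Note that this applies to any hold, not only the on_exit_hold above.
periodic_release = (time() - EnteredCurrentStatus) > 1800

# Remove a job that has been in the queue for more than a day (assumed
# threshold) and is not currently running (JobStatus 1 means idle).
periodic_remove = (JobStatus == 1) && ((time() - QDate) > 86400)

queue
```

With no such expressions in the submit file, the default behaviour described earlier applies: the job stays in the queue and is rerun until it terminates.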

Best,


matt


