
Re: [HTCondor-users] Condor held jobs should retry/release after certain configured timeout automatically



On Tue, Feb 24, 2015 at 7:43 AM, Sridhar Thumma <deadman.den@xxxxxxxxx> wrote:

> I am using condor grid submit files for launching ec2 instances. Sometimes,
> when condor is trying to launch instances, it is getting
> InstanceLimitExceeded from aws. Due to this, condor jobs are going into held
> state.
>
> Is there a way to avoid this scenario?

One solution is to request a limit increase from AWS (this may or may
not be desirable from a business perspective).

> Or do we have any configuration variable to retry/release held jobs
> after a certain time period, so that it will try again and see whether
> it is able to execute or not?
>
There are several periodic expressions that might help. For example,
the periodic_release submit command defines when a held job will be
released (the SYSTEM_PERIODIC_RELEASE configuration setting would apply
to all jobs). In this case, you might release a job once it has been
held for 10 minutes:

  periodic_release = (CurrentTime - EnteredCurrentStatus > 600)
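
If the instance limit stays exceeded for a long time, that expression
will keep releasing the job indefinitely. A rough sketch of a capped
version follows. It assumes the NumSystemHolds job attribute is being
incremented for your EC2 grid jobs (worth confirming with
"condor_q -long" on a held job), and the limit of 5 is an arbitrary
number picked for illustration:

  # release after 10 minutes on hold, but only for the first 5 holds
  periodic_release = (NumSystemHolds <= 5) && (CurrentTime - EnteredCurrentStatus > 600)

The pool-wide variant would go in the schedd's configuration rather
than in the submit file. Keep in mind that it is evaluated for every
job in the queue, so you may want a narrower expression there:

  SYSTEM_PERIODIC_RELEASE = (CurrentTime - EnteredCurrentStatus > 600)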

See the condor_submit man page [1] and the schedd configuration
settings [2] for more details:

[1] http://research.cs.wisc.edu/htcondor/manual/v8.2/condor_submit.html
[2] http://research.cs.wisc.edu/htcondor/manual/v8.2/3_3Configuration.html#SECTION004311000000000000000


Thanks,
BC

-- 
Ben Cotton
main: 888.292.5320

Cycle Computing
Better Answers. Faster.

http://www.cyclecomputing.com
twitter: @cyclecomputing