
Re: [HTCondor-users] Condor held jobs should retry/release after certain configured timeout automatically



Hi,

Please see my comments inline:

On Tue, Apr 7, 2015 at 7:55 PM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:
Hi Sridhar,

The configuration seems reasonable. However, we'd need more context to know if it's working as expected.

1) Did you run condor_reconfig after changing the configuration?
I restarted condor using condor_restart. This should refresh config values, right?
2) Can you give an example classad of a job you think should be released under this policy?
I submitted a grid job with an invalid AMI ID. When the AMI ID is not valid, the job goes into the held state. In that case, it should be retried the configured number of times. Does that make sense?

I actually want to use SYSTEM_PERIODIC_RELEASE to release jobs that go into the held state because of a service unavailable error from Amazon. I am using the test above to validate my configuration, since it is not possible to reproduce the service unavailable error condition right now.


Thanks,

Brian

On Apr 7, 2015, at 8:40 AM, Sridhar Thumma <deadman.den@xxxxxxxxx> wrote:

Hi,

I added SYSTEM_PERIODIC_RELEASE to my configuration (/etc/condor/config.d/00personal_condor.config), but it does not seem to be releasing any jobs.

Can you please check the configuration and let me know if anything is wrong?

SYSTEM_PERIODIC_RELEASE = (JobRunCount < 5 && (time() - EnteredCurrentStatus) > 600)
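
To make the intent of the expression above easier to check, here is a minimal Python sketch of the same condition. It is only a model of the logic, not HTCondor code: the function name should_release and its parameter names are hypothetical, and it assumes both JobRunCount and EnteredCurrentStatus are defined integers in the job ad.

```python
import time


def should_release(job_run_count, entered_current_status, now=None,
                   max_runs=5, min_hold_seconds=600):
    """Model of: JobRunCount < 5 && (time() - EnteredCurrentStatus) > 600.

    job_run_count and entered_current_status stand in for the job ad
    attributes of the same meaning; all names here are illustrative.
    """
    if now is None:
        now = int(time.time())
    # Release only if the job has run fewer than max_runs times and has
    # been in its current state strictly longer than min_hold_seconds.
    return (job_run_count < max_runs
            and (now - entered_current_status) > min_hold_seconds)


if __name__ == "__main__":
    now = 1_000_000
    print(should_release(0, now - 601, now))  # held just over 10 minutes
    print(should_release(0, now - 600, now))  # exactly 10 minutes: not yet
    print(should_release(5, now - 900, now))  # retry cap reached
```

Note that the comparison is strict, so a job held for exactly 600 seconds is not released yet, and a job whose run count has reached 5 stays held regardless of how long it waits.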

On Tue, Feb 24, 2015 at 7:56 PM, Ben Cotton <ben.cotton@xxxxxxxxxxxxxxxxxx> wrote:
On Tue, Feb 24, 2015 at 7:43 AM, Sridhar Thumma <deadman.den@xxxxxxxxx> wrote:

> I am using condor grid submit files for launching EC2 instances. Sometimes,
> when condor tries to launch instances, it gets InstanceLimitExceeded from
> AWS. Because of this, the condor jobs go into the held state.
>
> Is there a way to avoid this scenario?

One solution is to request a limit increase from AWS (this may or may
not be desirable from a business perspective).

> Or is there a configuration variable to retry/release held jobs after a
> certain time period, so that condor tries again and sees whether the job
> can run?
>
There are several periodic expressions that might help. For example,
periodic_release defines when a job will be released
(SYSTEM_PERIODIC_RELEASE would apply to all jobs). In this case, you
might set a job to release after 10 minutes:

 periodic_release = (CurrentTime - EnteredCurrentStatus > 600)

See the condor_submit man page [1] and the schedd configuration
settings [2] for more details:

[1] http://research.cs.wisc.edu/htcondor/manual/v8.2/condor_submit.html
[2] http://research.cs.wisc.edu/htcondor/manual/v8.2/3_3Configuration.html#SECTION004311000000000000000


Thanks,
BC

--
Ben Cotton
main: 888.292.5320

Cycle Computing
Better Answers. Faster.

http://www.cyclecomputing.com
twitter: @cyclecomputing
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
