
Re: [HTCondor-users] Condor held jobs should retry/release after certain configured timeout automatically




On Apr 7, 2015, at 9:42 AM, Sridhar Thumma <deadman.den@xxxxxxxxx> wrote:

Hi,

Please see my comments inline:

On Tue, Apr 7, 2015 at 7:55 PM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:
Hi Sridhar,

The configuration seems reasonable.  However, we'd need more context to know if it's working as expected.

1) Did you run condor_reconfig after changing the configuration?
I restarted condor using condor_restart. This should refresh config values, right?

Yup, that should be fine.

 
2) Can you give an example classad of a job you think should be released under this policy?
I submitted a grid job with an invalid AMI ID. When the AMI ID is invalid, the job goes into the held state. In that case, it should be retried the configured number of times. Does that make sense?

I actually want to use SYSTEM_PERIODIC_RELEASE to release jobs that go into the held state because of a service-unavailable error from Amazon. I am using the test above to validate my configuration, since I cannot reproduce the service-unavailable condition right now.



Yes - I understood this part.  However, to understand why it's not doing what you think it should, we'd need to actually see the classad.
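
For reference, a rough sketch of commands that would surface that information, assuming the standard HTCondor command-line tools are available on the submit machine. The job ID 123.0 is a placeholder, not a job from this thread:

```shell
# Show the SYSTEM_PERIODIC_RELEASE expression as read from the configuration files
condor_config_val SYSTEM_PERIODIC_RELEASE

# List held jobs together with their hold reasons
condor_q -hold

# Dump the full ClassAd of one held job (123.0 is a placeholder job ID)
condor_q -long 123.0
```

The output of the last command is what the release expression is evaluated against, so comparing its attributes with the expression is one way to see why a job is or is not being released.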

Brian