
Re: [Condor-users] Jobs lingering in queue if target shuts down mid-job



I'm running 7.6.4, Todd, but thanks for the reply.

Is TCP_KEEPALIVE a Condor configuration value? And is it safe to lower it (to 30 minutes), or will that have any adverse effects?

Thanks
________________________________________
From: condor-users-bounces@xxxxxxxxxxx [condor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum [tannenba@xxxxxxxxxxx]
Sent: 22 November 2011 16:48
To: Condor-Users Mail List
Subject: Re: [Condor-users] Jobs lingering in queue if target shuts down mid-job

Lukas Slebodnik wrote:
> Hi Thomass,
>
> I think that decreasing values of variables MAX_CLAIM_ALIVES_MISSED
> and ALIVE_INTERVAL will help you.
>
> Details in manual:
> http://research.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#param:AliveInterval
>
> http://research.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#param:MaxClaimAlivesMissed
>
>
> Regards, Lukas
>
> On Tue, Nov 22, 2011 at 01:59:01PM +0000, Thomas Luff wrote:
>> If a target machine shuts down or crashes whilst a job is running
>> on it, the job will hang around in the queue with the status
>> 'Running'.
>>
>> Even if the machine is shut down and left off, the job still acts as
>> if it's running, and it has been like this for over an hour now.
>>
>> Is it possible to make these jobs automatically fail or requeue if
>> the target machine goes down?
>>
>> Thanks

Thoughts:

- In the event of an execute machine crashing, the job should
automatically requeue after a maximum of 2 hrs (the TCP KEEPALIVE
timeout). We plan to change this limit from 2 hrs to the job lease
duration (default of 20 minutes) in a future developer series release.

- Are you running the developer series of Condor? There was a bug
introduced in Condor v7.7.2 that could result in the job staying running
indefinitely, depending on your configuration. We have it fixed for the
upcoming v7.7.4 release. Details at
   https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2591
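
For context on the 2 hr figure in the first point: TCP keepalive is an
operating-system socket feature, and on Linux the idle time before the
first probe defaults to 7200 seconds. The following is a minimal Python
sketch of how a program might enable keepalive on one socket and shorten
that idle time. It is illustration only, not Condor code, and says
nothing about how Condor configures its own sockets; the function name
enable_keepalive and the 1800-second value are assumptions made up for
this example.

```python
import socket


def enable_keepalive(sock, idle_seconds=1800):
    """Enable TCP keepalive on one socket (hypothetical helper).

    Returns True if the SO_KEEPALIVE option reads back as enabled.
    """
    # Ask the OS to send keepalive probes on this connection when it is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # TCP_KEEPIDLE (idle seconds before the first probe) is platform-specific.
    # It exists on Linux; on platforms without it the OS default stays in effect.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)

    # Some platforms report a non-zero value other than 1, so compare with 0.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
keepalive_on = enable_keepalive(sock)
sock.close()
```

A peer that crashes or loses power never closes its end of the
connection, so the surviving side only learns about it when keepalive
probes go unanswered. That is why the idle time bounds how long a dead
execute machine can go unnoticed.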

regards,
Todd
