
Re: [HTCondor-users] A lot of jobs in C state



Hi Todd,

Thank you very much. Yes, you're right, I did not set any JOB_IS_FINISHED_COUNT value. I did not notice any strange behaviour running older HTCondor versions, maybe because I had fewer jobs finishing at the same time... I don't know. I started using JOB_IS_FINISHED_INTERVAL a long time ago with our initial HTCondor test pool, and it seems that I missed the JOB_IS_FINISHED_COUNT macro in the updates.

Well, I think that you've solved my mystery :)

Thank you again.

Cheers,

Carles

On 6 September 2017 at 00:04, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 9/4/2017 11:06 AM, Carles Acosta wrote:
Hello again,

It seems that the issue was related to JOB_IS_FINISHED_INTERVAL. It was set to 10 seconds, but the jobs stayed in the queue for much longer, as I mentioned in my previous email. After removing JOB_IS_FINISHED_INTERVAL from the Schedd config, everything seems to work correctly again. There have been no more "actOnJobs: didn't do any work, aborting" messages in the Schedd for the last 7 hours.

I don't know if I misunderstand the JOB_IS_FINISHED_INTERVAL macro. We're running HTCondor 8.6.5 and, as far as we know, there were no issues related to JOB_IS_FINISHED_INTERVAL with previous versions, such as 8.5.8.

Thank you very much.

Cheers,

Carles


Hi Carles,

Thanks for the follow-up post above.

You stated you had problems when your config had
  JOB_IS_FINISHED_INTERVAL=10

What was JOB_IS_FINISHED_COUNT set to be?

If you change JOB_IS_FINISHED_INTERVAL to be 10, and don't also set JOB_IS_FINISHED_COUNT, the result is that the schedd will only allow one job to leave the queue every 10 seconds!! I am guessing this is the situation you encountered. Basically these two config knobs should always be changed together; see the cut-n-paste from the manual below. Note the default for JOB_IS_FINISHED_INTERVAL is 0, which is the same as not defining it, i.e. the default configuration works. I am curious where/how you ended up with a setting of JOB_IS_FINISHED_INTERVAL=10 without a corresponding JOB_IS_FINISHED_COUNT setting. I checked with OSG and the configuration they ship for the HTCondor-CE does not change either of these knobs.

Hope this helps
Todd

From the Manual:

JOB_IS_FINISHED_COUNT
  An integer value representing the number of jobs that the condor_schedd will let permanently leave the job queue each time that it examines the jobs that are ready to do so. The default value is 1.

JOB_IS_FINISHED_INTERVAL
  The condor_schedd maintains a list of jobs that are ready to permanently leave the job queue, for example, when they have completed or been removed. This integer-valued macro specifies a delay in seconds between instances of taking jobs permanently out of the queue. The default value is 0, which tells the condor_schedd to not impose any delay.
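
As a minimal sketch of changing the two knobs together (the value 100 is an illustrative assumption, not a recommendation from the manual or from this thread):

```
# Schedd config: take finished jobs out of the queue every 10 seconds ...
JOB_IS_FINISHED_INTERVAL = 10
# ... and let up to 100 jobs leave on each pass (assumed value; tune for your pool)
JOB_IS_FINISHED_COUNT = 100
```

Going by the manual text above, with the interval at 10 and the count left at its default of 1, at most about 6 jobs per minute can leave the queue, so a burst of finished jobs would sit in the C state for a long time. Leaving both knobs undefined keeps the defaults, which impose no delay.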