
Re: [HTCondor-users] A lot of jobs in C state

On 9/4/2017 11:06 AM, Carles Acosta wrote:
Hello again,

It seems that the issue was related to JOB_IS_FINISHED_INTERVAL. It was set to 10 seconds, but the jobs stayed in the queue for much longer, as I mentioned in my previous email. After removing JOB_IS_FINISHED_INTERVAL from the Schedd config, everything seems to work correctly again. There have been no more "actOnJobs: didn't do any work, aborting" messages from the Schedd for the last 7 hours.

I don't know whether I misunderstood the JOB_IS_FINISHED_INTERVAL macro. We're running HTCondor 8.6.5, and as far as we know there were no issues related to JOB_IS_FINISHED_INTERVAL with previous versions, such as 8.5.8.

Thank you very much.



Hi Carles,

Thanks for the follow-up post above.

You stated you had problems when your config had JOB_IS_FINISHED_INTERVAL = 10.

What was JOB_IS_FINISHED_COUNT set to?

If you change JOB_IS_FINISHED_INTERVAL to 10 and don't also set JOB_IS_FINISHED_COUNT, the result is that the schedd will only allow one job to leave the queue every 10 seconds! I am guessing this is the situation you encountered. Basically, these two config knobs should always be changed together; see the cut-n-paste from the manual below. Note that the default for JOB_IS_FINISHED_INTERVAL is 0, which is the same as not defining it, i.e. the default configuration works.

I am curious where/how you ended up with a setting of JOB_IS_FINISHED_INTERVAL = 10 without a corresponding JOB_IS_FINISHED_COUNT setting. I checked with OSG, and the configuration they ship for the HTCondor-CE does not change either of these knobs.
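To make the pairing advice above concrete, here is a minimal, hypothetical Python sketch that scans a config snippet and flags JOB_IS_FINISHED_INTERVAL being set without JOB_IS_FINISHED_COUNT, then reports the resulting maximum removal rate. The helper names (`parse_simple_config`, `check_removal_knobs`) are made up for illustration, and it assumes plain `NAME = value` lines only; real HTCondor config files also support macros, conditionals, and includes, which this sketch ignores.

```python
"""Hypothetical sanity check for the schedd job-removal knobs.

Assumes a simplified 'NAME = value' config format; real HTCondor config
files support much more (macro expansion, conditionals, includes), which
this sketch does not handle.
"""

DEFAULT_INTERVAL = 0  # seconds between removal passes, per the manual
DEFAULT_COUNT = 1     # jobs allowed to leave per pass, per the manual


def parse_simple_config(text: str) -> dict[str, str]:
    """Parse plain 'NAME = value' lines, skipping blanks and # comments."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Config names are treated as case-insensitive here, so normalize them.
        settings[name.strip().upper()] = value.strip()
    return settings


def check_removal_knobs(settings: dict[str, str]) -> list[str]:
    """Return notes about the JOB_IS_FINISHED_INTERVAL/COUNT pairing."""
    notes: list[str] = []
    interval = int(settings.get("JOB_IS_FINISHED_INTERVAL", DEFAULT_INTERVAL))
    count = int(settings.get("JOB_IS_FINISHED_COUNT", DEFAULT_COUNT))
    if interval > 0:
        if "JOB_IS_FINISHED_COUNT" not in settings:
            notes.append(
                "JOB_IS_FINISHED_INTERVAL is set but JOB_IS_FINISHED_COUNT is not"
            )
        # Upper bound: `count` jobs per pass, one pass every `interval` seconds.
        notes.append(f"at most {count / interval:g} job(s)/second can leave the queue")
    return notes
```

Fed the setting described in this thread (`JOB_IS_FINISHED_INTERVAL = 10` with no COUNT line), the check emits both notes, with a rate of 0.1 jobs per second. With the defaults (nothing set) it returns no notes, matching the point that the default configuration works.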

Hope this helps

From the manual:

JOB_IS_FINISHED_COUNT
An integer value representing the number of jobs that the condor_schedd will let permanently leave the job queue each time that it examines the jobs that are ready to do so. The default value is 1.

JOB_IS_FINISHED_INTERVAL
The condor_schedd maintains a list of jobs that are ready to permanently leave the job queue, for example, when they have completed or been removed. This integer-valued macro specifies a delay in seconds between instances of taking jobs permanently out of the queue. The default value is 0, which tells the condor_schedd to not impose any delay.
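The two manual definitions above combine into a simple pacing model: each pass lets up to JOB_IS_FINISHED_COUNT jobs leave, and passes are JOB_IS_FINISHED_INTERVAL seconds apart. The Python sketch below encodes only that model; it is not the condor_schedd implementation, and the function names are hypothetical. It ignores scheduling overhead and any other work the schedd does between passes.

```python
"""Simplified model of how the two knobs pace job removal.

Not the condor_schedd implementation: it only encodes the manual's
definitions (up to `count` jobs leave per pass, passes `interval`
seconds apart).
"""


def removal_schedule(
    job_ids: list[int], interval: int = 0, count: int = 1
) -> list[tuple[int, list[int]]]:
    """Return (seconds_after_first_pass, jobs_removed) for each pass."""
    if count < 1:
        raise ValueError("count must be at least 1")
    schedule: list[tuple[int, list[int]]] = []
    for pass_index, start in enumerate(range(0, len(job_ids), count)):
        schedule.append((pass_index * interval, job_ids[start:start + count]))
    return schedule


def drain_seconds(n_jobs: int, interval: int = 0, count: int = 1) -> int:
    """Seconds from the first pass until the last ready job has left."""
    if count < 1:
        raise ValueError("count must be at least 1")
    passes = -(-n_jobs // count)  # ceiling division: number of passes needed
    return max(passes - 1, 0) * interval
```

For a hypothetical backlog of 1,000 completed jobs, `drain_seconds(1000, interval=10)` (COUNT left at its default of 1) gives 9990 seconds, roughly 2.8 hours, which is consistent with jobs lingering in the queue long after they finish. Raising the count alongside the interval, e.g. `drain_seconds(1000, interval=10, count=100)`, brings that down to 90 seconds under the same model.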