
Re: [HTCondor-users] A lot of jobs in C state



Hello again,

It seems the issue was related to JOB_IS_FINISHED_INTERVAL. It was set to 10 seconds, but the jobs stayed in the queue much longer, as I mentioned in my previous email. After removing JOB_IS_FINISHED_INTERVAL from the Schedd config, everything seems to work correctly again. There have been no more "actOnJobs: didn't do any work, aborting" messages in the Schedd log for the last 7 hours.

I don't know whether I have misunderstood the JOB_IS_FINISHED_INTERVAL macro. We're running HTCondor 8.6.5 and, as far as we know, there were no issues related to JOB_IS_FINISHED_INTERVAL in previous versions, such as 8.5.8.

Thank you very much.

Cheers,

Carles

On 1 September 2017 at 14:57, Carles Acosta <cacosta@xxxxxx> wrote:
Dear all,

We have been facing a weird issue with our condor pool (8.6.5) with HTCondor-CE (2.1.2). There was a huge number of jobs in the Completed state with LeaveJobInQueue evaluating to True. To my surprise, it seems to have been magically solved after a service condor restart (though it's not the first condor restart I've tried).

We noticed that, for the past 3 days, there were a lot of jobs in "C" state without StageOutStart or Finish attributes. There were no clear error messages in the HTCondor-CE logs, so we turned to checking our HTCondor pool. There were several "actOnJobs: didn't do any work, aborting" messages in the Schedd log. After this morning's restart, the messages disappeared and the jobs are leaving the queue normally once completed (JOB_IS_FINISHED_INTERVAL is 10 seconds).
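For anyone who wants to track how often that message shows up, here is a minimal Python sketch that counts the lines of a log file containing it. The helper names (count_message, count_in_file) are made up for illustration, and the location of the Schedd log is site-specific, so the path is left as a parameter rather than assumed.

```python
from typing import Iterable

# Message text exactly as quoted from the Schedd log in this thread.
NEEDLE = "actOnJobs: didn't do any work, aborting"


def count_message(lines: Iterable[str], needle: str = NEEDLE) -> int:
    """Return how many lines contain the given message text."""
    return sum(1 for line in lines if needle in line)


def count_in_file(path: str, needle: str = NEEDLE) -> int:
    """Count matching lines in a log file (hypothetical helper).

    The path is an assumption left to the caller; undecodable bytes
    are replaced so a stray character does not abort the scan.
    """
    with open(path, errors="replace") as log:
        return count_message(log, needle)
```

Calling count_in_file() on the same log before and after a restart or a config change gives a rough idea of whether the messages have really stopped. It only matches the text, so it says nothing about when each message was written.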

The only change in our pool configuration was adding NEGOTIATOR_SLOT_CONSTRAINT = ( !RegExp("td515", Machine) ) to keep the td515 machine out of negotiation while it was being installed. That line has now been deleted, but I don't think it is related...

Right now, I can see that the "actOnJobs: didn't do any work, aborting" messages are appearing again...

Any ideas?

Thank you very much.

Cheers,

Carles