[HTCondor-users] A lot of jobs in C state

Dear all,

We have been facing a strange issue with our HTCondor pool (8.6.5) with HTCondor-CE (2.1.2). There was a huge number of jobs in the completed state with LeaveJobInQueue evaluating to True. To my surprise, the problem seemed to be magically solved after a "service condor restart" (although this is not the first condor restart I've tried).

We noticed that, for the last 3 days, a lot of jobs had been sitting in the "C" state without StageOutStart or StageOutFinish attributes. There were no clear error messages in the HTCondor-CE logs, so we turned to our HTCondor pool. There were several "actOnJobs: didn't do any work, aborting" messages in the Schedd log. After this morning's restart, the messages disappeared and the jobs were leaving the queue normally after completing (JOB_FINISHED_INTERVAL is 10 seconds).

The only change in our pool configuration was adding NEGOTIATOR_SLOT_CONSTRAINT = ( !RegExp("td515", Machine) ) to keep the td515 machine out of negotiation while it was being installed. That line has now been removed, but I don't think it is related...
Right now, I can see that the "actOnJobs: didn't do any work, aborting" messages are appearing again...
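
To see when the messages come and go around restarts, something like the following could tally them per hour. This is only a rough sketch: it assumes each Schedd log line starts with a "MM/DD/YY HH:MM:SS" timestamp (the layout may differ depending on the debug flags in use), the function name count_per_hour is just illustrative, and the sample lines are invented for the example, not taken from our logs.

```python
from collections import Counter

# The message text as it appears in the Schedd log.
MESSAGE = "actOnJobs: didn't do any work, aborting"


def count_per_hour(lines):
    """Count lines containing MESSAGE, grouped by (date, hour)."""
    counts = Counter()
    for line in lines:
        if MESSAGE not in line:
            continue
        parts = line.split()
        # Assumed layout: "MM/DD/YY HH:MM:SS <message>"; skip anything else.
        if len(parts) < 2 or ":" not in parts[1]:
            continue
        date, clock = parts[0], parts[1]
        counts[(date, clock.split(":")[0])] += 1
    return counts


# Invented sample lines, only to show the expected input shape.
sample = [
    "11/20/17 09:58:01 actOnJobs: didn't do any work, aborting",
    "11/20/17 09:59:30 actOnJobs: didn't do any work, aborting",
    "11/20/17 10:02:11 Some unrelated schedd message",
    "11/20/17 10:05:42 actOnJobs: didn't do any work, aborting",
]

for (date, hour), n in sorted(count_per_hour(sample).items()):
    print(date, hour, n)
```

In practice the lines would come from the Schedd log file itself (for example, by passing an open file object to count_per_hour) rather than from the sample list.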

Any ideas?

Thank you very much.