
Re: [HTCondor-users] high schedd loading every 20mins

On 12/12/2016 5:02 AM, jiangxw@xxxxxxxxxxxxxxx wrote:
Dear all,
    In our condor cluster, we added more resources into condor about 10000.
    After that, every 20 mins there are some shadows killed by condor.
    How can I deal with that?

Jiang Xiaowei

Hi Jiang,

HTCondor should have no problems with 10,000 slots with its default configuration.

To get helpful responses from this email list, please do more work upfront and include information that helps with troubleshooting. For instance, providing the following sort of information will greatly improve the chances that someone can help you find the problem:

1. What version of HTCondor are you running?  What platform are you using?

2. What does the ShadowLog say is happening at the time shadows are "killed"?

3. What does the SchedLog say is happening at the time shadows are killed?

4. You say there is high schedd load - what makes you think this, i.e. what is the load average on your submit machine? What is the output from this command:

  condor_status -schedd -af name "floor(RecentDaemonCoreDutyCycle*100)"

This condor_status command will report how busy all the schedds in your pool are; numbers higher than 98 could be cause for concern, and anything lower means the schedd is doing just fine.
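To make the threshold in point 4 easier to apply across many schedds, the output of that command can be filtered. The following is only a rough sketch: flag_busy_schedds is a hypothetical helper name, and it assumes the output is one "name percent" pair per line, as the -af format above should print.

```shell
# Flag schedds whose duty-cycle percentage exceeds a limit (default 98).
# Reads "name percent" lines on stdin. flag_busy_schedds is a hypothetical
# helper for illustration, not part of HTCondor.
flag_busy_schedds() {
    awk -v limit="${1:-98}" '
        # Skip lines whose second field is not a plain integer (e.g. "undefined").
        $2 ~ /^[0-9]+$/ && ($2 + 0) > (limit + 0) { print $1 " is busy: " $2 "%" }
    '
}

# Typical use on a machine that can query the pool:
#   condor_status -schedd -af name "floor(RecentDaemonCoreDutyCycle*100)" | flag_busy_schedds
```

If nothing is printed, no schedd is above the limit.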

5. Did you customize your condor_config? Could you summarize what changes you made, especially on the submit or execute nodes? Maybe that is the source of problems.

6. The only time HTCondor would actually kill a shadow process is if it thinks that shadow process is unresponsive. The most likely reason for this is that the shadows are trying to append to job event logs (i.e. log=/home/file in the submit file) that reside on a shared filesystem like NFS, and that shared filesystem is overwhelmed and blocking all clients (like the shadow). Could you please grep for the string "appears hung" in the SchedLog and MasterLog on your submit machine?
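The grep in point 6 could look something like the sketch below. count_hung is a hypothetical helper name, and the usage lines assume the logs keep their default file names under the directory reported by condor_config_val LOG; adjust the paths if your configuration differs.

```shell
# Count lines containing "appears hung" in each given log file.
# count_hung is a hypothetical helper for illustration.
count_hung() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            # grep -c prints the number of matching lines (0 if none).
            printf '%s: %s\n' "$f" "$(grep -c 'appears hung' "$f")"
        else
            printf '%s: not readable\n' "$f"
        fi
    done
}

# Typical use on the submit machine (assumed default log names):
#   logdir=$(condor_config_val LOG)
#   count_hung "$logdir/SchedLog" "$logdir/MasterLog"
```

A non-zero count is worth following up by looking at the matching lines and their timestamps.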

7. Can you provide us with instructions on how to reproduce the problems you are observing?

8. Is the problem happening all the time, or only sporadically? Can you correlate it with whatever triggers the problem? For instance, perhaps it only happens when a large batch of very short jobs (less than a second) appears?

Hope the above helps.