Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] high schedd loading every 20mins

Date: Mon, 12 Dec 2016 11:06:36 -0600
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] high schedd loading every 20mins

On 12/12/2016 5:02 AM, jiangxw@xxxxxxxxxxxxxxx wrote:

Dear all,
    In our condor cluster, we added more resources into condor about 10000.
    After that, every 20 mins there are some shadows killed by condor.
    How can I deal with that?

Jiang Xiaowei


Hi Jiang,

HTCondor should have no problems with 10,000 slots with its default configuration.

To get helpful responses from this email list, please do more work upfront to provide information to help troubleshoot. For instance, providing the following sort of information will greatly enhance the chances someone can help you find the problem:


1. What version of HTCondor are you running?  What platform are you using?

2. What does the ShadowLog say is happening at the time shadows are "killed"?


3. What does the SchedLog say is happening at the time shadows are killed?

4. You say there is high schedd load - what makes you think this, i.e. what is the load average on your submit machine? What is the output from this command:


  condor_status -schedd -af name "floor(RecentDaemonCoreDutyCycle*100)"

This condor_status command will report how busy all the schedds in your pool are; numbers higher than 98 could be cause for concern, and anything lower means the schedd is doing just fine.

5. Did you customize your condor_config? Could you summarize what changes you made, especially on the submit or execute nodes? Maybe that is the source of problems.

6. The only time HTCondor would actually kill a shadow process is if it thinks that shadow process is unresponsive. The most likely reason for this is that the shadows are trying to append to job event logs (i.e. log=/home/file in the submit file) that reside on a shared filesystem like NFS, and that shared filesystem is overwhelmed and blocking all clients (like the shadow). Could you please grep for the string "appears hung" in the SchedLog and MasterLog on your submit machine?

7. Can you provide us instructions on how to reproduce the problems you are observing?

8. Is the problem happening all the time, or only sporadically? Can you correlate what triggers the problem? For instance, perhaps it only happens when a large batch of very short jobs (less than a second) appears?
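
If it helps, the checks from items 4 and 6 can be scripted. What follows is only a rough sketch, not an official HTCondor tool. It assumes the condor_status command above prints one "name percent" pair per line, and that you pass in the paths to your own SchedLog and MasterLog (on most installs they live in the directory reported by condor_config_val LOG). The function names and the sample hostnames are made up for illustration.

```python
from pathlib import Path

# Threshold from item 4: duty-cycle percentages above 98 may be a concern.
BUSY_THRESHOLD = 98


def parse_duty_cycles(output):
    """Parse 'name percent' lines (assumed condor_status -af output) into a dict."""
    cycles = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, value = parts
        try:
            cycles[name] = int(value)
        except ValueError:
            continue  # e.g. 'undefined' if the attribute is not advertised
    return cycles


def busy_schedds(cycles, threshold=BUSY_THRESHOLD):
    """Return the names of schedds whose duty cycle exceeds the threshold."""
    return sorted(name for name, pct in cycles.items() if pct > threshold)


def find_hung_lines(log_paths, needle="appears hung"):
    """Return (path, line) pairs for every log line containing the needle."""
    hits = []
    for path in log_paths:
        p = Path(path)
        if not p.is_file():
            continue  # ignore logs that do not exist on this machine
        with p.open(errors="replace") as fh:
            for line in fh:
                if needle in line:
                    hits.append((str(p), line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Made-up sample output; paste the real condor_status output here instead.
    sample = "submit1.example.org 42\nsubmit2.example.org 99\n"
    print(busy_schedds(parse_duty_cycles(sample)))
```

To use it for real, feed parse_duty_cycles the text printed by the condor_status command from item 4, and call find_hung_lines with the full paths of the SchedLog and MasterLog on your submit machine.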


Hope the above helps
Todd
