Re: [HTCondor-users] high schedd loading every 20mins
- Date: Mon, 12 Dec 2016 11:06:36 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
On 12/12/2016 5:02 AM, jiangxw@xxxxxxxxxxxxxxx wrote:
In our condor cluster, we added more resources into condor about 10000.
After that, every 20 mins there are some shadows killed by condor.
How can I deal with that?
HTCondor should have no problems with 10,000 slots with its default
configuration.
To get helpful responses from this email list, please do more work
upfront to provide information to help troubleshoot. For instance,
providing the following sort of information will greatly enhance the
chances someone can help you find the problem:
1. What version of HTCondor are you running? What platform are you using?
2. What does the ShadowLog say is happening at the time shadows are
killed?
3. What does the SchedLog say is happening at the time shadows are killed?
4. You say there is high schedd load - what makes you think this, i.e.
what is the load average on your submit machine? What is the output
from this command:
condor_status -schedd -af name "floor(RecentDaemonCoreDutyCycle*100)"
This condor_status command will report how busy all the schedds in your
pool are; numbers higher than 98 could be cause for concern, and
anything lower means the schedd is doing just fine.
5. Did you customize your condor_config? Could you summarize what
changes you made, especially on the submit or execute nodes? Maybe that
is the source of problems.
6. The only time HTCondor would actually kill a shadow process is if it
thinks that shadow process is unresponsive. The most likely reason for
this is that the shadows are trying to append to job event logs (e.g.
log=/home/file in the submit file) that reside on a shared filesystem
like NFS, and that shared filesystem is overwhelmed and blocking all
clients (like the shadow). Could you please grep for the string
"appears hung" in the SchedLog and MasterLog on your submit machine?
7. Can you provide us instructions on how to reproduce the problems you
are seeing?
8. Is the problem happening all the time, or only sporadically? Can you
correlate what triggers the problem? For instance, perhaps it only
happens when a large batch of very short jobs (less than a second) appears?
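
For items 4 and 6, a rough sketch of how the two checks could be scripted is below. The helper names (busy_schedds, count_hung) are made up for illustration and are not part of HTCondor, and the sketch assumes the logs live in the directory reported by "condor_config_val LOG" under the default names SchedLog and MasterLog; adjust if your site differs.

```shell
#!/bin/sh
# Sketch only: busy_schedds and count_hung are hypothetical helper names,
# not HTCondor commands.

# Read "name duty_cycle_percent" pairs on stdin and print only the schedds
# whose duty cycle is above the threshold (default 98, as in item 4).
busy_schedds() {
    threshold="${1:-98}"
    awk -v t="$threshold" 'NF >= 2 && $2 + 0 > t { print $1, $2 }'
}

# For each readable log file given as an argument, print the file name and
# the number of lines containing "appears hung" (item 6).
count_hung() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            printf '%s %s\n' "$f" "$(grep -c 'appears hung' "$f")"
        fi
    done
}

# Usage on a submit machine (assumes the HTCondor tools are on PATH):
#   condor_status -schedd -af name "floor(RecentDaemonCoreDutyCycle*100)" | busy_schedds 98
#   LOGDIR=$(condor_config_val LOG)
#   count_hung "$LOGDIR/SchedLog" "$LOGDIR/MasterLog"
```

If busy_schedds prints nothing, no schedd is above the threshold. A nonzero count from count_hung tells you which log to read around the times the shadows were killed.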
Hope the above helps