
Re: [HTCondor-users] condor_q hangs



On 8/24/2017 11:32 AM, Zhuo Zhang wrote:
> Hi,
> 
> We have a small HTCondor pool with 13 nodes (1 master and 12 worker 
> nodes) and each node has 24 cores. Cron jobs are set up on the master node, 
> and each cron job is a script which launches several DAGMan jobs 
> depending on different scenarios. But very often we see no 
> response from running condor_q when there are several hundred 
> HTCondor jobs (each job requests one CPU) in the queue.
> 
> My question is what are the possible causes of condor_q hanging?
> 
> Thank you in advance,
> 
> Zhuo
> 

Hi Zhuo,

It would be helpful to know a) what version of HTCondor you are using, b) what operating system you are on, c) whether your condor_q is running on the same machine as your condor_schedd, and d) whether a shared file system is involved.
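For the first two items, running condor_version on the submit machine reports both the HTCondor version and the platform it was built for.  For example (the version and platform strings below are just placeholders; yours will differ):

% condor_version
$CondorVersion: 8.6.x ... $
$CondorPlatform: x86_64_RedHat7 ... $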

Especially if you are running on Linux, there are no known issues with condor_q hanging with several hundred (or even several thousand) jobs in the queue on current versions of HTCondor.  HOWEVER, if the condor_schedd itself is blocked writing to an overloaded shared file system (i.e. your NFS server is overloaded), that could be a different story ... in fairness to HTCondor, an overloaded NFS server would make anything trying to access the shared volume unresponsive, not just HTCondor.  Also note that the schedd code is not as well optimized on Windows or MacOS as it is on Linux, so if your submit machine is running Windows, for instance, that could be a concern.
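One quick check I would suggest (a guess at a likely culprit, not something from your post): the schedd keeps its job queue transaction log in the SPOOL directory, and if SPOOL lives on the NFS mount the schedd can block every time it syncs that log.  You can see where SPOOL points with condor_config_val; the path shown below is just an example:

% condor_config_val SPOOL
/var/lib/condor/spool

If that path is on the shared file system, moving it to local disk would be worth trying.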

When your script runs condor_submit_dag, is it doing so from a current working directory that is on a shared file system?  Could you submit from local disk instead?
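For example (the directory and DAG file names below are only illustrative), the cron script could cd to local disk before submitting:

% cd /scratch/dag-workdir
% condor_submit_dag my_workflow.dag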

You can see if your schedd is overloaded by examining the RecentDaemonCoreDutyCycle attribute in the schedd ad - if the value is over 0.99 (99%), that is indeed a sign that the schedd is blocked/hanging, and the most likely cause is a slow/overloaded shared file system.  Here is an example of using condor_status to view the busiest submit machines in a pool:

% condor_status -schedd -af name recentdaemoncoredutycycle -sort -recentdaemoncoredutycycle | head
submit-XXX.xxx.xxx.edu 0.2218954354793896
submit-YYY.xxx.xxx.edu 0.1205609870993556
opt-a010.xxx.xxx.wisc.edu 0.07554425839541035

In the above example, you can see that none of our pool's submit machines are blocked or overloaded, as the busiest one is reporting 0.22 (22%), far below the 99% threshold of concern.
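If you just want to check the schedd on your own master/submit node, you can ask for it by name (the hostname below is a placeholder for yours):

% condor_status -schedd master.example.com -af RecentDaemonCoreDutyCycle
0.152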

Also, if condor_q is unresponsive, it may help to try "condor_sos condor_q", which asks the schedd to perform the condor_q query at a higher priority level.
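In other words, on the submit machine simply run:

% condor_sos condor_q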

Hope the above helps
Todd



-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685