
Re: [HTCondor-users] condor_q hangs



Thank you, Todd, for the explanations and tips. I think our issue may be related to the shared file system, and I will investigate it further.

Thanks,

Zhuo


On 8/24/2017 4:27 PM, Todd Tannenbaum wrote:
On 8/24/2017 11:32 AM, Zhuo Zhang wrote:
Hi,

We have a small HTCondor pool with 13 nodes (1 master and 12 worker
nodes), and each node has 24 cores. Cron jobs are set up on the master node,
and each cron job is a script that launches several DAGMan jobs
depending on different scenarios. But very often we see no
response from condor_q when there are several hundred
HTCondor jobs (each requesting one CPU) in the queue.

My question is: what are the possible causes of condor_q hanging?

Thank you in advance,

Zhuo

Hi Zhuo,

It would be helpful to know (a) what version of HTCondor you are using, (b) what operating system, (c) whether your condor_q is running on the same machine as your condor_schedd, and (d) whether a shared file system is involved.
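
For (a) and (b), something like the following run on the submit machine should be enough (just a quick sketch, assuming the HTCondor binaries are in your PATH):

% condor_version
% uname -a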

Especially if you are running on Linux, there are no known issues with condor_q hanging with several hundred (or even thousands of) jobs in the queue on current versions of HTCondor.  HOWEVER, if the condor_schedd is hanging because it is blocked writing to an overloaded shared file system (i.e. your NFS server is overloaded), that could be a different story ... in fairness to HTCondor, that would cause anything trying to access the shared volume to become unresponsive, not just HTCondor.  The schedd code is not as optimized on Windows or macOS as it is on Linux, so if your submit machine is running Windows, for instance, that could be a concern.
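
One rough way to spot an I/O-blocked schedd on Linux (a generic sketch, nothing HTCondor-specific) is to check whether the condor_schedd process is sitting in uninterruptible I/O wait (state "D"):

% ps -C condor_schedd -o state,pid,cmd

If the state column shows "D" for long stretches, the schedd is most likely stuck waiting on the file system.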

When your script runs condor_submit_dag, is it doing so from a current working directory that is on a shared file system?  Could you submit from local disk instead?
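
For example (hypothetical paths and DAG file name, just to illustrate), the cron script could cd to local disk before submitting:

% cd /var/tmp/dag_runs          # local disk, not NFS (hypothetical path)
% condor_submit_dag nightly_run.dag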

You can see whether your schedd is overloaded by examining the RecentDaemonCoreDutyCycle attribute in the schedd ad.  If the value is over 0.99 (99%), that is a sign that the schedd is blocked/hanging, and the most likely cause is a slow or overloaded shared file system.  Here is an example of using condor_status to view the busiest submit machines in a pool:

% condor_status -schedd -af name recentdaemoncoredutycycle -sort -recentdaemoncoredutycycle | head
submit-XXX.xxx.xxx.edu 0.2218954354793896
submit-YYY.xxx.xxx.edu 0.1205609870993556
opt-a010.xxx.xxx.wisc.edu 0.07554425839541035

In the above example, you can see that none of our pool's submit machines are blocked or overloaded, as the busiest one is reporting 0.22 (22%), far below the 99% level of concern.
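
If you just want to flag any schedd that looks blocked, a constraint query along these lines should also work (no output means nothing is above 95%):

% condor_status -schedd -constraint 'RecentDaemonCoreDutyCycle > 0.95' -af Name RecentDaemonCoreDutyCycle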

Also, if condor_q is unresponsive, it may help to try running "condor_sos condor_q", which asks the schedd to perform the condor_q query at a higher priority level.
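
For example:

% condor_sos condor_q

(you can append your usual condor_q arguments as well).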

Hope the above helps,
Todd



--
Zhuo Zhang, ASSISTT
I.M. Systems Group (IMSG), NOAA/NESDIS/STAR
5825 University Research Court, Suite 1500 (IMSG), Cube 1500-11
College Park, MD 20740
Tel: (240) 582-3585 (x23017)