
[HTCondor-users] Diagnosing the Queue



I’m running Condor 8.2.8 here and am seeing very poor responsiveness when submitting jobs (which is unusual for us) and when running ‘condor_q’ or ‘condor_submit -debug’.  In some cases ‘condor_q’ returns after several minutes; in others it fails with this error:

-- failed to fetch ads from: <our_scheduler_node_IP_address:51430>  :  <fqdn_of_same_scheduler_node>

 

This issue presumably started over the weekend, when someone submitted a set of jobs roughly 10x larger than “usual.”  When ‘condor_q’ does finish, the summary at the end shows the following:

31823 jobs; 0 completed, 31786 removed, 19 idle, 12 running, 24 held, 0 suspended
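
To make that summary easier to read at a glance, here is a minimal Python sketch that splits the line into per-state counts and prints each state's share of the reported total. The function name `parse_summary` and the regular expressions are illustrative assumptions based only on the line pasted above, not on any documented condor_q output format.

```python
import re

# The summary line exactly as pasted above.
SUMMARY = ("31823 jobs; 0 completed, 31786 removed, 19 idle, "
           "12 running, 24 held, 0 suspended")

# Hypothetical helper: the expected layout is inferred from the one line above.
def parse_summary(line):
    """Return (reported_total, {state: count}) for a condor_q summary line."""
    total_match = re.match(r"\s*(\d+) jobs;", line)
    if total_match is None:
        raise ValueError("line does not look like a condor_q summary")
    total = int(total_match.group(1))
    pairs = re.findall(
        r"(\d+) (completed|removed|idle|running|held|suspended)", line)
    states = {name: int(count) for count, name in pairs}
    return total, states

total, states = parse_summary(SUMMARY)

# Largest states first, with their share of the reported total.
for name, count in sorted(states.items(), key=lambda kv: -kv[1]):
    share = 100.0 * count / total
    print(f"{name:>10}: {count:6d} ({share:.1f}% of reported total)")
```

Nearly every job in the queue is in the removed state, while only a few dozen are idle, running, or held. The per-state counts add up to 31841 rather than the reported 31823; that may simply mean the queue changed while the slow query was running, but that is a guess.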

 

I’m posting to see if anyone has insight into how to diagnose why the jobs aren’t running.  I believe the volume (>33k jobs submitted over three days) isn’t unprecedented.  I’m clearly not a Condor subject-matter expert, but I’m trying to grow into something close, by hook or by crook.

 

Thanks for any and all insights!

 

Eric