
Re: [Condor-users] efficiency question


On Friday, April 8, 2011 at 8:04 PM, Rita wrote:

Our environment has 500 servers dedicated to Condor. We have 6 schedulers, and daily we run about 300k jobs. We use dynamic slots.

What is the most efficient way to do this:
1) Get total running jobs in the pool
  I use condor_status -claimed. 
That's definitely the least invasive way to get the information, but it's not always 100% accurate. Updates from executors and schedulers are delayed, so anything you get from the collector (which is what condor_status generally queries) can lag reality. If you need really precise, time-aligned counts, your best bet is to query all the schedulers in parallel and collate the output. Or look at Quill++ -- but that has its own problems.

You can also look at condor_status -schedd -- which gives you a summary of jobs at each scheduler in your pool.

You can use condor_q -global, but it queries schedulers round-robin, so depending on how long it takes to traverse and query them all, the information from the first scheduler can be out of date by the time you reach the final one.

Also, condor_q calls should be minimized on schedulers that are heavily loaded with jobs: they interrupt the condor_schedd process and slow down matchmaking and new job spawns.
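The parallel-query approach can be sketched in Python. The schedd names and the condor_q invocation shown in the comment are hypothetical, and the query function is stubbed with fixed counts so the sketch runs without a Condor pool:

```python
# Sketch: query every schedd in parallel and collate running-job counts.
from concurrent.futures import ThreadPoolExecutor

SCHEDDS = ["schedd1.example.com", "schedd2.example.com"]  # hypothetical names

def count_running(schedd):
    # In a real pool this would shell out to something like:
    #   condor_q -name SCHEDD -constraint 'JobStatus == 2' -format "%d\n" ClusterId
    # and count the output lines (JobStatus 2 == running).
    # Stubbed here with fixed data so the sketch is self-contained.
    sample = {"schedd1.example.com": 1200, "schedd2.example.com": 950}
    return sample[schedd]

def total_running(schedds):
    # Query all schedds concurrently so the per-schedd counts are as close
    # to time-aligned as possible, then sum them.
    with ThreadPoolExecutor(max_workers=len(schedds)) as pool:
        return sum(pool.map(count_running, schedds))

print(total_running(SCHEDDS))  # 2150
```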
2) Get total held jobs in the pool
I use condor_q -global -held 
condor_status -schedd can work here as well, though it's not 100% accurate. See above.
3) The most recent negotiation cycle and how many were negotiated 
Parse a log file?
Yup, that's the only way to get that data. You can use condor_fetchlog to pull the log file from the negotiator if that makes it easier, but big log files probably shouldn't be retrieved that way.
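A log-parsing pass for the most recent cycle could look like the sketch below. The exact negotiator log-line formats vary by Condor version, so the markers and sample lines here are illustrative, not authoritative:

```python
# Sketch: find the most recent negotiation cycle in the negotiator log and
# count the matches it made. Sample log text is made up for illustration.
SAMPLE_LOG = """\
04/08/11 20:01:02 ---------- Started Negotiation Cycle ----------
04/08/11 20:01:03 Matched 12.0 alice@example.com
04/08/11 20:01:05 ---------- Finished Negotiation Cycle ----------
04/08/11 20:06:02 ---------- Started Negotiation Cycle ----------
04/08/11 20:06:03 Matched 13.0 bob@example.com
04/08/11 20:06:04 Matched 14.0 carol@example.com
04/08/11 20:06:05 ---------- Finished Negotiation Cycle ----------
"""

def last_cycle_matches(log_text):
    # Keep only the text after the final "Started Negotiation Cycle"
    # marker, then count the "Matched" lines inside it.
    cycle = log_text.rsplit("Started Negotiation Cycle", 1)[-1]
    return sum(1 for line in cycle.splitlines() if " Matched " in line)

print(last_cycle_matches(SAMPLE_LOG))  # 2
```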
4) Top 10 users who are requesting the most memory and their jobs are running now
Check for request_memory?
This is data you'd have to assemble by aggregating the current job ads from all of your schedulers. For most simple jobs, ImageSize (or ImageSize_RAW) is the attribute to look at -- it starts out as the amount of memory the user thought the job would need at submit time. If you're trying to get actual memory used, ImageSize can help for running jobs, because Condor updates it to reflect observed usage once a job starts to run, but it's not always accurate: tracking memory use for multi-process, multi-threaded process trees is no simple task.
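The aggregation itself is straightforward once you have per-job owner/memory pairs. The sketch below assumes input shaped like the output of a hypothetical `condor_q -global -constraint 'JobStatus == 2' -format "%s " Owner -format "%d\n" ImageSize` run; the sample data is made up:

```python
# Sketch: top-N users by total ImageSize across their running jobs.
from collections import defaultdict

# Made-up "Owner ImageSize" lines; ImageSize is reported in KiB.
SAMPLE = """\
alice 2048000
bob 1024000
alice 512000
carol 4096000
"""

def top_memory_users(text, n=10):
    totals = defaultdict(int)
    for line in text.splitlines():
        owner, kib = line.split()
        totals[owner] += int(kib)
    # Largest consumers first, trimmed to the top n.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

print(top_memory_users(SAMPLE, 3))
# [('carol', 4096000), ('alice', 2560000), ('bob', 1024000)]
```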

If you tried to use RequestMemory you'd have to get into parsing and evaluating full-on ClassAd expressions. For my jobs this attribute shows up looking like:

RequestMemory = ceiling(ImageSize/1024.0)

So it's not just a float or int.
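If you only need to handle simple arithmetic expressions like the one above, a crude evaluator is possible. A real ClassAd evaluator is the right answer for anything more; this sketch only handles ceiling() plus attribute substitution, and eval() is tolerable here only because the input is trusted:

```python
# Sketch: evaluate the simple arithmetic subset of a RequestMemory
# expression, e.g. "ceiling(ImageSize/1024.0)". Not a ClassAd evaluator.
import math
import re

def eval_request_memory(expr, attrs):
    # Map the ClassAd ceiling() function onto Python's math.ceil.
    expr = re.sub(r"\bceiling\b", "math.ceil", expr)
    # Substitute known attribute values into the expression.
    for name, value in attrs.items():
        expr = re.sub(rf"\b{name}\b", str(value), expr)
    return eval(expr, {"math": math})  # trusted input only

print(eval_request_memory("ceiling(ImageSize/1024.0)", {"ImageSize": 2500}))  # 3
```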
5) Top 10 users who are requesting the most CPU and their jobs are running now
By "most CPU" do you mean number of cores or actual time spent on the CPU? You can get CPU time from the RemoteSysCpu and RemoteUserCpu attributes. If all you want is how many cores a job asked for, RequestCpus is the right attribute, but, like RequestMemory, it can be an expression.
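For the CPU-time interpretation, the per-user roll-up mirrors the memory case. The sketch below assumes you've already extracted (Owner, RemoteSysCpu, RemoteUserCpu) triples, both times in seconds, for running jobs; the data is made up:

```python
# Sketch: top-N users by accumulated CPU time (system + user seconds).
from collections import defaultdict

SAMPLE = [
    ("alice", 120.0, 3600.0),
    ("bob", 10.0, 900.0),
    ("alice", 60.0, 1800.0),
]

def top_cpu_users(rows, n=10):
    totals = defaultdict(float)
    for owner, sys_cpu, user_cpu in rows:
        totals[owner] += sys_cpu + user_cpu
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

print(top_cpu_users(SAMPLE, 2))  # [('alice', 5580.0), ('bob', 910.0)]
```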
Basically, it boils down to using condor_status and condor_q. I haven't found a way to query the negotiator daemon for statistics.
Yeah, unfortunately there's no way to query the negotiator for stats information.
BTW, I plan on running these commands on a 5 second basis.
That kind of frequency for condor_q calls could be deadly to your schedulers: it ties them up dealing with your queries instead of performing their matchmaking and job-spawning duties.

Running condor 7.2 for negotiator
You should definitely get up to 7.4.x -- Condor's performance keeps getting better with each newer release.

- Ian

Ian Chesal