
Re: [Condor-users] Monitoring the load of a job

On 3/22/2012 7:28 AM, Hermann Fuchs wrote:

We usually use a combination of
condor_q -run to see which job belongs to which slot
and condor_status to see the load on this slot.
However, I do not know how to combine this information.

Best regards,

A couple of quick thoughts on this:

Be aware that simply entering

   condor_status -run

will display all busy slots along with the load average and the submitting user name (and the machine they submitted from).

It is always useful to take a peek at all the machine attributes (condor_status -l) and/or job attributes (condor_q -l) and see what is there. If you do, you will notice that for any slot that is claimed/busy, the JobId of the job running on that slot appears in the machine ad (along with other info like the GlobalJobId, RemoteUser, etc.).

So to see the load average generated by a particular job, for instance job "92.0", you could do this:

   condor_status -constraint 'JobId=="92.0"'

The above will get the load info from the central manager, which is updated only periodically, so the load average may be a couple of minutes stale (which likely doesn't matter much, since the load average is itself averaged over a period of time anyhow).
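
If you only want the fields of interest rather than the default table, something like the following might work. Treat it as a rough sketch: job_load is a made-up helper name, and it assumes the Name, LoadAvg and RemoteUser attributes are present in the machine ad for the slot (check with condor_status -l).

```shell
# Hypothetical helper (not part of Condor): print the slot name, load
# average and remote user for the slot running the given job id.
job_load() {
    # $1 is the job id as it appears in the machine ad, e.g. 92.0
    condor_status -constraint "JobId == \"$1\"" \
        -format "%s " Name \
        -format "%.2f " LoadAvg \
        -format "%s\n" RemoteUser
}
```

Calling job_load 92.0 should print one line per matching slot, and nothing at all if the job is not currently running.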

If you want up-to-the-second load info, you could use the "-direct" argument in condor_status to go directly to the execute node instead of using the cached info in the central manager. To do this we can use the back-tick geekness of the shell to state which node to directly query via another invocation of condor_status like so:

   condor_status -direct `condor_status -constraint 'JobId=="92.0"' -format "%s\n" Name`
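
If you do this often, the same two-step lookup could be wrapped in a small shell function. This is only a sketch under a few assumptions: job_load_direct is a hypothetical name, it takes the first matching slot, and it bails out when the job is not running anywhere so that -direct is never given an empty argument.

```shell
# Hypothetical wrapper (not part of Condor): find the slot running a job
# via the central manager, then query that execute node directly.
job_load_direct() {
    # Step 1: ask the central manager which slot is running job $1.
    slot=$(condor_status -constraint "JobId == \"$1\"" -format "%s\n" Name | head -n 1)
    if [ -z "$slot" ]; then
        echo "job $1 is not running on any slot" >&2
        return 1
    fi
    # Step 2: bypass the cached ad and query the execute node itself.
    condor_status -direct "$slot"
}
```

For example, job_load_direct 92.0 does the same thing as the one-liner above, but prints a message and returns non-zero if no slot claims that job.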

Hope the above is helpful and not overly geeky,

On Thu, 2012-03-22 at 11:35 +0000, Bob Briscoe wrote:
Can one monitor the load generated by a particular job as it's running? I ask because occasionally a job may claim a slot, be in running state, but actually be sitting idle as it's expecting some input to be sent to it from some other machine (e.g. could be a case of deadlock). In such a case it would be useful to see that slot's load. I know that condor_status publishes the loads of slots, but it often gets its mappings wrong, so unclaimed slots are reported under load whereas working slots are shown to be un-loaded. Also, we'd like to do this via condor_q or some similar command which would specify the job id or user id.
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
Condor Project Technical Lead          1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685