
Re: [Condor-users] Monitoring the load of a job

Hi Todd/Hermann,

Thanks for the interest. Todd, your approach appears promising, but it seems to come unstuck on multi-core machines. For example, consider the output from a four-core box that doesn't have all slots claimed:

$ condor_status -direct `condor_status -con 'JobId=="123442.4"' -format "%s\n" Name`

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@xxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy     1.000  2003  4+02:18:27
alphabox03.po LINUX      X86_64 Claimed   Busy     0.990  2003  4+02:18:27
alphabox03.po LINUX      X86_64 Unclaimed Idle     1.000  2003  0+03:44:11
alphabox03.po LINUX      X86_64 Claimed   Busy     0.000  2003  5+10:21:40
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     4     0       3         1       0          0        0

               Total     4     0       3         1       0          0        0


Note that slot3 is Unclaimed/Idle but is assigned a LoadAvg value of 1.0, whereas the reverse is true for slot4 (Claimed/Busy, but assigned a LoadAvg value of 0.0). It would therefore appear that the Condor loads are being assigned to the incorrect slots. So when I try to get slot4's load directly, I get the erroneous value of zero:

$ condor_status -con 'JobId=="123442.4"' -format "%s " Name -format "%s\n" LoadAvg
slot4@xxxxxxxxxxxxxxxxxxxxxxxxxxx  0.0

These machines are running 64-bit Debian stable, but we see the same behaviour with other Linux distributions. Is there a workaround for this problem?


From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Cc: Hermann Fuchs <hermann.fuchs@xxxxxxxxxxxxxxxx>; Bob Briscoe <paw_deer@xxxxxxxxxxx>
Sent: Thursday, 22 March 2012, 17:47
Subject: Re: [Condor-users] Monitoring the load of a job

On 3/22/2012 7:28 AM, Hermann Fuchs wrote:
> Hi
> We usually use a combination of
> condor_q -run to see which jobs belong to which slot
> and condor_status to see the load on this slot.
> However, I do not know how to combine this information.
> Best regards,
> Hermann

A couple of quick thoughts on this:

Be aware that simply entering

  condor_status -run

will display all busy slots along with the load average and the submitting user name (and the machine they submitted from).

It is always useful to take a peek at all the machine attributes (condor_status -l) and/or job attributes (condor_q -l) and see what is there. Doing this, you will notice that for any slot that is claimed/busy, the JobId of the job running on that slot appears in the machine ad (along with other info such as the GlobalJobId, RemoteUser, etc.).
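
As a rough illustration of that kind of inspection, the sketch below filters the long-form machine ad down to a few attributes for the slot running a given job. The helper name job_slot_attrs is made up for this example, and the filter assumes the usual "Attribute = value" layout of condor_status -l output.

```shell
#!/bin/sh
# job_slot_attrs (hypothetical helper): show selected machine-ad attributes
# for the slot(s) whose machine ad lists the given job id.
job_slot_attrs() {
    job_id=$1
    # -l prints the full machine ad, one "Attribute = value" pair per line
    # (assumed layout); grep keeps only the attributes discussed here.
    condor_status -l -constraint "JobId==\"$job_id\"" |
        grep -E '^(JobId|GlobalJobId|RemoteUser|LoadAvg) = '
}
```

For example, job_slot_attrs 92.0 would print those attribute lines for whichever slot is running job 92.0, and nothing if no slot currently lists that job.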

So to see the load average generated by a particular job, for instance job "92.0", you could do this:

condor_status -cons 'JobId=="92.0"'

The above will get the load info from the central manager, which is updated only periodically, so the load average may be a couple of minutes stale (which likely doesn't matter much, since the load average is averaged over a period of time anyhow).
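
If only the slot name and its load are of interest, the query can be wrapped in a small shell function. This is only a sketch: job_load is a made-up name, and it prints whatever LoadAvg the central manager currently holds for the matching slot.

```shell
#!/bin/sh
# job_load (hypothetical helper): print "<slot name> <LoadAvg>" for the
# slot(s) running the given job, as known to the central manager.
job_load() {
    job_id=$1
    if [ -z "$job_id" ]; then
        echo "usage: job_load <cluster.proc>" >&2
        return 2
    fi
    # The value comes from the central manager's cached machine ad,
    # so it may lag behind the execute node by a few minutes.
    condor_status -constraint "JobId==\"$job_id\"" \
        -format "%s " Name -format "%s\n" LoadAvg
}
```

Calling job_load 92.0 prints one line per matching slot; no output means no slot currently advertises that job.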

If you want up-to-the-second load info, you could use the "-direct" argument in condor_status to go directly to the execute node instead of using the cached info in the central manager. To do this we can use the back-tick geekness of the shell to state which node to directly query via another invocation of condor_status like so:

condor_status -direct `condor_status -con 'JobId=="92.0"' -format "%s\n" Name`
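
One caveat with the one-liner: if no slot is running the job, the inner command prints nothing and -direct is left without a target. A slightly more defensive sketch (job_load_direct is a made-up name, and taking only the first match is an assumption for jobs that occupy a single slot) could look like this:

```shell
#!/bin/sh
# job_load_direct (hypothetical helper): look up the slot running a job,
# then query that execute node directly instead of the central manager.
job_load_direct() {
    job_id=$1
    # Step 1: ask the central manager which slot advertises this job.
    slot=$(condor_status -constraint "JobId==\"$job_id\"" \
        -format "%s\n" Name | head -n 1)
    if [ -z "$slot" ]; then
        echo "no slot is currently running job $job_id" >&2
        return 1
    fi
    # Step 2: go straight to the execute node for up-to-date values.
    condor_status -direct "$slot"
}
```

Usage would be job_load_direct 92.0, which either prints the direct query result or reports that the job was not found on any slot.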

Hope the above is helpful and not overly geeky,

> On Thu, 2012-03-22 at 11:35 +0000, Bob Briscoe wrote:
>> Hi,
>> Can one monitor the load generated by a particular job as it's running? I ask because occasionally a job may claim a slot and be in the running state, but actually be sitting idle because it's expecting some input to be sent to it from some other machine (e.g. it could be a case of deadlock). In such a case it would be useful to see that slot's load. I know that condor_status publishes the loads of slots, but it often gets its mappings wrong, so unclaimed slots are reported as under load whereas working slots are shown as unloaded. Also, we'd like to do this via condor_q or some similar command that lets us specify the job id or user id.
>> TIA,
>> Bob

-- Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing  Department of Computer Sciences
Condor Project Technical Lead          1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685