
Re: [Condor-users] Condorview issues with Job Stats

On Fri, 13 Feb 2009, Greg.Hitchen@xxxxxxxx wrote:

Hi All

Revisiting an issue that we've asked about before over the past couple
of years but have never really solved. It relates to the User Job Statistics
part of condorview.

A first general question to the condor developers: where is the
condorview server getting this data from (obviously forwarded on from the
central managers)? Is it the schedd of the submitting nodes? That was my
assumption, or are the starter and shadow involved as well?

The condorview server gets all of its statistics from the collector
and negotiator.  Schedd, starter, and shadow are not involved
at all.  Most of the information can be considered to be
snapshots of what you get from condor_userprio.

So it is counting jobs as running from the time the node is claimed
until the time the claim is released.   As such condorview
will never tell you how many independent jobs have started
and finished, only the aggregate hours used.
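To make the claim-based accounting above concrete, here is a minimal sketch (not condorview's actual code) of how periodic snapshots of "slots claimed" turn into aggregate hours. The function name, the fixed sampling interval, and the sample data are all assumptions made for illustration.

```python
def aggregate_hours(claimed_counts, interval_minutes):
    """Approximate slot-hours from periodic snapshots.

    claimed_counts: number of claimed slots seen at each snapshot
                    (hypothetical data, one entry per sample).
    interval_minutes: assumed fixed spacing between snapshots.
    """
    # Each snapshot is taken to represent one full interval of usage.
    return sum(claimed_counts) * interval_minutes / 60.0


# One job holding a claim for 2 hours, sampled every 15 minutes.
one_long_job = [1, 1, 1, 1, 1, 1, 1, 1]
# Two 1-hour jobs run back to back on the same claim look identical.
two_short_jobs = [1, 1, 1, 1, 1, 1, 1, 1]

print(aggregate_hours(one_long_job, 15))    # 2.0
print(aggregate_hours(two_short_jobs, 15))  # 2.0
```

Both cases give the same total, which is why this kind of data yields hours used but not the number of separate jobs. Any snapshot that is missed or records zero lowers the total directly.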

Steve Timm

To illustrate the problem we are having I have attached two jpgs from our
condorview machine when I was testing things by running 100 jobs that
take 2 hours to run. In the first example (condor_test.jpg) jobs start
running at ~19:07 and the number of idle jobs drops rapidly straight away.
However, there is a gap of ~30 mins before running jobs are seen. Even
then there later appears a large gap in the red running jobs of ~1hr.
In the second example (condor_test1.jp), I restricted the jobs to our local pool,
to eliminate issues due to routers, etc. as we have several pools in our
organisation spread around the country in different states. The same problem
of the red running jobs not showing up occurs, although in this case only ~20
jobs run at a time because they are not being flocked to other pools.

Sorry for the long email but we would like to sort this out, because to get the
"correct" total number of job running hours we need to get the history
file from each submitter, run them through condor_history, and manually
figure it out in Excel. This always gives a number ~3-4X that shown in condorview.

BTW our CMs and condorview server are linux and the submit and execute
nodes are winxp.

Thanks for any help/info.


Greg Hitchen                                  greg.hitchen@xxxxxxxx
CSIRO IM&T Advanced Scientific Computing      phone: +61 8 6436 8663
Australian Resources Research Centre (ARRC)   fax:   +61 8 6436 8555
Postal address:                               mob:   0407 952 748
PO Box 1130, Bentley WA 6102, Australia
Street Address:
26 Dick Perry Avenue, Kensington WA 6151

Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.