
Re: [Condor-users] Condorview issues with Job Stats



PS--you can see that the individual jobs in pools B and C are running
because that information is obtained from the respective
condor_schedd.

Steve Timm


On Wed, 18 Feb 2009, Steven Timm wrote:

On Thu, 19 Feb 2009, Greg.Hitchen@xxxxxxxx wrote:


If you have a standalone condorview server the docs suggest that
you only need a collector running, not a negotiator as well?
We tried having the negotiator run as well but it appeared to make
no difference to what the condorview server was seeing.

That is right.  A condorview server is just a collector.
It can, in fact, be the same collector as your regular collector.

Condorview has nothing to do with whether any particular
job is running; it only knows the totals as seen by the
collector, based on condor_status -submitters.
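
For reference, you can see exactly the totals the view server is working
from by querying the submitter ads yourself.  A minimal sketch (the pool
name is a placeholder and the output layout varies a bit between Condor
versions):

   # totals per submitter, as advertised by each schedd to the collector
   condor_status -submitters -pool pool-a.example.com

Each schedd advertises one submitter ad per user with RunningJobs,
IdleJobs, and HeldJobs counts; the view server only ever sees those
aggregate numbers, never the individual jobs.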

If you want the condorview server to show all three pools,
then VIEW_SERVER on pool B and pool C should be set to be
the same as the VIEW_SERVER on pool A.  You can, and many
do, aggregate the output of many collectors into one VIEW_SERVER.
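
A rough sketch of what that configuration looks like is below.  The
hostname and port are placeholders, and the exact knob names should be
checked against the CondorView server section of the manual for your
version:

   # On the view server host (pool A's central manager, or a standalone
   # box): run a second collector instance that acts as the view server
   # and keeps the pool history that condorview graphs.
   VIEW_SERVER             = $(COLLECTOR)
   VIEW_SERVER_ARGS        = -f -p 12345
   VIEW_SERVER_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/ViewServerLog"
   KEEP_POOL_HISTORY       = True
   POOL_HISTORY_DIR        = /var/lib/condor/viewhist
   DAEMON_LIST             = $(DAEMON_LIST), VIEW_SERVER

   # On the central managers of pools A, B and C: forward all the ads
   # their collectors receive to that one view server.
   CONDOR_VIEW_HOST        = viewserver.pool-a.example.com:12345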

Steve



We've done some more testing so perhaps a simplified example of
what's happening will bring some more suggestions/answers.

Condor setup with 3 pools: A, B and C. The pools are located in 3
different states (in Australia), with routers, etc. in between.

Condorview server in pool A.

Submit machine geographically in region A and in condor pool A.
Submit jobs configured to only run in pool A. All OK. Condorview
shows running jobs.

Submit machine geographically in region A and in condor pool A.
Submit jobs configured to only run in pool B (or C). Running jobs
not showing up in condorview stats/graphs. Show up OK as running
using condor_q on submit machine.

Submit machine geographically in region A and in condor pool B.
Submit jobs configured to only run in pool B. All OK. Condorview
shows running jobs.

Submit machine geographically in region A and in condor pool B.
Submit jobs configured to only run in pool A (or C). Running jobs
not showing up in condorview stats/graphs. Show up OK as running
using condor_q on submit machine.

Submit machine geographically in region A and in condor pool C.
Submit jobs configured to only run in pool C. All OK. Condorview
shows running jobs.

Submit machine geographically in region A and in condor pool C.
Submit jobs configured to only run in pool A (or B). Running jobs
not showing up in condorview stats/graphs. Show up OK as running
using condor_q on submit machine.

I.e. it appears as though jobs that have flocked to, and are running
in, a different pool from the one in which they were submitted are
not being "seen" by the condorview server, even though the submitting
schedd knows that they are.

We realize that the condorview collector is just getting info
forwarded to it from the collectors on the central managers.
But what info is being forwarded, and from which daemons does the
information about whether a job is running come?

Thanks for any further insights you might have.

Cheers

Greg


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Steven Timm
Sent: Friday, 13 February 2009 12:29 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Condorview issues with Job Stats

On Fri, 13 Feb 2009, Greg.Hitchen@xxxxxxxx wrote:

Hi All

Revisiting an issue that we've asked about before over the past couple
of years but have never really solved. It relates to the User Job Statistics
part of condorview.

A first general question to the condor developers would be: where is the
condorview server getting this data from (obviously it is forwarded on from
the central managers)? Is it the schedd of the submitting nodes, as was my
assumption, or are the starter and shadow involved as well?

The condorview server gets all of its statistics from the collector
and negotiator.  Schedd, starter, and shadow are not involved
at all.  Most of the information can be considered to be
snapshots of what you get from condor_userprio.

So it is counting jobs as running from the time the node is claimed
until the time the claim is released.   As such condorview
will never tell you how many independent jobs have started
and finished, only the aggregate hours used.
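
If you want to look at roughly the numbers condorview is working from,
the negotiator's accumulated-usage view is available from the command
line.  A small sketch (the pool name is a placeholder; flag spellings
may differ slightly between versions):

   # accumulated usage, in hours, per submitter, including users
   # who currently have no jobs in the queue
   condor_userprio -pool pool-a.example.com -allusers -usage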

Steve Timm




To illustrate the problem we are having I have attached two JPGs from our
condorview machine, captured when I was testing things by running 100 jobs
that take 2 hours each to run. In the first example (condor_test.jpg) jobs
start running at ~19:07 and the number of idle jobs drops rapidly straight
away. However, there is a gap of ~30 mins before running jobs are seen. Even
then, there later appears a large gap of ~1 hr in the red running-jobs line.
In the second example (condor_test1.jpg), I restricted the jobs to our local
pool to eliminate issues due to routers, etc., as we have several pools in our
organisation spread around the country in different states. The same problem
of the red running jobs not showing up occurs, although in this case only ~20
jobs run at a time because they are not being flocked to other pools.

Sorry for the long email, but we would like to sort this out. Currently, to
get the "correct" total number of job running hours we need to get the history
file from each submitter, run them through condor_history, and manually
figure it out in Excel. This always gives numbers ~3-4X those shown in condorview.
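
In case it helps anyone doing the same thing, that tally can be scripted
instead of done by hand in Excel.  A minimal sketch, run on a machine that
has both condor_history and awk available (the attribute names are the
standard job-ad ones; adjust the constraint to taste):

   # sum the remote wall-clock hours of all completed jobs known to
   # this submit machine's history file
   condor_history -constraint 'JobStatus == 4' -format "%f\n" RemoteWallClockTime \
       | awk '{ s += $1 } END { printf "%.1f hours\n", s / 3600 }'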

BTW, our CMs and condorview server are Linux and the submit and execute
nodes are WinXP.

Thanks for any help/info.

Cheers

Greg
------------------------------------------------------------------------------------------------------
Greg Hitchen                                                                         greg.hitchen@xxxxxxxx
CSIRO IM&T Advanced Scientific Computing              phone: +61 8 6436 8663
Australian Resources Research Centre (ARRC)             fax:       +61 8 6436 8555
Postal address:                                                                     mob:          0407 952 748
PO Box 1130, Bentley WA 6102, Australia
Street Address:
26 Dick Perry Avenue, Kensington WA 6151
-------------------------------------------------------------------------------------------------------






--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.