
Re: [Condor-users] Condorview issues with Job Stats



On Fri, 20 Feb 2009, Greg.Hitchen@xxxxxxxx wrote:


Hi again Steve

Thanks for your responses and taking the time to help us.

But surely that info as seen by the collectors has to come from
somewhere originally, i.e. from the submit schedd or elsewhere.

It's coming from each of the submit schedds advertising to
its respective collector, which then forwards to the view server.

OK, that makes sense. That is how I thought/assumed it should work.
That's why I can't understand what's wrong. As you say, the schedd
updates the collector on its local central manager, which forwards this
on to the view_server collector.

If you want the condorview server to show all three pools,
then VIEW_SERVER on pool B and pool C should be set to be
the same as the VIEW_SERVER on pool A.  You can, and many
do, aggregate the output of many collectors into one VIEW_SERVER.
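A minimal sketch of that aggregation, assuming the stock
CONDOR_VIEW_HOST knob (VIEW_SERVER in the text may be a site-local
macro that feeds it; hostname and port here are illustrative):

    # In the collector config on the central managers of pools A, B
    # and C: forward all ClassAd updates to the view server.
    CONDOR_VIEW_HOST = viewserver.poolA.example.com:12345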

This is our setup. Using the previous example, we have all three
collectors in pools A, B and C reporting to our only condorview server,
which resides in pool A.

So are you running an extra copy of the condor collector on
A to be a dedicated view server, in addition to the normal
collector on A?
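For reference, the manual's recipe for running a dedicated view
server alongside the normal collector looks roughly like this (port,
log name and history path are illustrative):

    # On machine A, in addition to the regular collector:
    VIEW_SERVER             = $(COLLECTOR)
    VIEW_SERVER_ARGS        = -f -p 12345
    VIEW_SERVER_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/ViewServerLog"
    DAEMON_LIST             = MASTER, NEGOTIATOR, COLLECTOR, VIEW_SERVER
    # So the view server keeps the history that condorview plots:
    KEEP_POOL_HISTORY = True
    POOL_HISTORY_DIR  = $(LOG)/ViewHistory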


From what you described in the original message, only pool A is
in fact reporting to condorview. The other two are not.
Check the collector startup logs on B and C---if they are reporting,
the logs would say so.
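One quick way to check, assuming default log locations (the exact
wording of the forwarding messages varies by version, so grep
loosely):

    # On the central managers of B and C:
    grep -i view    $(condor_config_val LOG)/CollectorLog
    grep -i forward $(condor_config_val LOG)/CollectorLog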

In fact, it appears that only jobs running in the same pool as the
submit machine are getting correctly reported, regardless of which
pool the submit node is in. It is jobs that flock to another pool
that are not getting reported.

Perhaps my original email didn't describe things well.

All collectors in pools A, B and C are configured to report to the
view_server collector (which just happens to reside in pool A).

If a submit machine in A runs jobs in A then the view server reports
running jobs as expected.

If a submit machine in A runs jobs in B or C then the view server
does not report the jobs as running (condor_q shows them as running, though).

You weren't really clear on how a submit machine in A is submitting
jobs to B and C. Flocking? Condor-G?
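If it is flocking, the submit side would typically carry something
like the following (central manager names are placeholders), and the
target pools must also let the flocked schedd in via FLOCK_FROM and
their HOSTALLOW lists:

    # On the submit machine in pool A:
    FLOCK_TO = cm.poolB.example.com, cm.poolC.example.com

    # On the central managers of pools B and C:
    FLOCK_FROM = submit.poolA.example.com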

If a submit machine in B runs jobs in B then the view server reports
the jobs as running as expected.

If a submit machine in B runs jobs in A or C then the view server
does not report the jobs as running (condor_q / the schedd does).

Remember that the view server really doesn't care about
jobs at all--it reports numbers of *machines* that are claimed
at any given time by any given user.

Does the total number of machines in the machine plot correspond
to the total number of machines in A+B+C?
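One way to check is to query the view server directly for totals
(host and port are illustrative):

    condor_status -pool viewserver.poolA.example.com:12345 -total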


The exact same trends occur for a submit machine in C.

We're about to bite the bullet and try updates with TCP, even though
the manual doesn't exactly sound encouraging! :)

I've been running updates with TCP for the last 4 years and only
recently found the first problem with that technique--namely
a very rare hang condition if a network port drops out in
mid-update, which the condor team promptly patched.
But if TCP were your problem, you would be seeing a fluctuating
number of machines reported in condor_status and
in the condorview graphs.
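For reference, the switch is set on the updating daemons, and the
manual of that era also wants a socket cache sized on the receiving
collector (cache size here is illustrative):

    # On the daemons that should update their collector over TCP:
    UPDATE_COLLECTOR_WITH_TCP = True

    # On the collector(s) receiving TCP updates:
    COLLECTOR_SOCKET_CACHE_SIZE = 1000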


How is your condorview server set up?
What is the value of POOL in the make_stats script?
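Presumably POOL should name the view server collector rather than the
local pool's main collector; something like this (shell-style
assignment shown, hostname and port illustrative):

    POOL=viewserver.poolA.example.com:12345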

Cheers

Greg

The condorview server is a collector after all, so it can have
all the debug diagnostics that a regular collector does. Turn
all collectors up to D_SECURITY D_COMMAND D_FULLDEBUG
logging and you will see what is getting forwarded and what is not.
You should be able to do a condor_status against your
view server and see the aggregated classads from all three pools.
If not, something is wrong---maybe an access issue, or maybe
flocking.
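Concretely, something like this (COLLECTOR_DEBUG is the standard
per-subsystem debug knob; host and port are illustrative):

    # In the collector config on all three central managers and on
    # the view server:
    COLLECTOR_DEBUG = D_SECURITY D_COMMAND D_FULLDEBUG

    # Then query the view server directly; it should show the
    # aggregated ClassAds from all three pools:
    condor_status -pool viewserver.poolA.example.com:12345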

Steve Timm



--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.