
Re: [Condor-users] Condorview issues with Job Stats



 
After much fiddling around and testing (including setting
UPDATE_COLLECTOR_WITH_TCP = True, which was fun in itself: sorting
out Linux file descriptor limits, ulimit, etc.) we have come to the
following conclusions.

1. The issue is unrelated to multiple pools reporting to one
condorview server.

2. It is also unrelated to whether UDP or TCP is used for updates.

3. It appears to depend on the number of jobs submitted to the
queue.

In summary, if say MAX_JOBS_SUBMITTED = 100 and 100 jobs are
submitted, then the info shown by condorview reflects reality.
If 500 jobs are submitted, then all eventually run to completion OK
(but only 100 run at a time, as MAX_JOBS_SUBMITTED = 100) BUT the
info/graphs shown by condorview in the Job Statistics ONLY show
the IDLE jobs correctly and NO running jobs. It is only when the
number of jobs in the queue drops to a lower level that the running
jobs start to appear in the graphs. At the same time the Machine
Statistics correctly show the number of machines running condor jobs.
The threshold appears to be on the order of 150-200 jobs in
the queue. All our Central Managers and Condorview Servers are
running Linux, all our pool machines are Windows boxes, and all submit
nodes are Windows boxes with their local schedd handling their own
job queues. The testing shows that the submit CPUs are not overloaded,
averaging perhaps 20-30% when handling the schedd and all the shadows.
It appears as though the schedd cannot handle the number of jobs in the
queue and report the correct info to the collector at the same time.
Our default setup is SCHEDD_INTERVAL = 30 with the default
SCHEDD_INTERVAL_TIMESLICE = 0.05. We have also tested with the "normal"
schedd interval of 300s and have even tried a timeslice of 0.5,
but the behaviour remains the same.
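For clarity, the combinations we tested correspond to these
condor_config settings (values as described above):

```
# Default setup tested
SCHEDD_INTERVAL = 30
SCHEDD_INTERVAL_TIMESLICE = 0.05

# Also tested: the stock 300s interval, and a 10x larger timeslice
# SCHEDD_INTERVAL = 300
# SCHEDD_INTERVAL_TIMESLICE = 0.5
```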

Again, any info/comments/insights would be appreciated.

Thanks

Cheers

Greg


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Steven Timm
Sent: Friday, 20 February 2009 11:50 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Condorview issues with Job Stats

On Fri, 20 Feb 2009, Greg.Hitchen@xxxxxxxx wrote:

>
> Hi again Steve
>
> Thanks for your responses and taking the time to help us.
>
>>> But surely that info as seen by the collectors has to come from
>>> somewhere originally, i.e. from the submit schedd, or elsewhere
>
>> It's coming from each of the submit schedd's advertising to
>> its respective collector which then forwards to the view server.
>
> OK, that makes sense. That is how I thought/assumed it should work.
> That's why I can't understand what's wrong. As you say, the schedd
> updates the collector on its local central manager, which forwards this
> on to the view_server collector.
>
>>>> If you want the condorview server to show all three of pools,
>>>> then VIEW_SERVER on pool B and pool C should be set to be
>>>> the same as the VIEW_SERVER on pool A.  You can, and many
>>>> do, aggregate the output of many collectors into one VIEW_SERVER.
>>>
>>> This is our setup. Using the previous example we have all 3
>>> collectors in pools A, B and C reporting to our only condorview server
>>> which resides in pool A.

So are you running an extra copy of the condor collector on
A to be a dedicated view server, in addition to the normal
collector on A?
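For what it's worth, the usual aggregation setup looks roughly like
this (hostname and history directory are hypothetical;
CONDOR_VIEW_HOST is the knob that makes a collector forward its
updates):

```
# On the central managers of pools A, B and C:
# forward all collector updates to the shared view server
CONDOR_VIEW_HOST = viewserver.poola.example.com

# On the view server's collector: record the history CondorView plots
KEEP_POOL_HISTORY = True
POOL_HISTORY_DIR = $(LOCAL_DIR)/ViewHist
```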

>>
>> From what you described in the original message, only pool A is
>> in fact reporting to condorview. The other 2 are not.
>> Check the logs of collector startup on B and C---if they are reporting
>> it would say so.
>
> In fact, it appears that only jobs running in the same pool as the
> submit machine are getting correctly reported, regardless of which
> pool the submit node is in. It is jobs that flock to another Pool
> that are not getting reported.
>
> Perhaps my original email didn't describe things well.
>
> All collectors in pools A, B and C are configured to report to the
> view_server collector (which just happens to reside in pool A).
>
> If a submit machine in A runs jobs in A then the view server reports
> running jobs as expected.
>
> If a submit machine in A runs jobs in B or C then the view server
> does not report the jobs as running (condor_q shows them as running, though).

You weren't really clear on how a submit machine in A is submitting
jobs to B and C.  Flocking? Condor-G? 
>
> If a submit machine in B runs jobs in B then the view server reports
> the jobs as running as expected.
>
> If a submit machine in B runs jobs in A or C then the view server
> does not report the jobs as running (condor_q / schedd does).

Remember that the view server really doesn't care about
jobs at all--it reports numbers of *machines* that are claimed
at any given time by any given user.

Does the total number of machines in the machine plot correspond
to the total number of machines in A+B+C?

>
> The exact same trends occur for a submit machine in C.
>
> We're about to bite the bullet and try updates with TCP, even though
> the manual doesn't exactly sound encouraging! :)
>
I've been running updates with TCP for the last 4 years and only
recently found the first problem with that technique--namely
a very rare hang condition if a network port drops out in
mid-update, which the condor team promptly patched.
But if TCP was your problem, you would be seeing a fluctuating
number of machines being reported in condor_status and
in the condorview graphs.


How is your condorview server set up?
What is the value of POOL in make_stats script?





> Cheers
>
> Greg

The condorview server, being a collector after all, can have
all the debug diagnostics that a regular collector does. Turn
all collectors up to D_SECURITY D_COMMAND D_FULLDEBUG
logging and you will see what is getting forwarded and what is not.
You should be able to do a condor_status against your
view server and see the aggregated classads from all three pools.
If not, something is wrong; maybe an access issue, or maybe
flocking.
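A sketch of that diagnostic, assuming the standard collector debug
knob:

```
# In the condor_config of every collector, including the view server
COLLECTOR_DEBUG = D_SECURITY D_COMMAND D_FULLDEBUG
```

Then querying the view server directly, e.g.
condor_status -pool <viewserver-hostname>, should return the
aggregated machine ads from all three pools.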

Steve Timm


> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>

-- 
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.