
[Condor-users] Still having Condorview issues with Job Statistics



 
OK, more testing indicates that conclusion 1 in the previous
email below is incorrect. Our latest testing involved running
Wireshark on both the submit PC and the condorview server. We checked
the UDP update packets leaving the submit PC (these go to the
central manager(s)) and then the UDP update packets arriving at the
condorview server, forwarded on by the central manager(s).
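In case anyone wants to repeat the capture, a filter along the
following lines should isolate the collector update traffic (this
assumes the default collector port of 9618 and interface eth0;
adjust for your own pools):

    # Wireshark/tcpdump capture filter for collector updates
    udp and port 9618

    # equivalent tcpdump invocation on the linux boxes
    tcpdump -n -i eth0 'udp and port 9618'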

A) the UDP packet info sent by the submit machine is the same
as that received by the condorview server.

B) in all cases the running, idle, and flocked jobs info reflected
reality, as seen by condor_q and condor_status.

C) the CM that the submit PC belonged to was always sent UDP packet
info showing the jobs as running, whereas the other CMs were told
they were flocked, regardless of where the jobs were actually running.
E.g. a submit machine in pool A with, say, 100 jobs (50 idle,
25 running in pool A and 25 running in pool B) would report
50 idle, 50 running, 0 flocked to CM A, and 50 idle, 0 running,
50 flocked to CM B. CMs A and B would then forward these UDP
updates to the condorview server.

D) the graphical info plotted by the condorview server was only
correct in 2 situations (a rough config sketch for both follows below):
1) flocking turned on AND max_jobs_running GREATER than the total
number of jobs submitted. E.g. max_jobs_running = 100 with 50 jobs
submitted works OK; max_jobs_running = 25 with 50 jobs submitted is
NOT displayed OK (idle jobs displayed OK but NO running jobs displayed).
2) flocking turned off; this works regardless of max_jobs_running
and the no. of submitted jobs. E.g. max_jobs_running = 200, submit
1000 jobs, idle and running jobs displayed OK.
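For reference, a rough sketch of the submit-machine config behind
those two situations (host names are made up and this is a sketch
rather than a copy of our config files):

    # situation 1: flocking on, graphs only correct while
    # max_jobs_running exceeds the total number of jobs submitted
    FLOCK_TO = cm-b.example.com, cm-c.example.com
    MAX_JOBS_RUNNING = 100    # OK with 50 jobs submitted, broken with more

    # situation 2: flocking off, graphs correct regardless
    FLOCK_TO =
    MAX_JOBS_RUNNING = 200    # OK even with 1000 jobs submitted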

Just reiterating that the UDP packet info going to the condorview
server looks OK in all situations.

Initial testing was done with a windows submit PC, linux CM and linux
viewserver, all running 7.0.5. Upgrading to 7.2.1 produced the same
results. As mentioned in the previous email, remote submission via a
SLES10 submit node also produced the same results.

It's almost as if the condorview collector is somehow not doing
the correct calculations when flocking is turned on and it is
therefore receiving UDP updates from all of the central managers
(each forwarding on what the schedd on the submit machine sent it).
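One way to sanity-check this is to query the view server's collector
directly for the submitter ads the schedd sends. A sketch only: the
host name is made up, and FlockedJobs is the attribute name we believe
the schedd advertises, which may differ between versions:

    condor_status -pool viewserver.example.com -submitters

    # or pull the individual counters out of the submitter ads
    condor_status -pool viewserver.example.com -submitters \
        -format "%s " Name -format "run=%d " RunningJobs \
        -format "idle=%d " IdleJobs -format "flocked=%d\n" FlockedJobs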

If anyone, particularly from the Condor Team, can shed any light
on this we would really appreciate it, and thanks to anyone that has
actually read this far! :)

Thanks.

Cheers

Greg


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Greg.Hitchen@xxxxxxxx
Sent: Monday, 16 March 2009 1:44 PM
To: condor-users@xxxxxxxxxxx
Subject: [ExternalEmail] Re: [Condor-users] Condorview issues with Job Stats

 
Forgot to mention that the same happens even if submitting
remotely from a windows machine to a sles10 submit machine.
i.e. idle jobs show up in the graphs but running jobs do not.

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Greg.Hitchen@xxxxxxxx
Sent: Monday, 16 March 2009 10:26 AM
To: condor-users@xxxxxxxxxxx
Subject: [ExternalEmail] Re: [Condor-users] Condorview issues with Job Stats

 
After much fiddling around and testing (including setting up
UPDATE_COLLECTOR_WITH_TCP = True, which was fun in itself, sorting
out linux fd's and ulimit, etc.) we have come to the following
conclusions.

1. The issue is unrelated to multiple pools reporting to one
condorview server.

2. It is also unrelated to whether UDP or TCP is used for updates.

3. It appears to depend on the number of jobs submitted to the
queue.

In summary, if say MAX_JOBS_RUNNING = 100 and 100 jobs are
submitted, then the info shown by condorview reflects reality.
If 500 jobs are submitted, then all eventually run to completion OK
(but only 100 run at a time as MAX_JOBS_RUNNING = 100), BUT the
info/graphs shown by condorview in the Job Statistics ONLY show
the IDLE jobs correctly and NO running jobs. It is only when the
number of jobs in the queue drops below a certain level that the
running jobs start to appear in the graphs; that level appears to be
of the order of 150-200 jobs in the queue. At the same time the
Machine Statistics correctly show the number of machines running
condor jobs.

All our Central Managers and Condorview Servers are running linux,
all our pool machines are windows boxes, and all submit nodes are
windows boxes with their local schedd handling their own job queues.
The testing shows that the submit machines' CPUs are not overloaded,
averaging perhaps 20-30% while handling the schedd and all the shadows.
It appears as though the schedd cannot handle the number of jobs in the
queue and report the correct info to the collector at the same time.
Our default setup is SCHEDD_INTERVAL = 30 with the default
SCHEDD_INTERVAL_TIMESLICE = 0.05. We have also tested with the "normal"
schedd interval of 300s and have even tried a timeslice of 0.5,
but the behaviour remains the same.
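For completeness, the knobs we have been varying on the submit nodes
look roughly like this (values are the ones described above; treat it
as a sketch rather than a copy of our config files):

    # how often the schedd updates the collector (seconds), and the
    # fraction of its time it may spend on that periodic housekeeping
    SCHEDD_INTERVAL = 30                 # also tried the default of 300
    SCHEDD_INTERVAL_TIMESLICE = 0.05     # also tried 0.5

    # limit on simultaneously running jobs per schedd
    MAX_JOBS_RUNNING = 100

    # collector updates over TCP instead of UDP (needed the collector's
    # file descriptor limit, ulimit -n, raised on the linux CMs)
    UPDATE_COLLECTOR_WITH_TCP = True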

Again any info/comments/insights would be appreciated.

Thanks

Cheers

Greg


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Steven Timm
Sent: Friday, 20 February 2009 11:50 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Condorview issues with Job Stats

On Fri, 20 Feb 2009, Greg.Hitchen@xxxxxxxx wrote:

>
> Hi again Steve
>
> Thanks for your responses and taking the time to help us.
>
>>> But surely that info as seen by the collectors has to come from
>>> somewhere originally, i.e. from the submit schedd, or elsewhere
>
>> It's coming from each of the submit schedd's advertising to
>> its respective collector which then forwards to the view server.
>
> OK, that makes sense. That is how I thought/assumed it should work.
> That's why I can't understand what's wrong. As you say, the schedd
> updates the collector on its local central manager, which forwards this
> on to the view_server collector.
>
>>>> If you want the condorview server to show all three of pools,
>>>> then VIEW_SERVER on pool B and pool C should be set to be
>>>> the same as the VIEW_SERVER on pool A.  You can, and many
>>>> do, aggregate the output of many collectors into one VIEW_SERVER.
>>>
>>> This is our setup. Using the previous example we have all 3
>>> collectors in pools A, B and C reporting to our only condorview server
>>> which resides in pool A.

So are you running an extra copy of the condor collector on
A to be a dedicated view server, in addition to the normal
collector on A?
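(For reference, a dedicated view server along those lines usually
amounts to a second condor_collector with history keeping enabled and
the pool collectors pointed at it; a minimal sketch, with made-up
host names:

    # on the main collector of each of pools A, B and C
    CONDOR_VIEW_HOST = viewserver.example.com

    # on the view server's own collector
    KEEP_POOL_HISTORY = True
    POOL_HISTORY_DIR  = $(LOCAL_DIR)/ViewHist
    # if it shares a machine with pool A's normal collector it
    # needs to listen on its own, non-default port
)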

>>
>> From what you described in the original message, only pool A is
>> in fact reporting to condorview. The other 2 are not.
>> Check the logs of collector startup on B and C---if they are reporting
>> it would say so.
>
> In fact, it appears that only jobs running in the same pool as the
> submit machine are getting correctly reported, regardless of which
> pool the submit node is in. It is jobs that flock to another Pool
> that are not getting reported.
>
> Perhaps my original email didn't describe things well.
>
> All collectors in pools A, B and C are configured to report to the
> view_server collector (which just happens to reside in pool A).
>
> If a submit machine in A runs jobs in A then the view server reports
> running jobs as expected.
>
> If a submit machine in A runs jobs in B or C then the view server
> does not report the jobs as running (condor_q shows them running though).

You weren't really clear on how a submit machine in A is submitting
jobs to B and C.  Flocking? Condor-G? 
>
> If a submit machine in B runs jobs in B then the view server reports
> the jobs as running as expected.
>
> If a submit machine in B runs jobs in A or C then the view server
> does not report the jobs as running (condor_q / the schedd does).

Remember that the view server really doesn't care about
jobs at all--it reports numbers of *machines* that are claimed
at any given time by any given user.

Does the total number of machines in the machine plot correspond
to the total number of machines in A+B+C?
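A quick way to check that is to compare the per-pool totals with what
the view server sees (host names made up):

    condor_status -pool cm-a.example.com -total
    condor_status -pool cm-b.example.com -total
    condor_status -pool cm-c.example.com -total
    # should roughly equal the sum of the three above
    condor_status -pool viewserver.example.com -total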

>
> The exact same trends occur for a submit machine in C.
>
> We're about to bite the bullet and try updates with TCP, even though
> the manual doesn't exactly sound encouraging! :)
>
I've been running updates with TCP for the last 4 years and only
recently found the first problem with that technique--namely
a very rare hang condition if a network port drops out in
mid-update, which the condor team promptly patched.
But if TCP was your problem, you would be seeing a fluctuating
number of machines being reported in condor_status and
in the condorview graphs.


How is your condorview server set up?
What is the value of POOL in the make_stats script?





> Cheers
>
> Greg

The condorview server, being a collector after all, can have
all the debug diagnostics that a regular collector does. Turn
all collectors up to D_SECURITY D_COMMAND D_FULLDEBUG
logging and you will see what is getting forwarded and what is not.
You should be able to do a condor_status against your
view server and see the aggregated classads from all three pools.
If not, something is wrong: maybe an access issue, or maybe
flocking.
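A minimal sketch of that, with a made-up host name (COLLECTOR_DEBUG
goes in the collector config on each central manager and on the view
server):

    COLLECTOR_DEBUG = D_SECURITY D_COMMAND D_FULLDEBUG

    # then query the view server's collector directly
    condor_status -pool viewserver.example.com
    condor_status -pool viewserver.example.com -schedd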

Steve Timm



-- 
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at: 
https://lists.cs.wisc.edu/archive/condor-users/