
RE: [Condor-users] When do machine RANK settings apply?



> > This got close the desired behaviour but lower priority users were
> > still preempting higher priority users on occasion. And users new to
> > the system were getting more resources than users with long-time
> > running jobs if they were all the same (highest) priority level.
> 
> Yes I gave up trying to have the negotiator reference both
> the currently running job and the job it is evaluating. It
> just doesn't seem to work. Using machine RANK is much easier
> if you don't mind preemption (or are happy to work round it
> with long retirement promises)
>  
> > I'm now experimenting with your suggestion of:
> > 
> > PREEMPTION_REQUIREMENTS = False
> > PRIORITY_HALFLIFE = 1
> > RANK = (TARGET.JobPrio * 2880)
> > 
> > With our very long retirement time (enough for our jobs to finish
> > normally) this should be okay. I'll let you know how it works out.
> 
> hope it all works...out of interest how many 
> execute/submission nodes exist in your farm and are any/all 
> of them windows.

Hmm. Actually, it isn't working out at all. I had users with jobs in the
system with JobPrios of 10 and 11 respectively. I sent a single job in
with a JobPrio of 16 and expected my job to run on the next available
machine. Not the case. My job is still sitting there while the two
lower-JobPrio users are passing jobs through the system. I have
attached some cumulative logging output that I'm collecting every 10
minutes. I have a custom command called abc_who that groups users' jobs
by priority and shows running and queued. You can see that I (ichesal)
have a priority 16 job queued, but that bchan's priority 12 jobs keep
getting resources.

Looking at the NegotiatorLog, it wants to preempt bchan's jobs for mine
but it can't because PREEMPTION_REQUIREMENTS is False. I think what I'm
observing here is bchan's schedd holding on to the startd machine
after a job finishes and just running the next job in her list. Why is
my higher-ranking job not taking over this machine? bchan's jobs rank at
12*2880 = 34560 whereas my job ranks at 16*2880 = 46080. Here is a snippet
from my NegotiatorLog:

1/5 13:29:13   Negotiating with ichesal@xxxxxxxxxx at <137.57.142.112:40413>
1/5 13:29:13     Request 00094.00000:
1/5 13:29:13       Rejected 94.0 ichesal@xxxxxxxxxx <137.57.142.112:40413>: PREEMPTION_REQUIREMENTS == False
1/5 13:29:13     Got NO_MORE_JOBS;  done negotiating
<snip>
1/5 13:31:45   Negotiating with ichesal@xxxxxxxxxx at <137.57.142.112:40413>
1/5 13:31:45     Request 00094.00000:
1/5 13:31:45       Preempting bchan@xxxxxxxxxx (prio=2.37) on vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx for ichesal@xxxxxxxxxx (prio=0.50)
1/5 13:31:45       Matched 94.0 ichesal@xxxxxxxxxx <137.57.142.112:40413> preempting bchan@xxxxxxxxxx <137.57.176.180:4964>
1/5 13:31:45       Successfully matched with vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1/5 13:31:45     Got NO_MORE_JOBS;  done negotiating
<snip>
1/5 13:38:38   Negotiating with ichesal@xxxxxxxxxx at <137.57.142.112:40413>
1/5 13:38:38     Request 00094.00000:
1/5 13:38:38       Preempting bchan@xxxxxxxxxx (prio=2.68) on vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx for ichesal@xxxxxxxxxx (prio=0.92)
1/5 13:38:38       Matched 94.0 ichesal@xxxxxxxxxx <137.57.142.112:40413> preempting bchan@xxxxxxxxxx <137.57.176.182:3796>
1/5 13:38:38       Successfully matched with vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1/5 13:38:38     Got NO_MORE_JOBS;  done negotiating
<snip>
1/5 13:48:02   Negotiating with ichesal@xxxxxxxxxx at <137.57.142.112:40413>
1/5 13:48:02     Request 00094.00000:
1/5 13:48:02       Rejected 94.0 ichesal@xxxxxxxxxx <137.57.142.112:40413>: PREEMPTION_REQUIREMENTS == False
1/5 13:48:02     Got NO_MORE_JOBS;  done negotiating
<snip>
1/5 13:52:08   Negotiating with ichesal@xxxxxxxxxx at <137.57.142.112:40413>
1/5 13:52:08     Request 00094.00000:
1/5 13:52:08       Preempting bchan@xxxxxxxxxx (prio=2.69) on vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx for ichesal@xxxxxxxxxx (prio=0.50)
1/5 13:52:08       Matched 94.0 ichesal@xxxxxxxxxx <137.57.142.112:40413> preempting bchan@xxxxxxxxxx <137.57.176.182:3796>
1/5 13:52:08       Successfully matched with vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1/5 13:52:08     Got NO_MORE_JOBS;  done negotiating
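
To see at a glance how often each negotiation cycle ended in a rejection versus a preempting match, the excerpt can be tallied with a short script. This is only a rough sketch: the regular expressions are assumptions based on the few lines quoted above (with wrapped lines rejoined), not on any documented NegotiatorLog format, and the function name summarize is made up for illustration.

```python
import re
from collections import Counter

# Lines copied from the NegotiatorLog excerpt, with mail-wrapped lines rejoined.
LOG = """\
1/5 13:29:13       Rejected 94.0 ichesal@xxxxxxxxxx <137.57.142.112:40413>: PREEMPTION_REQUIREMENTS == False
1/5 13:31:45       Preempting bchan@xxxxxxxxxx (prio=2.37) on vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx for ichesal@xxxxxxxxxx (prio=0.50)
1/5 13:38:38       Preempting bchan@xxxxxxxxxx (prio=2.68) on vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx for ichesal@xxxxxxxxxx (prio=0.92)
1/5 13:48:02       Rejected 94.0 ichesal@xxxxxxxxxx <137.57.142.112:40413>: PREEMPTION_REQUIREMENTS == False
1/5 13:52:08       Preempting bchan@xxxxxxxxxx (prio=2.69) on vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx for ichesal@xxxxxxxxxx (prio=0.50)
"""

# Assumed line shapes, inferred only from the excerpt above.
REJECTED = re.compile(r"Rejected (\S+) (\S+) <[^>]+>: (.+)$")
PREEMPTING = re.compile(
    r"Preempting (\S+) \(prio=([\d.]+)\) on (\S+) for (\S+) \(prio=([\d.]+)\)"
)

def summarize(log_text):
    """Return a list of (time, kind, detail) tuples, one per matching line."""
    events = []
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        timestamp = fields[1]  # second field is HH:MM:SS
        m = REJECTED.search(line)
        if m:
            events.append((timestamp, "rejected", m.group(3)))  # detail: reason
            continue
        m = PREEMPTING.search(line)
        if m:
            events.append((timestamp, "preempting", m.group(3)))  # detail: slot
    return events

events = summarize(LOG)
counts = Counter(kind for _, kind, _ in events)
for timestamp, kind, detail in events:
    print(timestamp, kind, detail)
print(dict(counts))
```

Run against the full log rather than this excerpt, the same tally would show whether the rejections and the preempting matches alternate as they do here.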

I've been tracking user priority, a custom condor_status view and another
view of running and queued jobs every ten minutes. Here is the data that
is relevant to this time frame in the negotiator log. You can see that
every ten minutes the EUPs are dumped for the active users. Then comes
custom condor_status output that lists the jobid, the rank of the job, the
owner, and the activity as well as the vmem in the machine and the
imagesize of the job. Then there's the output from my custom tool that
shows running and queued jobs sorted by priority and by user.
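
For readers without access to abc_who, the grouping it reports can be approximated in a few lines. This is a hypothetical sketch, not the tool's source: the (job_prio, user, state) record shape, the function name group_jobs, and the sample records are all invented for illustration.

```python
from collections import defaultdict

def group_jobs(jobs):
    """Count running and queued jobs per (priority, user).

    jobs: iterable of (job_prio, user, state) tuples, where state is
    "running" or "queued" (a hypothetical record shape).
    Returns rows sorted by priority (highest first), then by user name.
    """
    counts = defaultdict(lambda: {"running": 0, "queued": 0})
    for prio, user, state in jobs:
        counts[(prio, user)][state] += 1
    ordered = sorted(counts.items(), key=lambda kv: (-kv[0][0], kv[0][1]))
    return [(prio, user, c["running"], c["queued"]) for (prio, user), c in ordered]

# Made-up sample records, far fewer than in the real pool.
sample = [
    (16, "ichesal", "queued"),
    (12, "bchan", "running"),
    (12, "bchan", "running"),
    (12, "bchan", "queued"),
    (12, "kbrunham", "running"),
    (11, "ichesal", "queued"),
]

rows = group_jobs(sample)
print("Priority   User         Running    Queued")
for prio, user, running, queued in rows:
    print(f"{prio:>6}     {user:<12} {running:<10} {queued}")
```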

What's weird about this is that even though I (ichesal) haven't
actually gotten to run any jobs, my EUP has increased from 13:30 to
13:50! What's up with that?

And you can see bchan's priority 12 jobs running and my priority 16 job
sitting there idle. There's something not quite right about this setup.
RANK is not proving effective all on its own. I don't think bchan's
schedd is releasing the startd even though there are higher ranking
jobs. Does this seem plausible?
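
The rank arithmetic behind that expectation can be written out as a tiny check. This is a toy model only: it reproduces the RANK = (TARGET.JobPrio * 2880) calculation quoted earlier and assumes a strictly higher rank should win. It does not model how the startd, schedd, or negotiator actually behave, and the function names are invented.

```python
RANK_FACTOR = 2880  # multiplier from RANK = (TARGET.JobPrio * 2880)

def machine_rank(job_prio):
    """Rank a machine would assign to a job under the quoted RANK expression."""
    return job_prio * RANK_FACTOR

def prefers_candidate(running_prio, candidate_prio):
    """Toy comparison (assumption): a strictly higher rank is preferred."""
    return machine_rank(candidate_prio) > machine_rank(running_prio)

print(machine_rank(12))           # bchan's jobs
print(machine_rank(16))           # ichesal's job
print(prefers_candidate(12, 16))  # would the JobPrio 16 job be preferred?
```

Note that the rank values in the condor_status output below (35443, for example) are not exactly 34560, so the expression in effect may include more than this one term. The sketch covers only the arithmetic stated in the message.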
 
----------
Wed Jan  5 13:30:00 EST 2005
----------
Last Priority Update:  1/5  13:19
                                    Effective
User Name                           Priority 
------------------------------      ---------
clam@xxxxxxxxxx                          0.50
ichesal@xxxxxxxxxx                       0.50
bchan@xxxxxxxxxx                         2.28
kbrunham@xxxxxxxxxx                      2.72
------------------------------      ---------
Number of users shown: 4                           
----------
14.29	35443.000000	bchan@xxxxxxxxxx	Retiring	2097151	754568
14.45	35464.000000	bchan@xxxxxxxxxx	Busy	2097151	134016
11.16	34644.000000	bchan@xxxxxxxxxx	Retiring	2097151	1951884
14.22	35459.000000	bchan@xxxxxxxxxx	Busy	2097151	134016
5.0	34742.000000	kbrunham@xxxxxxxxxx	Retiring	2097151	768168
14.44	35455.000000	bchan@xxxxxxxxxx	Retiring	2097151	488808
11.1	29074.000000	bchan@xxxxxxxxxx	Retiring	2097151	701064
14.85	35450.000000	bchan@xxxxxxxxxx	Retiring	2097151	490356
14.17	35450.000000	bchan@xxxxxxxxxx	Retiring	2097151	370024
11.7	29371.000000	bchan@xxxxxxxxxx	Retiring	2097151	1483080
14.26	35455.000000	bchan@xxxxxxxxxx	Busy	2097151	518208
11.24	35279.000000	bchan@xxxxxxxxxx	Busy	2097151	824008
14.21	35433.000000	bchan@xxxxxxxxxx	Retiring	2097151	750952
12.78	35413.000000	bchan@xxxxxxxxxx	Retiring	2097151	490872
11.13	35286.000000	bchan@xxxxxxxxxx	Busy	2097151	802632
14.28	35443.000000	bchan@xxxxxxxxxx	Retiring	2097151	1278852
11.8	35276.000000	bchan@xxxxxxxxxx	Retiring	2097151	857160
11.17	35436.000000	bchan@xxxxxxxxxx	Retiring	2097151	1637768

----------
abc_who (version 1.1)
-----------------------------------------------------------------
Priority   User         Running    Queued     Constraint Set
-----------------------------------------------------------------
    16     ichesal      0          1          Set 1
    12     bchan        17         136        Set 1
    12     kbrunham     1          413        Set 1
    11     ichesal      0          1          Set 1
    11     kbrunham     0          252        Set 1
    10     bchan        0          119        Set 1
    10     clam         0          300        Set 1
    10     kbrunham     0          52         Set 1
-----------------------------------------------------------------
----------
----------
Wed Jan  5 13:40:00 EST 2005
----------
Last Priority Update:  1/5  13:31
                                    Effective
User Name                           Priority 
------------------------------      ---------
clam@xxxxxxxxxx                          0.50
ichesal@xxxxxxxxxx                       0.50
kbrunham@xxxxxxxxxx                      2.29
bchan@xxxxxxxxxx                         2.37
------------------------------      ---------
Number of users shown: 4                           
----------
14.29	35443.000000	bchan@xxxxxxxxxx	Retiring	2097151	754568
14.45	35464.000000	bchan@xxxxxxxxxx	Busy	2097151	134016
11.16	34644.000000	bchan@xxxxxxxxxx	Retiring	2097151	1951884
14.22	35459.000000	bchan@xxxxxxxxxx	Busy	2097151	134016
5.0	34742.000000	kbrunham@xxxxxxxxxx	Retiring	2097151	768168
14.44	35455.000000	bchan@xxxxxxxxxx	Retiring	2097151	488808
11.1	29074.000000	bchan@xxxxxxxxxx	Retiring	2097151	701064
14.85	35450.000000	bchan@xxxxxxxxxx	Retiring	2097151	490356
14.17	35450.000000	bchan@xxxxxxxxxx	Retiring	2097151	370024
11.7	29371.000000	bchan@xxxxxxxxxx	Retiring	2097151	1483080
14.26	35455.000000	bchan@xxxxxxxxxx	Busy	2097151	518208
11.24	35279.000000	bchan@xxxxxxxxxx	Busy	2097151	824008
14.21	35433.000000	bchan@xxxxxxxxxx	Retiring	2097151	750952
12.78	35413.000000	bchan@xxxxxxxxxx	Retiring	2097151	490872
11.13	35286.000000	bchan@xxxxxxxxxx	Busy	2097151	802632
14.28	35443.000000	bchan@xxxxxxxxxx	Retiring	2097151	1278852
11.8	35276.000000	bchan@xxxxxxxxxx	Retiring	2097151	857160
11.17	35436.000000	bchan@xxxxxxxxxx	Retiring	2097151	1637768

----------
abc_who (version 1.1)
-----------------------------------------------------------------
Priority   User         Running    Queued     Constraint Set
-----------------------------------------------------------------
    16     ichesal      0          1          Set 1
    12     bchan        17         131        Set 1
    12     kbrunham     1          413        Set 1
    11     ichesal      0          1          Set 1
    11     kbrunham     0          252        Set 1
    10     bchan        0          119        Set 1
    10     clam         0          152        Set 1
    10     kbrunham     0          52         Set 1
-----------------------------------------------------------------
----------
----------
Wed Jan  5 13:50:00 EST 2005
----------
Last Priority Update:  1/5  13:38
                                    Effective
User Name                           Priority 
------------------------------      ---------
clam@xxxxxxxxxx                          0.50
ichesal@xxxxxxxxxx                       0.92
kbrunham@xxxxxxxxxx                      1.40
bchan@xxxxxxxxxx                         2.68
------------------------------      ---------
Number of users shown: 4                           
----------
14.30	35450.000000	bchan@xxxxxxxxxx	Retiring	2097151	134016
14.45	35464.000000	bchan@xxxxxxxxxx	Retiring	2097151	492904
11.16	34644.000000	bchan@xxxxxxxxxx	Retiring	2097151	1951884
14.50	35469.000000	bchan@xxxxxxxxxx	Retiring	2097151	489832
5.0	34742.000000	kbrunham@xxxxxxxxxx	Retiring	2097151	768168
14.44	35455.000000	bchan@xxxxxxxxxx	Retiring	2097151	750952
11.1	29074.000000	bchan@xxxxxxxxxx	Retiring	2097151	701064
14.23	35450.000000	bchan@xxxxxxxxxx	Retiring	2097151	168672
14.17	35450.000000	bchan@xxxxxxxxxx	Retiring	2097151	763240
11.7	29371.000000	bchan@xxxxxxxxxx	Busy	2097151	1483080
14.19	35455.000000	bchan@xxxxxxxxxx	Retiring	2097151	489832
11.24	35279.000000	bchan@xxxxxxxxxx	Retiring	2097151	899912
14.24	35450.000000	bchan@xxxxxxxxxx	Retiring	2097151	492408
14.25	35450.000000	bchan@xxxxxxxxxx	Retiring	2097151	500668
11.13	35286.000000	bchan@xxxxxxxxxx	Retiring	2097151	802632
14.28	35443.000000	bchan@xxxxxxxxxx	Retiring	2097151	1278852
11.8	35276.000000	bchan@xxxxxxxxxx	Retiring	2097151	857160
11.17	35436.000000	bchan@xxxxxxxxxx	Retiring	2097151	1710920

----------
abc_who (version 1.1)
-----------------------------------------------------------------
Priority   User         Running    Queued     Constraint Set
-----------------------------------------------------------------
    16     ichesal      0          1          Set 1
    12     bchan        17         123        Set 1
    12     kbrunham     1          413        Set 1
    11     ichesal      0          1          Set 1
    11     kbrunham     0          252        Set 1
    10     bchan        0          119        Set 1
    10     clam         0          152        Set 1
    10     kbrunham     0          52         Set 1
-----------------------------------------------------------------
----------
----------
Wed Jan  5 13:58:00 EST 2005
----------
Last Priority Update:  1/5  13:51
                                    Effective
User Name                           Priority 
------------------------------      ---------
clam@xxxxxxxxxx                          0.50
ichesal@xxxxxxxxxx                       0.50
kbrunham@xxxxxxxxxx                      2.31
bchan@xxxxxxxxxx                         2.69
------------------------------      ---------
Number of users shown: 4                           
----------
14.30	35450.000000	bchan@xxxxxxxxxx	Retiring	2097151	488808
14.45	35464.000000	bchan@xxxxxxxxxx	Retiring	2097151	492904
11.16	34644.000000	bchan@xxxxxxxxxx	Busy	2097151	1951884
14.50	35469.000000	bchan@xxxxxxxxxx	Retiring	2097151	750952
5.0	34742.000000	kbrunham@xxxxxxxxxx	Busy	2097151	768168
14.44	35455.000000	bchan@xxxxxxxxxx	Retiring	2097151	750952
11.1	29074.000000	bchan@xxxxxxxxxx	Busy	2097151	701064
14.23	35450.000000	bchan@xxxxxxxxxx	Retiring	2097151	750952
14.17	35450.000000	bchan@xxxxxxxxxx	Retiring	2097151	763240
11.7	29371.000000	bchan@xxxxxxxxxx	Retiring	2097151	1483080
14.46	35483.000000	bchan@xxxxxxxxxx	Busy	2097151	0
11.24	35279.000000	bchan@xxxxxxxxxx	Retiring	2097151	837384
14.41	35476.000000	bchan@xxxxxxxxxx	Busy	2097151	29724
14.42	35476.000000	bchan@xxxxxxxxxx	Busy	2097151	359712
11.13	35286.000000	bchan@xxxxxxxxxx	Retiring	2097151	802632
14.31	35443.000000	bchan@xxxxxxxxxx	Busy	2097151	134016
11.8	35276.000000	bchan@xxxxxxxxxx	Busy	2097151	857160
11.17	35436.000000	bchan@xxxxxxxxxx	Retiring	2097151	1841928

----------
abc_who (version 1.1)
-----------------------------------------------------------------
Priority   User         Running    Queued     Constraint Set
-----------------------------------------------------------------
    16     ichesal      0          1          Set 1
    12     bchan        17         119        Set 1
    12     kbrunham     1          413        Set 1
    11     ichesal      0          1          Set 1
    11     kbrunham     0          252        Set 1
    10     bchan        0          119        Set 1
    10     clam         0          152        Set 1
    10     kbrunham     0          52         Set 1
-----------------------------------------------------------------
----------


As for your question about our setup, we have the following
configuration:

Central Manager: RH 9 machines
Dedicated Startd Machines: Mostly Win2k machines, dual proc
Client Schedd Machines: Mix of Win2k/WinXP machines, dual proc

There are 9 dedicated startd machines running jobs for users. This is
our test pool. The plan was to expand this to over 100 on Monday, but
with things the way they are right now, and this scheduling stuff
causing me so much trouble, I'm probably going to push that off. All in
all, if Condor works out it will be sitting on upwards of 300 dedicated
startd machines by the end of the month at our site, with another 1000
or so machines at other sites around the world. There are about 4
submission nodes in use right now. Each user has their desktop machine
set up as a submission node. There are only really 4 active users of our
test installation right now.
 
> Since you are running 6.7 series I would be interested to 
> know since our windows pool was unable to handle job queues 
> over about 100 or so jobs before the submitters schedd's and 
> shadows began to have serious issues.

We were seeing intermittent crashes when the user queued more than 100
jobs with 6.7.2, mostly due to log file writing issues. With 6.7.3 all
crashing has disappeared, and we've really been trying to pound on it.
We run vanilla jobs, with minimal file transfer done using Condor:
basically a small startup script to kick off the job and that's it.

- Ian