
[Condor-users] Understanding user priority and job preemption



Hello,

I have a question concerning user priorities, fair scheduling and job 
preemption. I am trying to reproduce the behavior described in the 
documentation, which I understand as follows: all other things being 
equal, a user with priority 4 should be assigned twice as many machines 
as a user with priority 8.

To test whether this works, I did the following:

1. Changed PREEMPTION_REQUIREMENTS from the default UWCS value to True.
2. Set the user priority of user A to 4 and that of user B to 8 (with 
condor_userprio).
3. Made sure that MaxJobRetirementTime is 0.
4. Submitted lots of jobs using user account A.
5. Submitted lots of jobs using user account B.

In both cases the job requirements were written so that they matched the 
same 20 nodes (out of 60 available nodes in total). After a few 
negotiation cycles, I expected to see a mix of running jobs consisting of 
2/3 A-jobs and 1/3 B-jobs. However, this did not happen. Instead, only 
A-jobs were running and all B-jobs were waiting in the queue.
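
To make this expectation concrete, here is a minimal Python sketch of the 
arithmetic. It assumes that each user's share is inversely proportional to 
the priority value, which is my reading of the documentation and not 
something taken from the Condor source. The function name expected_shares 
is made up for illustration.

```python
def expected_shares(priorities):
    """Return each user's fraction of the pool, assuming the share is
    inversely proportional to the user's priority value (an assumption)."""
    weights = {user: 1.0 / prio for user, prio in priorities.items()}
    total = sum(weights.values())
    return {user: weight / total for user, weight in weights.items()}


# Nominal priorities from step 2 and the 20 matching nodes.
shares = expected_shares({"A": 4.0, "B": 8.0})
machines = {user: share * 20 for user, share in shares.items()}

for user in sorted(shares):
    print(f"user {user}: share {shares[user]:.3f}, about {machines[user]:.1f} of 20 nodes")
```

Under that assumption A gets 2/3 of the 20 nodes and B gets 1/3, which is 
the mix I expected to see.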

I set NEGOTIATOR_DEBUG to ALL and looked into NegotiatorLog. Here is what 
I saw (with user names changed to match the above description):

3/13 17:32:04 (fd:7) (pid:15842) Phase 4.1:  Negotiating with schedds ...
3/13 17:32:04 (fd:7) (pid:15842)     NumStartdAds = 60
3/13 17:32:04 (fd:7) (pid:15842)     NormalFactor = 2.937731
3/13 17:32:04 (fd:7) (pid:15842)     MaxPrioValue = 7.959589
3/13 17:32:04 (fd:7) (pid:15842)     NumScheddAds = 2
3/13 17:32:04 (fd:7) (pid:15842)   Negotiating with userA@cluster at <10.0.0.254:16701>
3/13 17:32:04 (fd:7) (pid:15842) 0 seconds so far
3/13 17:32:04 (fd:7) (pid:15842) NEGOTIATOR_IGNORE_USER_PRIORITIES is undefined, using default value of False
3/13 17:32:04 (fd:7) (pid:15842)   Calculating schedd limit with the following parameters
3/13 17:32:04 (fd:7) (pid:15842)     ScheddPrio       = 4.107686
3/13 17:32:04 (fd:7) (pid:15842)     ScheddPrioFactor = 1.000000
3/13 17:32:04 (fd:7) (pid:15842)     scheddShare      = 0.659601
3/13 17:32:04 (fd:7) (pid:15842)     scheddAbsShare   = 0.500000
3/13 17:32:04 (fd:7) (pid:15842)     ScheddUsage      = 20
3/13 17:32:04 (fd:7) (pid:15842)     scheddLimit      = 20
3/13 17:32:04 (fd:7) (pid:15842)     MaxscheddLimit   = 20
3/13 17:32:04 (fd:7) (pid:15842) Socket to <10.0.0.254:16701> already in cache, reusing
3/13 17:32:04 (fd:7) (pid:15842)     Sending SEND_JOB_INFO/eom
3/13 17:32:04 (fd:7) (pid:15842)     Getting reply from schedd ...
3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfds=7
3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfound=1
3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfds=7
3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfound=1
3/13 17:32:04 (fd:7) (pid:15842)     Got JOB_INFO command; getting classad/eom
3/13 17:32:04 (fd:7) (pid:15842)     Request 07650.00000:
3/13 17:32:04 (fd:7) (pid:15842)       Rejected 7650.0 userA@cluster <10.0.0.254:16701>: no match found
3/13 17:32:04 (fd:7) (pid:15842)     Sending SEND_JOB_INFO/eom
3/13 17:32:04 (fd:7) (pid:15842)     Getting reply from schedd ...
3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfds=7
3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfound=1
3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfds=7
3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfound=1
3/13 17:32:04 (fd:7) (pid:15842)     Got NO_MORE_JOBS;  done negotiating
3/13 17:32:04 (fd:7) (pid:15842)   Schedd userA@cluster got all it wants; removing it.
3/13 17:32:04 (fd:7) (pid:15842)   Negotiating with userB@cluster at <10.0.0.254:16701>
3/13 17:32:04 (fd:7) (pid:15842) 0 seconds so far
3/13 17:32:04 (fd:7) (pid:15842) NEGOTIATOR_IGNORE_USER_PRIORITIES is undefined, using default value of False
3/13 17:32:04 (fd:7) (pid:15842)   Calculating schedd limit with the following parameters
3/13 17:32:04 (fd:7) (pid:15842)     ScheddPrio       = 7.959589
3/13 17:32:04 (fd:7) (pid:15842)     ScheddPrioFactor = 1.000000
3/13 17:32:04 (fd:7) (pid:15842)     scheddShare      = 0.340399
3/13 17:32:04 (fd:7) (pid:15842)     scheddAbsShare   = 0.500000
3/13 17:32:04 (fd:7) (pid:15842)     ScheddUsage      = 0
3/13 17:32:04 (fd:7) (pid:15842)     scheddLimit      = 20
3/13 17:32:04 (fd:7) (pid:15842)     MaxscheddLimit   = 20
3/13 17:32:04 (fd:7) (pid:15842) Socket to <10.0.0.254:16701> already in cache, reusing
3/13 17:32:04 (fd:7) (pid:15842)     Sending SEND_JOB_INFO/eom
3/13 17:32:04 (fd:7) (pid:15842)     Getting reply from schedd ...
3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfds=7
3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfound=1
3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfds=7
3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfound=1
3/13 17:32:04 (fd:7) (pid:15842)     Got JOB_INFO command; getting classad/eom
3/13 17:32:04 (fd:7) (pid:15842)     Request 07524.00000:
3/13 17:32:04 (fd:7) (pid:15842)       Rejected 7524.0 userB@cluster <10.0.0.254:16701>: insufficient priority
3/13 17:32:04 (fd:7) (pid:15842)     Sending SEND_JOB_INFO/eom
3/13 17:32:04 (fd:7) (pid:15842)     Getting reply from schedd ...
3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfds=7
3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfound=1
3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfds=7
3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfound=1
3/13 17:32:04 (fd:7) (pid:15842)     Got NO_MORE_JOBS;  done negotiating
3/13 17:32:04 (fd:7) (pid:15842)   Schedd userB@cluster got all it wants; removing it.
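
For anyone who wants to compare these values across negotiation cycles, 
the per-submitter parameters can be pulled out of such a log with a small 
Python sketch like the one below. The regular expressions are written 
against the line layout shown above only; the function name 
parse_negotiator_params is hypothetical, and the sample lines are copied 
from the excerpt (unwrapped).

```python
import re

# "Negotiating with <submitter> at <address>" opens a per-submitter section.
SUBMITTER_RE = re.compile(r"Negotiating with (\S+) at")
# Parameter lines look like "... (pid:NNN)     ScheddPrio       = 4.107686".
PARAM_RE = re.compile(r"\)\s+(\w+)\s+=\s+(-?\d+(?:\.\d+)?)\s*$")


def parse_negotiator_params(lines):
    """Collect the 'Name = number' lines that follow each submitter line."""
    result = {}
    current = None
    for line in lines:
        match = SUBMITTER_RE.search(line)
        if match:
            current = result.setdefault(match.group(1), {})
            continue
        match = PARAM_RE.search(line)
        if match and current is not None:
            current[match.group(1)] = float(match.group(2))
    return result


sample = [
    "3/13 17:32:04 (fd:7) (pid:15842)     NumScheddAds = 2",
    "3/13 17:32:04 (fd:7) (pid:15842)   Negotiating with userA@cluster at <10.0.0.254:16701>",
    "3/13 17:32:04 (fd:7) (pid:15842)     ScheddPrio       = 4.107686",
    "3/13 17:32:04 (fd:7) (pid:15842)     scheddShare      = 0.659601",
    "3/13 17:32:04 (fd:7) (pid:15842)     ScheddUsage      = 20",
    "3/13 17:32:04 (fd:7) (pid:15842) condor_read(): nfds=7",
]
params = parse_negotiator_params(sample)
print(params)
```

Lines that appear before the first "Negotiating with ... at" line, such as 
NumScheddAds, are skipped on purpose because they do not belong to a 
particular submitter.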

A's waiting jobs are rejected due to unavailable matches (as expected). 
However, B's waiting jobs are rejected due to "insufficient priority". I 
don't understand why. I also don't understand how the reported values of 
scheddAbsShare, scheddLimit and MaxscheddLimit are computed and what they 
mean. Finally, I am suspicious about the "Schedd ... got all it wants" 
messages: more than one job of each user is waiting in the queue, so why 
isn't the negotiator trying to match all of these jobs?
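
One thing I could check myself: the logged scheddShare values look 
consistent with shares that are inversely proportional to ScheddPrio, and 
NormalFactor looks like the sum of MaxPrioValue / ScheddPrio over both 
submitters. The Python sketch below verifies that arithmetic. The formulas 
are inferred from the logged numbers, not taken from the Condor source, 
and the function names are made up for illustration. It does not explain 
scheddAbsShare, scheddLimit or MaxscheddLimit.

```python
def normal_factor(prios):
    """Sum of max_prio / prio over all submitters (inferred, not confirmed)."""
    max_prio = max(prios.values())
    return sum(max_prio / prio for prio in prios.values())


def schedd_share(user, prios):
    """Share of one submitter: (max_prio / prio) / normal_factor (inferred)."""
    max_prio = max(prios.values())
    return (max_prio / prios[user]) / normal_factor(prios)


# ScheddPrio values as logged by the negotiator.
logged_prios = {"userA@cluster": 4.107686, "userB@cluster": 7.959589}

factor = normal_factor(logged_prios)
share_a = schedd_share("userA@cluster", logged_prios)
share_b = schedd_share("userB@cluster", logged_prios)

print(f"NormalFactor   ~ {factor:.6f}  (logged 2.937731)")
print(f"scheddShare A  ~ {share_a:.6f}  (logged 0.659601)")
print(f"scheddShare B  ~ {share_b:.6f}  (logged 0.340399)")
```

So the shares themselves seem to be computed the way I expected, which 
makes the "insufficient priority" rejection of B's jobs even more puzzling 
to me.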

Best regards,
Jan Ploski

--
Dipl.-Inform. (FH) Jan Ploski
OFFIS
Betriebliches Informationsmanagement
Escherweg 2  - 26121 Oldenburg - Germany
Fon: +49 441 9722 - 184 Fax: +49 441 9722 - 202
E-Mail: Jan.Ploski@xxxxxxxx - URL: http://www.offis.de