
Re: [Condor-users] Preemption question




Is there some place in the Condor manual where the whole notion
of a resource claim is explained?  The idea that a resource claim
could survive the finishing of one job and the starting of the next
had not occurred to me, nor was it obvious from reading the manuals.

What is the purpose of this feature in Condor?  Why would you ever
want a claimed resource to stay claimed across job boundaries?
I need to understand this before I pick a way to resolve this problem.

Also, one of my colleagues suggested that if I add
(Activity != "Busy") to the RANK expression on the machines that
have one, I would get the effect I want.  Does that make sense
to anyone?

Steve Timm


On Fri, 24 Mar 2006, Dan Bradley wrote:


Something I didn't make very clear in my previous post: although
MaxJobRetirementTime limits the preemption of jobs within the time
window you specify, it does not interfere with the preemption of
resource claims.  Therefore, you can go ahead and enable the various
forms of preemption that you want, while still providing a guarantee to
jobs that they can expect to run uninterrupted for a certain amount of
time (barring events beyond the control of Condor).  A claim that is
preempted will not accept any new jobs; instead, it enters a retirement
phase where it waits for the final job to finish or for the retirement
time to run out.
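
As a minimal sketch in the startd configuration (the 24-hour window is
an arbitrary illustrative value, not a recommendation):

  # Sketch: give every job up to 24 hours (in seconds) of uninterrupted
  # run time after it starts, even if its claim is preempted.
  MaxJobRetirementTime = 24 * 60 * 60

The expression is evaluated against the machine and job ClassAds, so it
could also reference job attributes if different jobs deserve different
retirement windows.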

--Dan

Dan Bradley wrote:

There are a number of different possible causes of preemption in Condor,
and your policy eliminates most but not all of them.  The startd RANK
expression is treated as an overriding directive by the negotiator,
trumping the normal user-priority based calculations (and therefore
PREEMPTION_REQUIREMENTS).  This means that your policy will cause
precisely the kind of preemption that you have observed--members of
group "numi" will preempt other users.

One solution to this is to use MaxJobRetirementTime.  This allows you to
have preemption of resource claims without the killing of jobs.  The
expression specifies the number of seconds, measured from when the job
started running, during which the job is allowed to run without
interruption from kill signals, even if its claim is preempted.  This
applies to all forms of preemption, including startd RANK preemption.

If you do decide to use this policy mechanism, then you could consider
turning back on PREEMPTION_REQUIREMENTS, which allows the normal fair
share algorithm to adjust resource claims.  If you disable this form of
preemption, then the problem is that once a user gets a claim to a
machine, the schedd may hang on to it indefinitely if the user keeps
enough jobs waiting in the queue.  Another way to solve that is to use
CLAIM_WORKLIFE to set an upper bound on how long a claim will keep
accepting more jobs.
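
A sketch of that combination (the one-hour worklife and the 1.2
priority ratio are illustrative values only, not recommendations):

  # Sketch: a claim stops accepting new jobs after one hour (seconds),
  # so the slot goes back to the negotiator for re-matching.
  CLAIM_WORKLIFE = 3600
  # Sketch: re-enable fair-share preemption when the running user's
  # priority is sufficiently worse (numerically larger) than the
  # waiting user's.
  PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio * 1.2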

--Dan

Steven Timm wrote:



I have a Condor pool where most of the machines are set to
never preempt.  I thought that this setting meant that preemption
would not happen, but it appears I was wrong.

On 15 of my machines I have the following settings
(and condor_config_val acknowledges they are seen both
by the startd on the machine and by the negotiator/collector).

[root@fnpc182 log]# condor_config_val -startd PREEMPT
FALSE
[root@fnpc182 log]# condor_config_val -startd PREEMPTION_REQUIREMENTS
FALSE
[root@fnpc182 log]# condor_config_val -startd START
TRUE
[root@fnpc182 log]# condor_config_val -startd RANK
(agroup == "group_numi" ) * 1000


What I want is for this machine to give priority to starting
jobs from group_numi (agroup is a group attribute that I set in the
ClassAds of all jobs), but I don't want it to preempt an existing
job from some other group before that job has finished.

What is actually happening is the following:

From StartLog
3/23 13:22:56 DaemonCore: Command received via UDP from host <131.225.167.42:19821>
3/23 13:22:56 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
3/23 13:22:56 vm1: match_info called
3/23 13:22:56 DaemonCore: Command received via UDP from host <131.225.167.42:19821>
3/23 13:22:56 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
3/23 13:22:56 vm2: match_info called
3/23 13:22:56 DaemonCore: Command received via TCP from host <131.225.167.42:30785>
3/23 13:22:56 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
3/23 13:22:56 vm1: Preempting claim has correct ClaimId.
3/23 13:22:56 vm1: New claim has sufficient rank, preempting current claim.
3/23 13:22:56 vm1: State change: preempting claim based on machine rank
3/23 13:22:56 vm1: State change: retiring due to preempting claim
3/23 13:22:56 vm1: Changing activity: Busy -> Retiring
3/23 13:22:56 vm1: State change: retirement ended/expired
3/23 13:22:56 vm1: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
3/23 13:22:56 DaemonCore: Command received via TCP from host <131.225.167.42:30786>
3/23 13:22:56 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
3/23 13:22:56 vm2: Preempting claim has correct ClaimId.
3/23 13:22:56 vm2: New claim has sufficient rank, preempting current claim.
3/23 13:22:56 vm2: State change: preempting claim based on machine rank
3/23 13:22:56 vm2: State change: retiring due to preempting claim
3/23 13:22:56 vm2: Changing activity: Busy -> Retiring
3/23 13:22:56 vm2: State change: retirement ended/expired
3/23 13:22:56 vm2: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
3/23 13:22:56 DaemonCore: Command received via TCP from host <131.225.167.42:30794>
3/23 13:22:56 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
3/23 13:22:56 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
3/23 13:22:56 Starter pid 4093 exited with status 0
3/23 13:22:56 vm1: State change: preempting claim exists - START is true or undefined
3/23 13:22:56 vm1: Remote owner is rubin@xxxxxxxx
3/23 13:22:56 vm1: State change: claiming protocol successful
3/23 13:22:56 vm1: Changing state and activity: Preempting/Vacating -> Claimed/Idle
3/23 13:22:56 DaemonCore: Command received via TCP from host <131.225.167.42:30796>
3/23 13:22:56 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
3/23 13:22:56 vm2: Got KILL_FRGN_JOB while in Preempting state, ignoring.
3/23 13:22:56 DaemonCore: Command received via UDP from host <131.225.167.42:19849>
3/23 13:22:56 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
3/23 13:22:56 Warning: can't find resource with ClaimId (<131.225.167.182:22866>#1142441053#75)
3/23 13:22:57 DaemonCore: Command received via UDP from host <131.225.167.42:19849>
3/23 13:22:57 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
3/23 13:22:57 vm2: Got RELEASE_CLAIM while in Preempting state, ignoring.
3/23 13:22:57 DaemonCore: Command received via UDP from host <131.225.167.42:19849>
3/23 13:22:57 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
3/23 13:22:57 vm2: Got RELEASE_CLAIM while in Preempting state, ignoring.
3/23 13:23:01 DaemonCore: Command received via TCP from host <131.225.167.42:30856>
3/23 13:23:01 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)

And NegotiatorLog indicated that there was indeed a job from a user
in group_numi, with priority 16, which preempted the existing job from
a user not in group_numi who at the time had a priority of 160.  (In
Condor a lower numeric user priority is better, so the group_numi user
also had the better fair-share priority.)

How do we beat this?  Is there any way to give preference for
starting jobs without having preemption go on?

Steve Timm







_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users




--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team