
Re: [Condor-users] Preemption question




Something I didn't make very clear in my previous post: although MaxJobRetirementTime limits the preemption of jobs within the time window you specify, it does not interfere with the preemption of resource claims. Therefore, you can go ahead and enable the various forms of preemption that you want, while still providing a guarantee to jobs that they can expect to run uninterrupted for a certain amount of time (barring events beyond the control of Condor). A claim that is preempted will not accept any new jobs; instead, it enters a retirement phase where it waits for the final job to finish or for the retirement time to run out.
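To make this concrete, here is a minimal sketch of a startd configuration along those lines. The 24-hour figure is only an example value; MaxJobRetirementTime is an expression evaluated in seconds, so it can also be made job-dependent.

```
# startd configuration (example values, not a recommendation):
# give this machine away to group_numi jobs first ...
RANK = (agroup == "group_numi") * 1000
# ... but let a preempted job keep running, uninterrupted,
# for up to 24 hours before it may be killed.
MaxJobRetirementTime = 24 * 3600
```

With this in place, a rank preemption still claims the machine for the new job, but the running job retires gracefully instead of being vacated immediately.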

--Dan

Dan Bradley wrote:

There are a number of different possible causes of preemption in Condor, and your policy eliminates most but not all of them. The startd RANK expression is treated as an overriding directive by the negotiator, trumping the normal user-priority based calculations (and therefore PREEMPTION_REQUIREMENTS). This means that your policy will cause precisely the kind of preemption that you have observed--members of group "numi" will preempt other users.

One solution to this is to use MaxJobRetirementTime. This allows resource claims to be preempted without jobs being killed. The expression specifies the number of seconds, measured from when the job started running, during which the job will run without interruption from kill signals, even if the claim is preempted. This applies to all forms of preemption, including startd RANK preemption.

If you do decide to use this policy mechanism, you could also consider re-enabling PREEMPTION_REQUIREMENTS, which allows the normal fair-share algorithm to reassign resource claims. If you leave this form of preemption disabled, the problem is that once a user gets a claim to a machine, the schedd may hang on to it indefinitely as long as the user keeps enough jobs waiting in the queue. Another way to address that is to use CLAIM_WORKLIFE to set an upper bound on how long a claim will keep accepting new jobs.
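As a sketch, the combination might look like the following. The 1.2 factor and the one-hour worklife are example values in the style of the manual's sample fair-share policy; tune them for your pool.

```
# negotiator configuration: allow fair-share preemption again,
# but only when the waiting user's priority is meaningfully better
PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio * 1.2
# startd configuration: a claim stops accepting new jobs after
# one hour (in seconds), so the schedd cannot hold it forever
CLAIM_WORKLIFE = 3600
```

Note that CLAIM_WORKLIFE only stops a claim from accepting additional jobs; it does not interrupt the job that is already running.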

--Dan

Steven Timm wrote:

I have a Condor pool where most of the machines are set to
never pre-empt.  I thought that this setting would mean that
pre-emption never happens, but it appears I am wrong.

On 15 of my machines I have the following settings
(and condor_config_val acknowledges they are seen both
by the startd on the machine and by the negotiator/collector).

[root@fnpc182 log]# condor_config_val -startd PREEMPT
FALSE
[root@fnpc182 log]# condor_config_val -startd PREEMPTION_REQUIREMENTS
FALSE
[root@fnpc182 log]# condor_config_val -startd START
TRUE
[root@fnpc182 log]# condor_config_val -startd RANK
(agroup == "group_numi" ) * 1000


What I want is for this machine to give priority to
starting jobs from group_numi (agroup is a group attribute that
I set in the classads of all jobs).  But I don't want it to
pre-empt an existing job of some other group if that job is
not yet finished.

What is actually happening is the following:

From StartLog
3/23 13:22:56 DaemonCore: Command received via UDP from host <131.225.167.42:19821>
3/23 13:22:56 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
3/23 13:22:56 vm1: match_info called
3/23 13:22:56 DaemonCore: Command received via UDP from host <131.225.167.42:19821>
3/23 13:22:56 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
3/23 13:22:56 vm2: match_info called
3/23 13:22:56 DaemonCore: Command received via TCP from host <131.225.167.42:30785>
3/23 13:22:56 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
3/23 13:22:56 vm1: Preempting claim has correct ClaimId.
3/23 13:22:56 vm1: New claim has sufficient rank, preempting current claim.
3/23 13:22:56 vm1: State change: preempting claim based on machine rank
3/23 13:22:56 vm1: State change: retiring due to preempting claim
3/23 13:22:56 vm1: Changing activity: Busy -> Retiring
3/23 13:22:56 vm1: State change: retirement ended/expired
3/23 13:22:56 vm1: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
3/23 13:22:56 DaemonCore: Command received via TCP from host <131.225.167.42:30786>
3/23 13:22:56 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
3/23 13:22:56 vm2: Preempting claim has correct ClaimId.
3/23 13:22:56 vm2: New claim has sufficient rank, preempting current claim.
3/23 13:22:56 vm2: State change: preempting claim based on machine rank
3/23 13:22:56 vm2: State change: retiring due to preempting claim
3/23 13:22:56 vm2: Changing activity: Busy -> Retiring
3/23 13:22:56 vm2: State change: retirement ended/expired
3/23 13:22:56 vm2: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
3/23 13:22:56 DaemonCore: Command received via TCP from host <131.225.167.42:30794>
3/23 13:22:56 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
3/23 13:22:56 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
3/23 13:22:56 Starter pid 4093 exited with status 0
3/23 13:22:56 vm1: State change: preempting claim exists - START is true or undefined
3/23 13:22:56 vm1: Remote owner is rubin@xxxxxxxx
3/23 13:22:56 vm1: State change: claiming protocol successful
3/23 13:22:56 vm1: Changing state and activity: Preempting/Vacating -> Claimed/Idle
3/23 13:22:56 DaemonCore: Command received via TCP from host <131.225.167.42:30796>
3/23 13:22:56 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
3/23 13:22:56 vm2: Got KILL_FRGN_JOB while in Preempting state, ignoring.
3/23 13:22:56 DaemonCore: Command received via UDP from host <131.225.167.42:19849>
3/23 13:22:56 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
3/23 13:22:56 Warning: can't find resource with ClaimId (<131.225.167.182:22866>#1142441053#75)
3/23 13:22:57 DaemonCore: Command received via UDP from host <131.225.167.42:19849>
3/23 13:22:57 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
3/23 13:22:57 vm2: Got RELEASE_CLAIM while in Preempting state, ignoring.
3/23 13:22:57 DaemonCore: Command received via UDP from host <131.225.167.42:19849>
3/23 13:22:57 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
3/23 13:22:57 vm2: Got RELEASE_CLAIM while in Preempting state, ignoring.
3/23 13:23:01 DaemonCore: Command received via TCP from host <131.225.167.42:30856>
3/23 13:23:01 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)

And the NegotiatorLog indicated that a job from a user in
group_numi, with priority 16, did indeed pre-empt the existing job
from a user not in group_numi, whose priority at the time
was 160.

How do we beat this?  Is there any way to give a machine preference
for starting certain jobs without having pre-emption go on?

Steve Timm





_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users