[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] Preemption question




I have a condor pool where most of the machines are set to
never pre-empt.  I thought that this setting would mean that
pre-emption doesn't happen but it  appears I am wrong.

On 15 of my machines I have the following settings
(and condor_config_val acknowledges they are seen both
by the startd on the machine and by the negotiator/collector).

[root@fnpc182 log]# condor_config_val -startd PREEMPT
FALSE
[root@fnpc182 log]# condor_config_val -startd PREEMPTION_REQUIREMENTS
FALSE
[root@fnpc182 log]# condor_config_val -startd START
TRUE
[root@fnpc182 log]# condor_config_val -startd RANK
(agroup == "group_numi" ) * 1000


What I want to happen is to give this machine priority of
starting jobs from group_numi, (agroup is a group attribute that
I set in the classads of all jobs).  But I don't want it to
pre-empt an existing job of some other group if that job is
not yet finished yet.

What is actually happening is the following:

From StartLog
3/23 13:22:56 DaemonCore: Command received via UDP from host <131.225.167.42:198
21>
3/23 13:22:56 DaemonCore: received command 440 (MATCH_INFO), calling handler (co
mmand_match_info)
3/23 13:22:56 vm1: match_info called
3/23 13:22:56 DaemonCore: Command received via UDP from host <131.225.167.42:198
21>
3/23 13:22:56 DaemonCore: received command 440 (MATCH_INFO), calling handler (co
mmand_match_info)
3/23 13:22:56 vm2: match_info called
3/23 13:22:56 DaemonCore: Command received via TCP from host <131.225.167.42:307
85>
3/23 13:22:56 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler
(command_request_claim)
3/23 13:22:56 vm1: Preempting claim has correct ClaimId.
3/23 13:22:56 vm1: New claim has sufficient rank, preempting current claim.
3/23 13:22:56 vm1: State change: preempting claim based on machine rank
3/23 13:22:56 vm1: State change: retiring due to preempting claim
3/23 13:22:56 vm1: Changing activity: Busy -> Retiring
3/23 13:22:56 vm1: State change: retirement ended/expired
3/23 13:22:56 vm1: Changing state and activity: Claimed/Retiring -> Preempting/V
acating
3/23 13:22:56 DaemonCore: Command received via TCP from host <131.225.167.42:307
86>
3/23 13:22:56 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler
(command_request_claim)
3/23 13:22:56 vm2: Preempting claim has correct ClaimId.
3/23 13:22:56 vm2: New claim has sufficient rank, preempting current claim.
3/23 13:22:56 vm2: State change: preempting claim based on machine rank
3/23 13:22:56 vm2: State change: retiring due to preempting claim
3/23 13:22:56 vm2: Changing activity: Busy -> Retiring
3/23 13:22:56 vm2: State change: retirement ended/expired
3/23 13:22:56 vm2: Changing state and activity: Claimed/Retiring -> Preempting/V
acating
3/23 13:22:56 DaemonCore: Command received via TCP from host <131.225.167.42:307
94>
3/23 13:22:56 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), call
ing handler (command_handler)
3/23 13:22:56 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
3/23 13:22:56 Starter pid 4093 exited with status 0
3/23 13:22:56 vm1: State change: preempting claim exists - START is true or unde
fined
3/23 13:22:56 vm1: Remote owner is rubin@xxxxxxxx
3/23 13:22:56 vm1: State change: claiming protocol successful
3/23 13:22:56 vm1: Changing state and activity: Preempting/Vacating -> Claimed/I
dle
3/23 13:22:56 DaemonCore: Command received via TCP from host <131.225.167.42:307
96>
3/23 13:22:56 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), call
ing handler (command_handler)
3/23 13:22:56 vm2: Got KILL_FRGN_JOB while in Preempting state, ignoring.
3/23 13:22:56 DaemonCore: Command received via UDP from host <131.225.167.42:198
49>
3/23 13:22:56 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler
(command_release_claim)
3/23 13:22:56 Warning: can't find resource with ClaimId (<131.225.167.182:22866>
#1142441053#75)
3/23 13:22:57 DaemonCore: Command received via UDP from host <131.225.167.42:198
49>
3/23 13:22:57 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler
(command_release_claim)
3/23 13:22:57 vm2: Got RELEASE_CLAIM while in Preempting state, ignoring.
3/23 13:22:57 DaemonCore: Command received via UDP from host <131.225.167.42:198
49>
3/23 13:22:57 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler
(command_release_claim)
3/23 13:22:57 vm2: Got RELEASE_CLAIM while in Preempting state, ignoring.
3/23 13:23:01 DaemonCore: Command received via TCP from host <131.225.167.42:308
56>
3/23 13:23:01 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler
 (command_activate_claim)

\
and in NegotiatorLog it indicated that there was indeed
a job from a user in group_numi, with priority 16, who pre-empted
the existing job from a user not in group_numi, at the time had a priority
of 160.

How do we beat this, is there any way to give preference for
starting jobs without having pre-emption go on?

Steve Timm



--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team