[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] preemption / state change question



I have the following setup

preempt = false
preemption_requirements = (CurrentTime - EnteredCurrentState) < (30 * 60)
rank = <a tiering structure based on job self selection >
preemption_rank = (EnteredCurrentState - CurrentRank ) + RemoteUserPrio
(rank is small enough to normally cause the shortest running jobs to go first)

This seems to work fine most of the time but recently a job from a use
at Normal rank (10) was submitted and repeatedly ran for about 20 or
so attempts constantly getting preempted by other jobs after 5 or so
mins. All fine and as it should be.

It eventually got a decent run at things and ran for 4 hrs 45mins

At this point it was preempted by a high job (rank 100)

all vanilla universe so a bit of a pain...

The preemption rank did the right thing (all shorter jobs die before
this one) but is still should not have happened right?

There should be no additional state transitions on the vm once a job
is running until it ends (by it self or by preemption) yes?

does current time sometimes screw up? if so this is a real pain since
it means I am unable to use time based preemption requirements and
just have to switch it all off till I can get to 6.7.1.

here's the start log on the machine

9/24 02:50:56 vm1: match_info called
9/24 02:50:56 vm1: Received match <172.16.11.54:3669>#6878612797
9/24 02:50:56 vm1: State change: match notification protocol successful
9/24 02:50:56 vm1: Changing state: Unclaimed -> Matched
9/24 02:50:57 DaemonCore: Command received via TCP from host <172.16.34.10:1201>
9/24 02:50:57 DaemonCore: received command 442 (REQUEST_CLAIM),
calling handler (command_request_claim)
9/24 02:50:57 vm1: Request accepted.
9/24 02:50:57 vm1: Remote owner is Foo@xxxxx
9/24 02:50:57 vm1: State change: claiming protocol successful
9/24 02:50:57 vm1: Changing state: Matched -> Claimed
9/24 02:51:00 DaemonCore: Command received via TCP from host <172.16.34.10:1208>
9/24 02:51:00 DaemonCore: received command 444 (ACTIVATE_CLAIM),
calling handler (command_activate_claim)
9/24 02:51:00 vm1: Got activate_claim request from shadow (<172.16.34.10:1208>)
9/24 02:51:00 vm1: Remote job ID is 240.2
9/24 02:51:00 vm1: Got universe "VANILLA" (5) from request classad
9/24 02:51:00 vm1: State change: claim-activation protocol successful
9/24 02:51:00 vm1: Changing activity: Idle -> Busy
9/24 04:20:01 State change: RunBenchmarks is TRUE
9/24 04:20:01 vm2: Changing activity: Idle -> Benchmarking
9/24 04:20:05 State change: benchmarks completed
9/24 04:20:05 vm2: Changing activity: Benchmarking -> Idle
9/24 06:57:19 DaemonCore: Command received via UDP from host <172.16.11.63:3199>
9/24 06:57:19 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
9/24 06:57:19 vm1: match_info called
9/24 06:57:19 DaemonCore: Command received via TCP from host
<172.16.34.136:4765>
9/24 06:57:19 DaemonCore: received command 442 (REQUEST_CLAIM),
calling handler (command_request_claim)
9/24 06:57:19 vm1: Preempting claim has correct capability.
9/24 06:57:19 vm1: New claim has sufficient rank, preempting current claim.
9/24 06:57:19 vm1: State change: preempting claim based on machine rank
9/24 06:57:19 vm1: Changing state and activity: Claimed/Busy ->
Preempting/Vacating
9/24 06:57:22 vm1: State change: KILL is TRUE
9/24 06:57:22 vm1: Changing activity: Vacating -> Killing
9/24 06:57:27 DaemonCore: Command received via TCP from host <172.16.34.10:1084>
9/24 06:57:27 DaemonCore: received command 404
(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
9/24 06:57:27 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
9/24 06:57:27 DaemonCore: Command received via UDP from host <172.16.34.10:1087>
9/24 06:57:27 DaemonCore: received command 443 (RELEASE_CLAIM),
calling handler (command_handler)
9/24 06:57:27 vm1: Got RELEASE_CLAIM while in Preempting state, ignoring.
9/24 06:57:27 DaemonCore: Command received via UDP from host <172.16.34.10:1088>
9/24 06:57:27 DaemonCore: received command 443 (RELEASE_CLAIM),
calling handler (command_handler)
9/24 06:57:27 vm1: Got RELEASE_CLAIM while in Preempting state, ignoring.
9/24 06:57:27 DaemonCore: Command received via UDP from host <172.16.11.54:2363>
9/24 06:57:27 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())
9/24 06:57:27 Starter pid 2060 exited with status 0
9/24 06:57:28 vm1: State change: starter exited
9/24 06:57:28 vm1: State change: preempting claim exists - START is
true or undefined
9/24 06:57:28 vm1: Remote owner is Bar@xxxxx
9/24 06:57:28 vm1: State change: claiming protocol successful
9/24 06:57:28 vm1: Changing state and activity: Preempting/Killing ->
Claimed/Idle
9/24 06:57:57 DaemonCore: Command received via TCP from host
<172.16.34.136:4885>
9/24 06:57:57 DaemonCore: received command 444 (ACTIVATE_CLAIM),
calling handler (command_activate_claim)
9/24 06:57:57 vm1: Got activate_claim request from shadow (<172.16.34.136:4885>)
9/24 06:57:57 vm1: Remote job ID is 655.24
9/24 06:57:57 vm1: Got universe "VANILLA" (5) from request classad
9/24 06:57:57 vm1: State change: claim-activation protocol successful
9/24 06:57:57 vm1: Changing activity: Idle -> Busy
9/24 07:12:54 DaemonCore: Command received via TCP from host
<172.16.34.136:1423>
9/24 07:12:54 DaemonCore: received command 404
(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
9/24 07:12:54 vm1: Called deactivate_claim_forcibly()
9/24 07:12:54 DaemonCore: Command received via UDP from host <172.16.11.54:2433>
9/24 07:12:54 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())
9/24 07:12:54 Starter pid 1884 exited with status 0
9/24 07:12:54 vm1: State change: starter exited
9/24 07:12:54 vm1: Changing activity: Busy -> Idle

I left the vm2 lines in, I assume it has no effect, that the
preempting job choose to die soon after merely adds insult to injury
:¬)

negotiator log not too interesting

9/24 06:57:12 ---------- Started Negotiation Cycle ----------
9/24 06:57:12 Phase 1:  Obtaining ads from collector ...
9/24 06:57:12   Getting all public ads ...
9/24 06:57:12   Sorting 360 ads ...
9/24 06:57:12   Getting startd private ads ...
9/24 06:57:12 Got ads: 360 public and 216 private
9/24 06:57:12 Public ads include 5 submitter, 216 startd
9/24 06:57:12 Phase 2:  Performing accounting ...
9/24 06:57:13 Phase 3:  Sorting submitter ads by priority ...
9/24 06:57:13 Phase 4.1:  Negotiating with schedds ...
9/24 06:57:13   Negotiating with Bar@xxxxx at <172.16.34.136:1035>
9/24 06:57:13     Request 00655.00000:
9/24 06:57:13       Matched 655.0 Bar@xxxxx <172.16.34.136:1035>
preempting none <172.16.11.52:1725>
9/24 06:57:13       Successfully matched with vm1@xxxxxxxxxxx
9/24 06:57:13     Request 00655.00001:
9/24 06:57:13       Matched 655.1 Bar@xxxxx <172.16.34.136:1035>
preempting none <172.16.11.55:3391>
9/24 06:57:13       Successfully matched with vm1@xxxxxxxxxxx
9/24 06:57:13     Request 00655.00002:
9/24 06:57:14       Matched 655.2 Bar@xxxxx <172.16.34.136:1035>
preempting none <172.16.1.136:1048>
9/24 06:57:14       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:14     Request 00655.00003:
9/24 06:57:14       Matched 655.3 Bar@xxxxx <172.16.34.136:1035>
preempting none <172.16.1.164:4834>
9/24 06:57:14       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:14     Request 00655.00004:
9/24 06:57:14       Matched 655.4 Bar@xxxxx <172.16.34.136:1035>
preempting none <172.16.1.166:1050>
9/24 06:57:14       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:14     Request 00655.00005:
9/24 06:57:14       Matched 655.5 Bar@xxxxx <172.16.34.136:1035>
preempting none <172.16.1.176:3271>
9/24 06:57:14       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:14     Request 00655.00006:
9/24 06:57:15       Preempting Foo@xxxxx (prio=16.89) on
vm1@xxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:15       Matched 655.6 Bar@xxxxx <172.16.34.136:1035>
preempting Foo@xxxxx <172.16.1.50:4102>
9/24 06:57:15       Successfully matched with vm1@xxxxxxxxxxx
9/24 06:57:15     Request 00655.00007:
9/24 06:57:15       Preempting Foo@xxxxx (prio=16.89) on
vm1@xxxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:15       Matched 655.7 Bar@xxxxx <172.16.34.136:1035>
preempting Foo@xxxxx <172.16.1.157:1044>
9/24 06:57:15       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:15     Request 00655.00008:
9/24 06:57:15       Preempting Foo@xxxxx (prio=16.89) on
vm1@xxxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:15       Matched 655.8 Bar@xxxxx <172.16.34.136:1035>
preempting Foo@xxxxx <172.16.1.154:3576>
9/24 06:57:15       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:15     Request 00655.00009:
9/24 06:57:15       Preempting Foo@xxxxx (prio=16.89) on
vm1@xxxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:15       Matched 655.9 Bar@xxxxx <172.16.34.136:1035>
preempting Foo@xxxxx <172.16.1.160:3631>
9/24 06:57:15       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:15     Request 00655.00010:
9/24 06:57:16       Preempting Foo@xxxxx (prio=16.89) on
vm1@xxxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:16       Matched 655.10 Bar@xxxxx <172.16.34.136:1035>
preempting Foo@xxxxx <172.16.1.135:1049>
9/24 06:57:16       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:16     Request 00655.00011:
9/24 06:57:16       Preempting AnOther@xxxxx (prio=10.54) on
vm1@xxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:16       Matched 655.11 Bar@xxxxx <172.16.34.136:1035>
preempting AnOther@xxxxx <172.16.1.52:4999>
9/24 06:57:16       Successfully matched with vm1@xxxxxxxxxxx
9/24 06:57:16     Request 00655.00012:
9/24 06:57:16       Preempting AnOther@xxxxx (prio=10.54) on
vm1@xxxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:16       Matched 655.12 Bar@xxxxx <172.16.34.136:1035>
preempting AnOther@xxxxx <172.16.1.53:4229>
9/24 06:57:16       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:16     Request 00655.00013:
9/24 06:57:16       Preempting AnOther@xxxxx (prio=10.54) on
vm1@xxxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:16       Matched 655.13 Bar@xxxxx <172.16.34.136:1035>
preempting AnOther@xxxxx <172.16.1.56:1896>
9/24 06:57:16       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:16     Request 00655.00014:
9/24 06:57:17       Preempting AnOther@xxxxx (prio=10.54) on
vm1@xxxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:17       Matched 655.14 Bar@xxxxx <172.16.34.136:1035>
preempting AnOther@xxxxx <172.16.1.57:3190>
9/24 06:57:17       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:17     Request 00655.00015:
9/24 06:57:17       Preempting AnOther@xxxxx (prio=10.54) on
vm1@xxxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:17       Matched 655.15 Bar@xxxxx <172.16.34.136:1035>
preempting AnOther@xxxxx <172.16.1.58:2553>
9/24 06:57:17       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:17     Request 00655.00016:
9/24 06:57:17       Preempting AnOther@xxxxx (prio=10.54) on
vm1@xxxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:17       Matched 655.16 Bar@xxxxx <172.16.34.136:1035>
preempting AnOther@xxxxx <172.16.1.59:4494>
9/24 06:57:17       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:17     Request 00655.00017:
9/24 06:57:17       Preempting AnOther@xxxxx (prio=10.54) on
vm1@xxxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:17       Matched 655.17 Bar@xxxxx <172.16.34.136:1035>
preempting AnOther@xxxxx <172.16.1.149:4409>
9/24 06:57:17       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:17     Request 00655.00018:
9/24 06:57:17       Preempting AnOther@xxxxx (prio=10.54) on
vm1@xxxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:17       Matched 655.18 Bar@xxxxx <172.16.34.136:1035>
preempting AnOther@xxxxx <172.16.11.57:1050>
9/24 06:57:17       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:17     Request 00655.00019:
9/24 06:57:18       Preempting Foo@xxxxx (prio=16.89) on
vm1@xxxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:18       Matched 655.19 Bar@xxxxx <172.16.34.136:1035>
preempting Foo@xxxxx <172.16.1.143:1059>
9/24 06:57:18       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:18     Request 00655.00020:
9/24 06:57:18       Preempting Foo@xxxxx (prio=16.89) on
vm1@xxxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:18       Matched 655.20 Bar@xxxxx <172.16.34.136:1035>
preempting Foo@xxxxx <172.16.1.153:1260>
9/24 06:57:18       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:18     Request 00655.00021:
9/24 06:57:18       Preempting Foo@xxxxx (prio=16.89) on
vm1@xxxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:18       Matched 655.21 Bar@xxxxx <172.16.34.136:1035>
preempting Foo@xxxxx <172.16.1.144:2065>
9/24 06:57:18       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:18     Request 00655.00022:
9/24 06:57:18       Preempting Foo@xxxxx (prio=16.89) on
vm1@xxxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:18       Matched 655.22 Bar@xxxxx <172.16.34.136:1035>
preempting Foo@xxxxx <172.16.1.145:2060>
9/24 06:57:18       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:18     Request 00655.00023:
9/24 06:57:19       Preempting Foo@xxxxx (prio=16.89) on
vm1@xxxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:19       Matched 655.23 Bar@xxxxx <172.16.34.136:1035>
preempting Foo@xxxxx <172.16.11.61:1049>
9/24 06:57:19       Successfully matched with vm1@xxxxxxxxxxxx
9/24 06:57:19     Request 00655.00024:
9/24 06:57:19       Preempting Foo@xxxxx (prio=16.89) on
vm1@xxxxxxxxxxx for Bar@xxxxx (prio=8.12)
9/24 06:57:19       Matched 655.24 Bar@xxxxx <172.16.34.136:1035>
preempting Foo@xxxxx <172.16.11.54:3669>
9/24 06:57:19       Successfully matched with vm1@xxxxxxxxxxx
9/24 06:57:19     Got NO_MORE_JOBS;  done negotiating
9/24 06:57:19 ---------- Finished Negotiation Cycle ----------

Any idea why this happened, the actual clock is fine (ntp sync with
drift not jump settings)...

Matt