
[Condor-users] job evicted by same job cluster?



Hi all,

New day, new problem (or so it seems). We *think* we have disabled job
eviction/preemption by setting

PREEMPT=FALSE
PREEMPTION_REQUIREMENTS=FALSE
PREEMPTION_RANK=0
EVICT_BACKFILL=FALSE

in what we hope are the appropriate places. However, user ABC found that
his jobs were evicted today:

StartLog:

11/18 06:34:23 slot3: match_info called
11/18 06:34:23 slot3: Received match <10.10.5.44:58074>#1226915199#39#...
11/18 06:34:23 slot3: State change: match notification protocol successful
11/18 06:34:23 slot3: Changing state: Unclaimed -> Matched
11/18 06:34:24 slot3: Request accepted.
11/18 06:34:24 slot3: Remote owner is ABC@xxxxxxxxxxx
11/18 06:34:24 slot3: State change: claiming protocol successful
11/18 06:34:24 slot3: Changing state: Matched -> Claimed
11/18 06:34:26 slot3: Got activate_claim request from shadow (<10.20.30.2:51258>)
11/18 06:34:26 slot3: Remote job ID is 6334497.15
11/18 06:34:26 slot3: Got universe "VANILLA" (5) from request classad
11/18 06:34:26 slot3: State change: claim-activation protocol successful
11/18 06:34:26 slot3: Changing activity: Idle -> Busy
11/18 06:34:58 slot3: match_info called
11/18 06:35:33 slot3: match_info called
11/18 06:36:09 slot3: match_info called
11/18 06:36:45 slot3: match_info called
11/18 06:37:19 slot3: match_info called
11/18 06:37:19 slot3: Preempting claim has correct ClaimId.
11/18 06:37:19 slot3: New claim has sufficient rank, preempting current claim.
11/18 06:37:19 slot3: State change: preempting claim based on user priority
11/18 06:37:19 slot3: State change: claim retirement ended/expired
11/18 06:37:19 slot3: Changing state and activity: Claimed/Busy -> Preempting/Vacating
11/18 06:37:19 slot3: Got KILL_FRGN_JOB while in Preempting state, ignoring.
11/18 06:37:19 Starter pid 24381 exited with status 0
11/18 06:37:19 slot3: State change: starter exited
11/18 06:37:19 slot3: State change: preempting claim exists - START is true or undefined
11/18 06:37:19 slot3: Remote owner is ABC@xxxxxxxxxxx
11/18 06:37:19 slot3: State change: claiming protocol successful
11/18 06:37:19 slot3: Changing state and activity: Preempting/Vacating -> Claimed/Idle
11/18 06:37:21 slot3: Got activate_claim request from shadow (<10.20.30.2:54629>)
11/18 06:37:22 slot3: Remote job ID is 6334497.38
11/18 06:37:22 slot3: Got universe "VANILLA" (5) from request classad
11/18 06:37:22 slot3: State change: claim-activation protocol successful
11/18 06:37:22 slot3: Changing activity: Idle -> Busy

StarterLog.slot3:
11/18 06:33:31 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
11/18 06:34:26 ******************************************************
11/18 06:34:26 ** condor_starter (CONDOR_STARTER) STARTING UP
11/18 06:34:26 ** /opt/condor-7.0.5/sbin/condor_starter
11/18 06:34:26 ** $CondorVersion: 7.0.5 Oct 22 2008 $
11/18 06:34:26 ** $CondorPlatform: X86_64-LINUX_DEBIAN40 $
11/18 06:34:26 ** PID = 24381
11/18 06:34:26 ** Log last touched 11/18 06:33:31
11/18 06:34:26 ******************************************************
11/18 06:34:26 Using config source: /opt/condor/etc/condor_config
11/18 06:34:26 Using local config sources:
11/18 06:34:26    /etc/default/condor|
11/18 06:34:26 DaemonCore: Command Socket at <10.10.5.44:49504>
11/18 06:34:26 Done setting resource limits
11/18 06:34:27 Communicating with shadow <10.20.30.2:47285>
11/18 06:34:27 Submitting machine is "h2.atlas.local"
11/18 06:34:27 setting the orig job name in starter
11/18 06:34:27 setting the orig job iwd in starter
11/18 06:34:27 Job 6334497.15 set to execute immediately
11/18 06:34:27 Starting a VANILLA universe job with ID: 6334497.15
11/18 06:34:27 IWD: /home/ABC/test_ecc
11/18 06:34:27 Renice expr "0" evaluated to 0
11/18 06:34:27 About to exec /home/ABC/test_ecc/eccentric_50.sh 15 0.0
11/18 06:34:27 Create_Process succeeded, pid=24389
11/18 06:37:19 Got SIGTERM. Performing graceful shutdown.
11/18 06:37:19 ShutdownGraceful all jobs.
11/18 06:37:19 Process exited, pid=24389, signal=15
11/18 06:37:19 Last process exited, now Starter is exiting
11/18 06:37:19 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
11/18 06:37:22 ******************************************************
11/18 06:37:22 ** condor_starter (CONDOR_STARTER) STARTING UP
11/18 06:37:22 ** /opt/condor-7.0.5/sbin/condor_starter
11/18 06:37:22 ** $CondorVersion: 7.0.5 Oct 22 2008 $
11/18 06:37:22 ** $CondorPlatform: X86_64-LINUX_DEBIAN40 $
11/18 06:37:22 ** PID = 24632
11/18 06:37:22 ** Log last touched 11/18 06:37:19
11/18 06:37:22 ******************************************************
11/18 06:37:22 Using config source: /opt/condor/etc/condor_config
11/18 06:37:22 Using local config sources:
11/18 06:37:22    /etc/default/condor|
11/18 06:37:22 DaemonCore: Command Socket at <10.10.5.44:44237>
11/18 06:37:22 Done setting resource limits
11/18 06:37:22 Communicating with shadow <10.20.30.2:37297>
11/18 06:37:22 Submitting machine is "h2.atlas.local"
11/18 06:37:22 setting the orig job name in starter
11/18 06:37:22 setting the orig job iwd in starter
11/18 06:37:22 Job 6334497.38 set to execute immediately
11/18 06:37:22 Starting a VANILLA universe job with ID: 6334497.38
11/18 06:37:22 IWD: /home/ABC/test_ecc
11/18 06:37:22 Renice expr "0" evaluated to 0
11/18 06:37:22 About to exec /home/ABC/test_ecc/eccentric_50.sh 38 0.0
11/18 06:37:22 Create_Process succeeded, pid=24639
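
In case it helps anyone look for the same pattern in their own StartLog,
here is a rough Python sketch that flags preemptions where the old and the
new claim belong to the same remote owner. It only assumes the three message
formats visible in the excerpt above ("Remote owner is", "Remote job ID is",
"State change: preempting claim based on"); the function name and the
returned tuple layout are made up for illustration, and other Condor
versions may word these messages differently.

```python
import re

# Assumed line layout, taken from the StartLog excerpt: "MM/DD HH:MM:SS slotN: message"
LINE_RE = re.compile(r"^(\d+/\d+ \d+:\d+:\d+) (slot\d+): (.*)$")
OWNER_PREFIX = "Remote owner is "
JOB_PREFIX = "Remote job ID is "
PREEMPT_PREFIX = "State change: preempting claim based on"


def find_same_owner_preemptions(log_text):
    """Return (timestamp, slot, owner, old_job, new_job) tuples for every
    preemption where the preempting claim has the same remote owner."""
    owner_seen = {}  # slot -> most recently logged remote owner
    current = {}     # slot -> (owner, job_id) of the active claim
    pending = {}     # slot -> (timestamp, owner, job_id) of the claim being preempted
    hits = []
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw.strip())
        if m is None:
            continue  # wrapped continuation lines and non-slot lines are skipped
        ts, slot, msg = m.groups()
        if msg.startswith(OWNER_PREFIX):
            owner_seen[slot] = msg[len(OWNER_PREFIX):].strip()
        elif msg.startswith(JOB_PREFIX):
            job = msg[len(JOB_PREFIX):].strip()
            owner = owner_seen.get(slot)
            if slot in pending:
                when, old_owner, old_job = pending.pop(slot)
                if owner is not None and owner == old_owner:
                    hits.append((when, slot, owner, old_job, job))
            current[slot] = (owner, job)
        elif msg.startswith(PREEMPT_PREFIX):
            if slot in current:
                pending[slot] = (ts,) + current[slot]
    return hits


if __name__ == "__main__":
    import sys
    # Usage: python script.py /path/to/StartLog
    with open(sys.argv[1]) as fh:
        for hit in find_same_owner_preemptions(fh.read()):
            print(hit)
```

Fed the StartLog excerpt above, this should report the 06:37:19 preemption
on slot3 in which job 6334497.15 was replaced by 6334497.38, both owned by
ABC.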

The cluster itself is quite busy right now, but I don't understand why one
job of user ABC (6334497.15) was replaced by another job from the same user
and the same job cluster (6334497.38).

I guess I'm missing something. Can someone point me to what we have
overlooked?

Thanks

puzzled Carsten