
[Condor-users] Condor eviction



Hi,

One of the machines in our cluster evicts jobs with no explanation.
I'm really getting sick of it, so I'm trying to troubleshoot it.

The last eviction occurred at 2/3 13:04:23

MasterLog hasn't changed since 2/3 02:23:50

Here's the relevant portion of the StartLog:
2/3 12:46:10 DaemonCore: Command received via TCP from host <128.122.140.85:59789>
2/3 12:46:10 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
2/3 12:46:10 vm1: Called deactivate_claim_forcibly()
2/3 12:46:10 Starter pid 10892 exited with status 0
2/3 12:46:10 vm1: State change: starter exited
2/3 12:46:10 vm1: Changing activity: Busy -> Idle
2/3 12:46:11 DaemonCore: Command received via TCP from host <128.122.140.85:59793>
2/3 12:46:11 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
2/3 12:46:11 vm1: Got activate_claim request from shadow (<128.122.140.85:59793>)
2/3 12:46:11 vm1: Remote job ID is 70685.0
2/3 12:46:11 vm1: Got universe "VANILLA" (5) from request classad
2/3 12:46:11 vm1: State change: claim-activation protocol successful
2/3 12:46:11 vm1: Changing activity: Idle -> Busy
2/3 13:04:09 vm2: State change: claim timed out (condor_schedd gone?)
2/3 13:04:09 vm2: Changing state and activity: Claimed/Busy -> Preempting/Killing
2/3 13:04:09 vm1: State change: claim timed out (condor_schedd gone?)
2/3 13:04:09 vm1: Changing state and activity: Claimed/Busy -> Preempting/Killing
2/3 13:04:10 DaemonCore: Command received via TCP from host <128.122.140.85:59834>
2/3 13:04:10 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
2/3 13:04:10 vm2: Got KILL_FRGN_JOB while in Preempting state, ignoring.
2/3 13:04:10 Starter pid 10896 exited with status 0
2/3 13:04:11 vm2: State change: starter exited
2/3 13:04:11 vm2: State change: No preempting claim, returning to owner
2/3 13:04:11 vm2: Changing state and activity: Preempting/Killing -> Owner/Idle
2/3 13:04:11 vm2: State change: IS_OWNER is false
2/3 13:04:11 vm2: Changing state: Owner -> Unclaimed
2/3 13:04:11 DaemonCore: Command received via TCP from host <128.122.140.85:59839>
2/3 13:04:11 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
2/3 13:04:11 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
2/3 13:04:11 Starter pid 11002 exited with status 0
2/3 13:04:11 vm1: State change: starter exited
2/3 13:04:11 vm1: State change: No preempting claim, returning to owner
2/3 13:04:11 vm1: Changing state and activity: Preempting/Killing -> Owner/Idle
2/3 13:04:11 vm1: State change: IS_OWNER is false
2/3 13:04:11 vm1: Changing state: Owner -> Unclaimed
2/3 13:04:15 State change: RunBenchmarks is TRUE
2/3 13:04:15 vm1: Changing activity: Idle -> Benchmarking
2/3 13:04:19 State change: benchmarks completed
2/3 13:04:19 vm1: Changing activity: Benchmarking -> Idle
2/3 13:04:19 State change: RunBenchmarks is TRUE
2/3 13:04:19 vm2: Changing activity: Idle -> Benchmarking
2/3 13:04:23 State change: benchmarks completed
2/3 13:04:23 vm2: Changing activity: Benchmarking -> Idle
2/3 13:09:22 DaemonCore: Command received via TCP from host <128.122.140.85:59857>
2/3 13:09:22 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
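
In case it helps anyone scan a longer StartLog for the same pattern, here is a rough Python sketch that pulls out every "claim timed out" entry with its timestamp and slot. It assumes the line layout shown in the excerpt above (date, time, optional "vmN:" prefix, message); the function name find_claim_timeouts and the sample list are made up for illustration and are not part of Condor.

```python
import re

# Assumed line layout, taken from the excerpt above:
#   "<month>/<day> <HH:MM:SS> [vmN: ]<message>"
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (?:(vm\d+): )?(.*)$")


def find_claim_timeouts(lines):
    """Return (timestamp, slot) pairs for every 'claim timed out' entry.

    Hypothetical helper: lines that do not start with a timestamp
    (for example the wrapped tail of a long entry) are skipped.
    """
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a timestamped log entry
        stamp, slot, message = match.groups()
        if "claim timed out" in message:
            hits.append((stamp, slot))
    return hits


# A few lines copied from the StartLog excerpt above.
sample = [
    "2/3 12:46:11 vm1: Changing activity: Idle -> Busy",
    "2/3 13:04:09 vm2: State change: claim timed out (condor_schedd gone?)",
    "2/3 13:04:09 vm1: State change: claim timed out (condor_schedd gone?)",
    "2/3 13:04:15 State change: RunBenchmarks is TRUE",
]

for stamp, slot in find_claim_timeouts(sample):
    print(stamp, slot)
```

To run it against the real file, pass an open file object (for example open("StartLog"), with whatever path your installation uses) in place of the sample list.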


In StarterLog.vm1, I see:
2/3 13:04:09 Got SIGQUIT.  Performing fast shutdown.
2/3 13:04:09 ShutdownFast all jobs.
2/3 13:04:10 Process exited, pid=11003, signal=9
2/3 13:04:10 Last process exited, now Starter is exiting
2/3 13:04:10 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0



Any ideas what's going on?

 Thanks,
 Joseph


--
http://www.cs.nyu.edu/~turian/