Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Condor eviction

Date: Fri, 03 Feb 2006 13:03:54 -0600 (CST)
From: Steven Timm <timm@xxxxxxxx>
Subject: Re: [Condor-users] Condor eviction

Look at SchedLog and ShadowLog. I have seen this happen when
the schedd has trouble communicating with the node in question.
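
One way to do that check is to pull out only the entries logged close to
the time the claim timed out. The following is a minimal sketch, not a
Condor tool: it assumes the logs use the same "M/D HH:MM:SS message"
prefix as the StartLog excerpt quoted below, and the names parse_line and
lines_near, the 120-second window, and the year argument are all
illustrative choices (the log lines themselves carry no year).

```python
import re
from datetime import datetime, timedelta

# Assumed line prefix, taken from the StartLog excerpt: "2/3 13:04:09 message".
LINE_RE = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) (.*)$")


def parse_line(line, year):
    """Return (timestamp, message), or None if the line has no timestamp prefix."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None  # e.g. the wrapped tail of a long log line
    month, day, hh, mm, ss = (int(g) for g in m.groups()[:5])
    return datetime(year, month, day, hh, mm, ss), m.group(6)


def lines_near(lines, when, window_seconds=120, year=2006):
    """Collect log entries stamped within window_seconds of `when`."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue
        stamp, message = parsed
        if abs(stamp - when) <= window:
            hits.append((stamp, message))
    return hits


# Small demonstration on lines copied from the StartLog excerpt.
sample = [
    "2/3 12:46:11 vm1: Changing activity: Idle -> Busy\n",
    "2/3 13:04:09 vm1: State change: claim timed out (condor_schedd gone?)\n",
    "2/3 13:09:22 DaemonCore: Command received via TCP from host\n",
]
for stamp, message in lines_near(sample, datetime(2006, 2, 3, 13, 4, 9)):
    print(f"{stamp:%m/%d %H:%M:%S} {message}")
```

To run it against real files, pass an open file object for SchedLog or
ShadowLog as the first argument to lines_near instead of the sample list;
where those files live depends on the local configuration.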

Steve


------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team

On Fri, 3 Feb 2006, Joseph Turian wrote:

Hi,

One of the machines in our cluster evicts jobs with no explanation.
I'm really getting sick of it, so I'm trying to troubleshoot it.

The last eviction occurred at 2/3 13:04:23

MasterLog hasn't changed since 2/3 02:23:50

Here's the relevant portion of the StartLog:
2/3 12:46:10 DaemonCore: Command received via TCP from host <128.122.140.85:59789>
2/3 12:46:10 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
2/3 12:46:10 vm1: Called deactivate_claim_forcibly()
2/3 12:46:10 Starter pid 10892 exited with status 0
2/3 12:46:10 vm1: State change: starter exited
2/3 12:46:10 vm1: Changing activity: Busy -> Idle
2/3 12:46:11 DaemonCore: Command received via TCP from host <128.122.140.85:59793>
2/3 12:46:11 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
2/3 12:46:11 vm1: Got activate_claim request from shadow (<128.122.140.85:59793>)
2/3 12:46:11 vm1: Remote job ID is 70685.0
2/3 12:46:11 vm1: Got universe "VANILLA" (5) from request classad
2/3 12:46:11 vm1: State change: claim-activation protocol successful
2/3 12:46:11 vm1: Changing activity: Idle -> Busy
2/3 13:04:09 vm2: State change: claim timed out (condor_schedd gone?)
2/3 13:04:09 vm2: Changing state and activity: Claimed/Busy -> Preempting/Killing
2/3 13:04:09 vm1: State change: claim timed out (condor_schedd gone?)
2/3 13:04:09 vm1: Changing state and activity: Claimed/Busy -> Preempting/Killing
2/3 13:04:10 DaemonCore: Command received via TCP from host <128.122.140.85:59834>
2/3 13:04:10 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
2/3 13:04:10 vm2: Got KILL_FRGN_JOB while in Preempting state, ignoring.
2/3 13:04:10 Starter pid 10896 exited with status 0
2/3 13:04:11 vm2: State change: starter exited
2/3 13:04:11 vm2: State change: No preempting claim, returning to owner
2/3 13:04:11 vm2: Changing state and activity: Preempting/Killing -> Owner/Idle
2/3 13:04:11 vm2: State change: IS_OWNER is false
2/3 13:04:11 vm2: Changing state: Owner -> Unclaimed
2/3 13:04:11 DaemonCore: Command received via TCP from host <128.122.140.85:59839>
2/3 13:04:11 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
2/3 13:04:11 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
2/3 13:04:11 Starter pid 11002 exited with status 0
2/3 13:04:11 vm1: State change: starter exited
2/3 13:04:11 vm1: State change: No preempting claim, returning to owner
2/3 13:04:11 vm1: Changing state and activity: Preempting/Killing -> Owner/Idle
2/3 13:04:11 vm1: State change: IS_OWNER is false
2/3 13:04:11 vm1: Changing state: Owner -> Unclaimed
2/3 13:04:15 State change: RunBenchmarks is TRUE
2/3 13:04:15 vm1: Changing activity: Idle -> Benchmarking
2/3 13:04:19 State change: benchmarks completed
2/3 13:04:19 vm1: Changing activity: Benchmarking -> Idle
2/3 13:04:19 State change: RunBenchmarks is TRUE
2/3 13:04:19 vm2: Changing activity: Idle -> Benchmarking
2/3 13:04:23 State change: benchmarks completed
2/3 13:04:23 vm2: Changing activity: Benchmarking -> Idle
2/3 13:09:22 DaemonCore: Command received via TCP from host <128.122.140.85:59857>
2/3 13:09:22 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)


In StarterLog.vm1, I see:
2/3 13:04:09 Got SIGQUIT.  Performing fast shutdown.
2/3 13:04:09 ShutdownFast all jobs.
2/3 13:04:10 Process exited, pid=11003, signal=9
2/3 13:04:10 Last process exited, now Starter is exiting
2/3 13:04:10 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0



Any ideas what's going on?

Thanks,
Joseph


--
http://www.cs.nyu.edu/~turian/


