
[Condor-users] Jobs DELAYED/REFUSED/EVICTED




I am seeing the following behavior on my cluster, which is still running
Condor 6.7.18, and have been seeing it for quite some time.
Note that PREEMPTION_REQUIREMENTS, RANK, and START on my cluster
are set such that jobs should not be getting preempted
or evicted at all.
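For context, a sketch of the kind of policy I mean (not my literal configuration; the values shown here are illustrative, not what the cluster actually runs):

```
# Startd policy of this flavor should never preempt or evict:
PREEMPTION_REQUIREMENTS = False   # negotiator never preempts a claim
RANK = 0                          # startd never prefers one job over another
START = True                      # accept any job
# CLAIM_WORKLIFE shows up in the StartLog below; after this many
# seconds an *idle* claim is retired instead of being reused
# (value here is hypothetical):
CLAIM_WORKLIFE = 3600
```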



Here are the log entries for a single job. It appears that this
job was matched to a node, then evicted, but then successfully started
one minute later on the very same node (and is still running successfully
on said node as I write this e-mail).

Is there anything to be concerned about? For the most part the cluster
is staying claimed and jobs are running. I am just wondering what is
causing the apparent glitch.

There have been 70 jobs evicted this way on a cluster of some 370 VMs
with a queue that is quite full. It appears that most of the evictions,
like this one, happened before the job actually started.
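(For anyone wanting to reproduce the count: the eviction messages can be tallied straight out of the ShadowLog. A minimal sketch, using a temporary file with sample lines in the same format as the excerpt below — point `grep` at your real ShadowLog path instead:)

```shell
# Count shadow evictions by grepping for the "is being evicted" message.
# The sample log built here is hypothetical; substitute your ShadowLog.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
6/23 11:30:53 (397335.0) (8778): Job 397335.0 is being evicted
6/23 11:31:25 Initializing a VANILLA shadow for job 397335.0
EOF
grep -c "is being evicted" "$LOG"   # prints 1
rm -f "$LOG"
```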

Steve Timm



ShadowLog

6/23 11:30:51 Initializing a VANILLA shadow for job 397335.0
6/23 11:30:51 (397335.0) (8778): Request to run on <131.225.167.166:34031> was DELAYED
6/23 11:30:52 (397335.0) (8778): Request to run on <131.225.167.166:34031> was DELAYED
6/23 11:30:53 (397335.0) (8778): Request to run on <131.225.167.166:34031> was REFUSED
6/23 11:30:53 (397335.0) (8778): Job 397335.0 is being evicted
6/23 11:30:53 (397335.0) (8778): logEvictEvent with unknown reason (108), aborting
6/23 11:30:53 (397335.0) (8778): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 108
6/23 11:31:25 Initializing a VANILLA shadow for job 397335.0
6/23 11:31:25 (397335.0) (9423): Request to run on <131.225.167.166:34031> was ACCEPTED
6/23 11:31:26 (397335.0) (9423): Asked to write event of number 1.
6/23 11:31:34 (397335.0) (9423): DaemonCore: Command received via UDP from host <131.225.167.166:19198>
6/23 11:31:34 (397335.0) (9423): DaemonCore: received command 71000 (SHADOW_UPDATEINFO), calling handler (UniShadow::updateFromStarter)
6/23 11:31:34 (397335.0) (9423): Asked to write event of number 6


SchedLog


6/23 11:30:51 (pid:12870) Starting add_shadow_birthdate(397335.0)
6/23 11:30:51 (pid:12870) Started shadow for job 397335.0 on "<131.225.167.166:34031>", (shadow pid = 8778)
6/23 11:30:52 (pid:12870) Match record (<131.225.167.166:34031>, 397335, 0) deleted
6/23 11:30:53 (pid:12870) Shadow pid 8778 for job 397335.0 exited with status 108
6/23 11:31:25 (pid:12870) Starting add_shadow_birthdate(397335.0)
6/23 11:31:25 (pid:12870) Started shadow for job 397335.0 on "<131.225.167.166:34031>", (shadow pid = 9423)


MatchLog:

6/23 11:30:21     Request 397335.00000:
6/23 11:30:21 Rejected 397335.0 group_e898.minboone@xxxxxxxx <131.225.167.42:33246>: insufficient priority
6/23 11:31:20     Request 397335.00000:


StartLog
6/23 00:31:11 DaemonCore: Command received via TCP from host <131.225.167.42:19769>
6/23 00:31:11 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
6/23 00:31:11 vm2: Called deactivate_claim_forcibly()
6/23 00:31:11 Starter pid 28613 exited with status 0
6/23 00:31:11 vm2: State change: starter exited
6/23 00:31:11 vm2: Changing activity: Busy -> Idle
6/23 00:31:11 vm2: State change: idle claim shutting down due to CLAIM_WORKLIFE
6/23 00:31:11 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
6/23 00:31:11 vm2: State change: No preempting claim, returning to owner
6/23 00:31:11 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
6/23 00:31:11 vm2: State change: IS_OWNER is false
6/23 00:31:11 vm2: Changing state: Owner -> Unclaimed
6/23 00:31:11 State change: RunBenchmarks is TRUE
6/23 00:31:11 vm2: Changing activity: Idle -> Benchmarking
6/23 00:31:16 State change: benchmarks completed
6/23 00:31:16 vm2: Changing activity: Benchmarking -> Idle
6/23 00:31:16 DaemonCore: Command received via UDP from host <131.225.167.42:37712>
6/23 00:31:16 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
6/23 00:31:16 Warning: can't find resource with ClaimId (<131.225.167.166:34031>#1150905923#17)
6/23 00:35:18 DaemonCore: Command received via UDP from host <131.225.167.47:10511>
6/23 00:35:18 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
6/23 00:35:18 vm2: match_info called
6/23 00:35:18 vm2: Received match <131.225.167.166:34031>#1150905923#21
6/23 00:35:18 vm2: State change: match notification protocol successful
6/23 00:35:18 vm2: Changing state: Unclaimed -> Matched
6/23 00:35:19 DaemonCore: Command received via TCP from host <131.225.167.42:20086>
6/23 00:35:19 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
6/23 00:35:19 vm2: Request accepted.
6/23 00:35:19 vm2: Remote owner is minboone@xxxxxxxx
6/23 00:35:19 vm2: State change: claiming protocol successful
6/23 00:35:19 vm2: Changing state: Matched -> Claimed
6/23 00:35:25 DaemonCore: Command received via TCP from host <131.225.167.42:20114>
6/23 00:35:25 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
6/23 00:35:25 vm2: Got activate_claim request from shadow (<131.225.167.42:20114)
6/23 00:35:25 vm2: Remote job ID is 395170.0
6/23 00:35:25 vm2: Got universe "VANILLA" (5) from request classad
6/23 00:35:25 vm2: State change: claim-activation protocol successful
6/23 00:35:25 vm2: Changing activity: Idle -> Busy
6/23 11:30:48 DaemonCore: Command received via TCP from host <131.225.167.42:36815>
6/23 11:30:48 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
6/23 11:30:48 vm2: Called deactivate_claim_forcibly()
6/23 11:30:51 DaemonCore: Command received via TCP from host <131.225.167.42:36818>
6/23 11:30:51 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
6/23 11:30:51 vm2: Got activate claim while starter is still alive.
6/23 11:30:51 vm2: Telling shadow to try again later.
6/23 11:30:52 DaemonCore: Command received via TCP from host <131.225.167.42:36821>
6/23 11:30:52 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
6/23 11:30:52 vm2: Got activate claim while starter is still alive.
6/23 11:30:52 vm2: Telling shadow to try again later.
6/23 11:30:52 Starter pid 13410 exited with status 0
6/23 11:30:52 vm2: State change: starter exited
6/23 11:30:52 vm2: Changing activity: Busy -> Idle
6/23 11:30:52 vm2: State change: idle claim shutting down due to CLAIM_WORKLIFE
6/23 11:30:52 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
6/23 11:30:52 vm2: State change: No preempting claim, returning to owner
6/23 11:30:52 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
6/23 11:30:52 vm2: State change: IS_OWNER is false
6/23 11:30:52 vm2: Changing state: Owner -> Unclaimed
6/23 11:30:52 DaemonCore: Command received via UDP from host <131.225.167.42:22844>
6/23 11:30:52 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
6/23 11:30:52 Warning: can't find resource with ClaimId (<131.225.167.166:34031>#1150905923#21)
6/23 11:30:53 DaemonCore: Command received via TCP from host <131.225.167.42:36822>
6/23 11:30:53 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
6/23 11:30:53 Error: can't find resource with ClaimId (<131.225.167.166:34031>#1150905923#21)
6/23 11:30:53 DaemonCore: Command received via TCP from host <131.225.167.42:36823>
6/23 11:30:53 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
6/23 11:30:53 Error: can't find resource with ClaimId (<131.225.167.166:34031>#1150905923#21)
6/23 11:31:20 DaemonCore: Command received via UDP from host <131.225.167.47:10688>
6/23 11:31:20 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
6/23 11:31:20 vm2: match_info called
6/23 11:31:20 vm2: Received match <131.225.167.166:34031>#1150905923#25
6/23 11:31:20 vm2: State change: match notification protocol successful
6/23 11:31:20 vm2: Changing state: Unclaimed -> Matched
6/23 11:31:21 DaemonCore: Command received via TCP from host <131.225.167.42:36931>
6/23 11:31:21 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
6/23 11:31:21 vm2: Request accepted.
6/23 11:31:21 vm2: Remote owner is minboone@xxxxxxxx
6/23 11:31:21 vm2: State change: claiming protocol successful
6/23 11:31:21 vm2: Changing state: Matched -> Claimed
6/23 11:31:25 DaemonCore: Command received via TCP from host <131.225.167.42:36959>
6/23 11:31:25 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
6/23 11:31:25 vm2: Got activate_claim request from shadow (<131.225.167.42:36959)
6/23 11:31:25 vm2: Remote job ID is 397335.0
6/23 11:31:26 vm2: Got universe "VANILLA" (5) from request classad
6/23 11:31:26 vm2: State change: claim-activation protocol successful
6/23 11:31:26 vm2: Changing activity: Idle -> Busy




StarterLog.vm2:

6/23 00:35:25 ******************************************************
6/23 00:35:25 ** condor_starter (CONDOR_STARTER) STARTING UP
6/23 00:35:25 ** /opt/condor-6.7.18/sbin/condor_starter
6/23 00:35:25 ** $CondorVersion: 6.7.18 Mar 22 2006 $
6/23 00:35:25 ** $CondorPlatform: I386-LINUX_RH9 $
6/23 00:35:25 ** PID = 13410
6/23 00:35:25 ******************************************************
6/23 00:35:25 Using config file: /etc/condor/condor_config
6/23 00:35:25 Using local config files: /opt/condor/etc/group_params.config /opt/condor/local/condor_config.local
6/23 00:35:25 DaemonCore: Command Socket at <131.225.167.166:36256>
6/23 00:35:25 Done setting resource limits
6/23 00:35:25 Communicating with shadow <131.225.167.42:20111>
6/23 00:35:25 Submitting machine is "fngp-osg.fnal.gov"
6/23 00:35:25 Starting a VANILLA universe job with ID: 395170.0
6/23 00:35:25 IWD: /home/e898dev
6/23 00:35:25 Output file: /home/minboone/.globus/job/fngp-osg.fnal.gov/30219.1150987622/stdout
6/23 00:35:25 Error file: /home/minboone/.globus/job/fngp-osg.fnal.gov/30219.1150987622/stderr
6/23 00:35:25 About to exec /farm/e898_stage0/tmp/polly/may06_pid_cocktail_0313/input/may06_pid_cocktail_0313.sh condor_exec.exe
6/23 00:35:25 Create_Process succeeded, pid=13413
6/23 11:30:48 Process exited, pid=13413, status=0
6/23 11:30:48 Got SIGQUIT.  Performing fast shutdown.
6/23 11:30:48 ShutdownFast all jobs.
6/23 11:30:52 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
6/23 11:31:26 ******************************************************
6/23 11:31:26 ** condor_starter (CONDOR_STARTER) STARTING UP
6/23 11:31:26 ** /opt/condor-6.7.18/sbin/condor_starter
6/23 11:31:26 ** $CondorVersion: 6.7.18 Mar 22 2006 $
6/23 11:31:26 ** $CondorPlatform: I386-LINUX_RH9 $
6/23 11:31:26 ** PID = 2899
6/23 11:31:26 ******************************************************
6/23 11:31:26 Using config file: /etc/condor/condor_config
6/23 11:31:26 Using local config files: /opt/condor/etc/group_params.config /opt/condor/local/condor_config.local
6/23 11:31:26 DaemonCore: Command Socket at <131.225.167.166:36841>
6/23 11:31:26 Done setting resource limits
6/23 11:31:26 Communicating with shadow <131.225.167.42:36958>
6/23 11:31:26 Submitting machine is "fngp-osg.fnal.gov"
6/23 11:31:26 Starting a VANILLA universe job with ID: 397335.0
6/23 11:31:26 IWD: /home/e898dev
6/23 11:31:26 Output file: /home/minboone/.globus/job/fngp-osg.fnal.gov/30591.1151066691/stdout
6/23 11:31:26 Error file: /home/minboone/.globus/job/fngp-osg.fnal.gov/30591.1151066691/stderr
6/23 11:31:26 About to exec /farm/e898_stage0/tmp/polly/may06_cocktail_dirt_0699/input/may06_cocktail_dirt_0699.sh condor_exec.exe
6/23 11:31:26 Create_Process succeeded, pid=2900


--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team