
[Condor-users] Jobs DELAYED/REFUSED/EVICTED




I am seeing the following behavior on my cluster, which is still running
Condor 6.7.18, and have been seeing it for quite some time.
Note that PREEMPTION_REQUIREMENTS, RANK, and START on my cluster
are set such that jobs should not be getting preempted
or evicted at all.
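For context, a sketch of the kind of policy I mean (not my literal configuration; the values shown here are illustrative, not what the cluster actually runs):

```
# Startd policy of this flavor should never preempt or evict:
PREEMPTION_REQUIREMENTS = False   # negotiator never preempts a claim
RANK = 0                          # startd never prefers one job over another
START = True                      # accept any job
# CLAIM_WORKLIFE shows up in the StartLog below; after this many
# seconds an *idle* claim is retired instead of being reused
# (value here is hypothetical):
CLAIM_WORKLIFE = 3600
```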



Here are the log entries for a single job. It appears that this
job was matched to a node, then evicted, but then successfully started
one minute later on the very same node (and is still running successfully
on said node as I write this e-mail).

Is there anything to be concerned about? For the most part the cluster
is staying claimed and jobs are running. I am just wondering what is
causing the apparent glitch.

There have been 70 jobs evicted this way on a cluster of some 370 VMs
with a queue that is quite full. It appears that most of the evictions,
like this one, happened before the job actually started.
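(For anyone wanting to reproduce the count: the eviction messages can be tallied straight out of the ShadowLog. A minimal sketch, using a temporary file with sample lines in the same format as the excerpt below — point `grep` at your real ShadowLog path instead:)

```shell
# Count shadow evictions by grepping for the "is being evicted" message.
# The sample log built here is hypothetical; substitute your ShadowLog.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
6/23 11:30:53 (397335.0) (8778): Job 397335.0 is being evicted
6/23 11:31:25 Initializing a VANILLA shadow for job 397335.0
EOF
grep -c "is being evicted" "$LOG"   # prints 1
rm -f "$LOG"
```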

Steve Timm



ShadowLog

6/23 11:30:51 Initializing a VANILLA shadow for job 397335.0
6/23 11:30:51 (397335.0) (8778): Request to run on <131.225.167.166:34031> was DELAYED
6/23 11:30:52 (397335.0) (8778): Request to run on <131.225.167.166:34031> was DELAYED
6/23 11:30:53 (397335.0) (8778): Request to run on <131.225.167.166:34031> was REFUSED
6/23 11:30:53 (397335.0) (8778): Job 397335.0 is being evicted
6/23 11:30:53 (397335.0) (8778): logEvictEvent with unknown reason (108), aborting
6/23 11:30:53 (397335.0) (8778): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 108
6/23 11:31:25 Initializing a VANILLA shadow for job 397335.0
6/23 11:31:25 (397335.0) (9423): Request to run on <131.225.167.166:34031> was ACCEPTED
6/23 11:31:26 (397335.0) (9423): Asked to write event of number 1.
6/23 11:31:34 (397335.0) (9423): DaemonCore: Command received via UDP from host <131.225.167.166:19198>
6/23 11:31:34 (397335.0) (9423): DaemonCore: received command 71000 (SHADOW_UPDATEINFO), calling handler (UniShadow::updateFromStarter)
6/23 11:31:34 (397335.0) (9423): Asked to write event of number 6


SchedLog


6/23 11:30:51 (pid:12870) Starting add_shadow_birthdate(397335.0)
6/23 11:30:51 (pid:12870) Started shadow for job 397335.0 on "<131.225.167.166:34031>", (shadow pid = 8778)
6/23 11:30:52 (pid:12870) Match record (<131.225.167.166:34031>, 397335, 0) deleted
6/23 11:30:53 (pid:12870) Shadow pid 8778 for job 397335.0 exited with status 108
6/23 11:31:25 (pid:12870) Starting add_shadow_birthdate(397335.0)
6/23 11:31:25 (pid:12870) Started shadow for job 397335.0 on "<131.225.167.166:34031>", (shadow pid = 9423)


MatchLog:

6/23 11:30:21     Request 397335.00000:
6/23 11:30:21 Rejected 397335.0 group_e898.minboone@xxxxxxxx <131.225.167.42:33246>: insufficient priority
6/23 11:31:20     Request 397335.00000:


StartLog
6/23 00:31:11 DaemonCore: Command received via TCP from host <131.225.167.42:19769>
6/23 00:31:11 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
6/23 00:31:11 vm2: Called deactivate_claim_forcibly()
6/23 00:31:11 Starter pid 28613 exited with status 0
6/23 00:31:11 vm2: State change: starter exited
6/23 00:31:11 vm2: Changing activity: Busy -> Idle
6/23 00:31:11 vm2: State change: idle claim shutting down due to CLAIM_WORKLIFE
6/23 00:31:11 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
6/23 00:31:11 vm2: State change: No preempting claim, returning to owner
6/23 00:31:11 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
6/23 00:31:11 vm2: State change: IS_OWNER is false
6/23 00:31:11 vm2: Changing state: Owner -> Unclaimed
6/23 00:31:11 State change: RunBenchmarks is TRUE
6/23 00:31:11 vm2: Changing activity: Idle -> Benchmarking
6/23 00:31:16 State change: benchmarks completed
6/23 00:31:16 vm2: Changing activity: Benchmarking -> Idle
6/23 00:31:16 DaemonCore: Command received via UDP from host <131.225.167.42:37712>
6/23 00:31:16 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
6/23 00:31:16 Warning: can't find resource with ClaimId (<131.225.167.166:34031>#1150905923#17)
6/23 00:35:18 DaemonCore: Command received via UDP from host <131.225.167.47:10511>
6/23 00:35:18 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
6/23 00:35:18 vm2: match_info called
6/23 00:35:18 vm2: Received match <131.225.167.166:34031>#1150905923#21
6/23 00:35:18 vm2: State change: match notification protocol successful
6/23 00:35:18 vm2: Changing state: Unclaimed -> Matched
6/23 00:35:19 DaemonCore: Command received via TCP from host <131.225.167.42:20086>
6/23 00:35:19 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
6/23 00:35:19 vm2: Request accepted.
6/23 00:35:19 vm2: Remote owner is minboone@xxxxxxxx
6/23 00:35:19 vm2: State change: claiming protocol successful
6/23 00:35:19 vm2: Changing state: Matched -> Claimed
6/23 00:35:25 DaemonCore: Command received via TCP from host <131.225.167.42:20114>
6/23 00:35:25 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
6/23 00:35:25 vm2: Got activate_claim request from shadow (<131.225.167.42:20114)
6/23 00:35:25 vm2: Remote job ID is 395170.0
6/23 00:35:25 vm2: Got universe "VANILLA" (5) from request classad
6/23 00:35:25 vm2: State change: claim-activation protocol successful
6/23 00:35:25 vm2: Changing activity: Idle -> Busy
6/23 11:30:48 DaemonCore: Command received via TCP from host <131.225.167.42:36815>
6/23 11:30:48 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
6/23 11:30:48 vm2: Called deactivate_claim_forcibly()
6/23 11:30:51 DaemonCore: Command received via TCP from host <131.225.167.42:36818>
6/23 11:30:51 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
6/23 11:30:51 vm2: Got activate claim while starter is still alive.
6/23 11:30:51 vm2: Telling shadow to try again later.
6/23 11:30:52 DaemonCore: Command received via TCP from host <131.225.167.42:36821>
6/23 11:30:52 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
6/23 11:30:52 vm2: Got activate claim while starter is still alive.
6/23 11:30:52 vm2: Telling shadow to try again later.
6/23 11:30:52 Starter pid 13410 exited with status 0
6/23 11:30:52 vm2: State change: starter exited
6/23 11:30:52 vm2: Changing activity: Busy -> Idle
6/23 11:30:52 vm2: State change: idle claim shutting down due to CLAIM_WORKLIFE
6/23 11:30:52 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
6/23 11:30:52 vm2: State change: No preempting claim, returning to owner
6/23 11:30:52 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
6/23 11:30:52 vm2: State change: IS_OWNER is false
6/23 11:30:52 vm2: Changing state: Owner -> Unclaimed
6/23 11:30:52 DaemonCore: Command received via UDP from host <131.225.167.42:22844>
6/23 11:30:52 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
6/23 11:30:52 Warning: can't find resource with ClaimId (<131.225.167.166:34031>#1150905923#21)
6/23 11:30:53 DaemonCore: Command received via TCP from host <131.225.167.42:36822>
6/23 11:30:53 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
6/23 11:30:53 Error: can't find resource with ClaimId (<131.225.167.166:34031>#1150905923#21)
6/23 11:30:53 DaemonCore: Command received via TCP from host <131.225.167.42:36823>
6/23 11:30:53 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
6/23 11:30:53 Error: can't find resource with ClaimId (<131.225.167.166:34031>#1150905923#21)
6/23 11:31:20 DaemonCore: Command received via UDP from host <131.225.167.47:10688>
6/23 11:31:20 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
6/23 11:31:20 vm2: match_info called
6/23 11:31:20 vm2: Received match <131.225.167.166:34031>#1150905923#25
6/23 11:31:20 vm2: State change: match notification protocol successful
6/23 11:31:20 vm2: Changing state: Unclaimed -> Matched
6/23 11:31:21 DaemonCore: Command received via TCP from host <131.225.167.42:36931>
6/23 11:31:21 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
6/23 11:31:21 vm2: Request accepted.
6/23 11:31:21 vm2: Remote owner is minboone@xxxxxxxx
6/23 11:31:21 vm2: State change: claiming protocol successful
6/23 11:31:21 vm2: Changing state: Matched -> Claimed
6/23 11:31:25 DaemonCore: Command received via TCP from host <131.225.167.42:36959>
6/23 11:31:25 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
6/23 11:31:25 vm2: Got activate_claim request from shadow (<131.225.167.42:36959)
6/23 11:31:25 vm2: Remote job ID is 397335.0
6/23 11:31:26 vm2: Got universe "VANILLA" (5) from request classad
6/23 11:31:26 vm2: State change: claim-activation protocol successful
6/23 11:31:26 vm2: Changing activity: Idle -> Busy




StarterLog.vm2:

6/23 00:35:25 ******************************************************
6/23 00:35:25 ** condor_starter (CONDOR_STARTER) STARTING UP
6/23 00:35:25 ** /opt/condor-6.7.18/sbin/condor_starter
6/23 00:35:25 ** $CondorVersion: 6.7.18 Mar 22 2006 $
6/23 00:35:25 ** $CondorPlatform: I386-LINUX_RH9 $
6/23 00:35:25 ** PID = 13410
6/23 00:35:25 ******************************************************
6/23 00:35:25 Using config file: /etc/condor/condor_config
6/23 00:35:25 Using local config files: /opt/condor/etc/group_params.config /opt/condor/local/condor_config.local
6/23 00:35:25 DaemonCore: Command Socket at <131.225.167.166:36256>
6/23 00:35:25 Done setting resource limits
6/23 00:35:25 Communicating with shadow <131.225.167.42:20111>
6/23 00:35:25 Submitting machine is "fngp-osg.fnal.gov"
6/23 00:35:25 Starting a VANILLA universe job with ID: 395170.0
6/23 00:35:25 IWD: /home/e898dev
6/23 00:35:25 Output file: /home/minboone/.globus/job/fngp-osg.fnal.gov/30219.1150987622/stdout
6/23 00:35:25 Error file: /home/minboone/.globus/job/fngp-osg.fnal.gov/30219.1150987622/stderr
6/23 00:35:25 About to exec /farm/e898_stage0/tmp/polly/may06_pid_cocktail_0313/input/may06_pid_cocktail_0313.sh condor_exec.exe
6/23 00:35:25 Create_Process succeeded, pid=13413
6/23 11:30:48 Process exited, pid=13413, status=0
6/23 11:30:48 Got SIGQUIT.  Performing fast shutdown.
6/23 11:30:48 ShutdownFast all jobs.
6/23 11:30:52 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
6/23 11:31:26 ******************************************************
6/23 11:31:26 ** condor_starter (CONDOR_STARTER) STARTING UP
6/23 11:31:26 ** /opt/condor-6.7.18/sbin/condor_starter
6/23 11:31:26 ** $CondorVersion: 6.7.18 Mar 22 2006 $
6/23 11:31:26 ** $CondorPlatform: I386-LINUX_RH9 $
6/23 11:31:26 ** PID = 2899
6/23 11:31:26 ******************************************************
6/23 11:31:26 Using config file: /etc/condor/condor_config
6/23 11:31:26 Using local config files: /opt/condor/etc/group_params.config /opt/condor/local/condor_config.local
6/23 11:31:26 DaemonCore: Command Socket at <131.225.167.166:36841>
6/23 11:31:26 Done setting resource limits
6/23 11:31:26 Communicating with shadow <131.225.167.42:36958>
6/23 11:31:26 Submitting machine is "fngp-osg.fnal.gov"
6/23 11:31:26 Starting a VANILLA universe job with ID: 397335.0
6/23 11:31:26 IWD: /home/e898dev
6/23 11:31:26 Output file: /home/minboone/.globus/job/fngp-osg.fnal.gov/30591.1151066691/stdout
6/23 11:31:26 Error file: /home/minboone/.globus/job/fngp-osg.fnal.gov/30591.1151066691/stderr
6/23 11:31:26 About to exec /farm/e898_stage0/tmp/polly/may06_cocktail_dirt_0699/input/may06_cocktail_dirt_0699.sh condor_exec.exe
6/23 11:31:26 Create_Process succeeded, pid=2900


--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team