
[Condor-users] Jobs repeatedly evicted after 30 mins



We have a situation where a user submits ~10 jobs, 
each of which should run for ~5 hours. Most of them 
are repeatedly evicted after 30 minutes and requeued. 
Below are the relevant logs from the submitting and execute 
machines for one particular instance.

I have tested this myself with different jobs, and the eviction
ALWAYS comes a few seconds (about 20) short of 30 minutes.

The line in the START LOG:

3/1 05:57:16 State change: claim timed out (condor_schedd gone?)

seems to be the relevant one?

ALL of the evictions (for different execute machines and different 
jobs, same submit machine) occur at 30 minutes.

Thanks for any help.

Cheers

Greg

RELEVANT CONFIG SETTINGS?

MachineBusy = ( $(WorkHours) || $(CPUBusy) || $(KeyboardBusy) ) 

WorkHours = ( (ClockMin >= 480 && ClockMin < 1080) && \
              (ClockDay > 0 && ClockDay < 6) ) 
AfterHours = ( (ClockMin < 480 || ClockMin >= 1080) || \
               (ClockDay == 0 || ClockDay == 6) )

WANT_SUSPEND	= False
WANT_VACATE		= False
START			= $(AfterHours) && $(CPUIdle) && KeyboardIdle > $(StartIdleTime)
SUSPEND		= $(UWCS_SUSPEND)
CONTINUE		= $(UWCS_CONTINUE)
PREEMPT		= $(UWCS_PREEMPT)
KILL			= True
PERIODIC_CHECKPOINT	= $(UWCS_PERIODIC_CHECKPOINT)
PREEMPTION_REQUIREMENTS	= $(UWCS_PREEMPTION_REQUIREMENTS)
PREEMPTION_RANK		= $(UWCS_PREEMPTION_RANK)
NEGOTIATOR_PRE_JOB_RANK = $(UWCS_NEGOTIATOR_PRE_JOB_RANK) 
NEGOTIATOR_POST_JOB_RANK = $(UWCS_NEGOTIATOR_POST_JOB_RANK)
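
As a rough cross-check of the WorkHours/AfterHours expressions above, the following Python sketch evaluates them at the claim and eviction times from the execute machine's logs. It assumes ClockMin is minutes since midnight and ClockDay runs from 0 (Sunday) to 6 (Saturday), and that the log timestamps are the execute machine's local time. The function names are made up for illustration.

```python
def clock_min(hhmmss: str) -> int:
    """Minutes since midnight for an HH:MM:SS log timestamp."""
    hours, minutes, _seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 60 + minutes


def work_hours(minute: int, day: int) -> bool:
    # Mirrors: (ClockMin >= 480 && ClockMin < 1080) && (ClockDay > 0 && ClockDay < 6)
    return (480 <= minute < 1080) and (0 < day < 6)


def after_hours(minute: int, day: int) -> bool:
    # Mirrors: (ClockMin < 480 || ClockMin >= 1080) || (ClockDay == 0 || ClockDay == 6)
    return (minute < 480 or minute >= 1080) or (day == 0 or day == 6)


if __name__ == "__main__":
    # Claim and eviction times taken from the execute machine's logs.
    for stamp in ("05:27:16", "05:57:16"):
        minute = clock_min(stamp)
        # Both times fall before 08:00, so the weekday does not matter here.
        results = {day: after_hours(minute, day) for day in range(7)}
        print(stamp, "ClockMin =", minute, "AfterHours for every day:", all(results.values()))
```

Under those assumptions AfterHours is true at both times on every day of the week, so the WorkHours window does not by itself explain an eviction at 05:57.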


SHADOW LOG FROM SUBMITTING MACHINE

3/1 08:27:18 ******************************************************
3/1 08:27:18 ** condor_shadow (CONDOR_SHADOW) STARTING UP
3/1 08:27:18 ** C:\Condor\bin\condor_shadow.exe
3/1 08:27:18 ** $CondorVersion: 6.6.10 Jun 22 2005 $
3/1 08:27:18 ** $CondorPlatform: INTEL-WINNT50 $
3/1 08:27:18 ** PID = 1148
3/1 08:27:18 ******************************************************
3/1 08:27:18 Using config file: c:\condor\condor_config
3/1 08:27:18 Using local config files: C:\Condor/condor_config.local 
3/1 08:27:18 DaemonCore: Command Socket at <130.155.67.83:9149> 
3/1 08:27:19 Initializing a VANILLA shadow 
3/1 08:27:20 (72.0) (1148): Request to run on <130.116.147.60:9836> was ACCEPTED 
3/1 08:57:16 (72.0) (1148): Job 72.0 is being evicted 
3/1 08:57:16 (72.0) (1148): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107

SCHEDD LOG FROM SUBMITTING MACHINE

3/1 08:27:18 Started shadow for job 72.0 on "<130.116.147.60:9836>", (shadow pid = 1148) 
3/1 08:57:16 Shadow pid 1148 for job 72.0 exited with status 107 
3/1 08:57:16 Sent RELEASE_CLAIM to startd on <130.116.147.60:9836> 
3/1 08:57:16 Match record (<130.116.147.60:9836>, 72, 0) deleted

STARTER LOG FROM EXECUTE MACHINE

3/1 05:27:21 ******************************************************
3/1 05:27:21 ** condor_starter (CONDOR_STARTER) STARTING UP
3/1 05:27:21 ** C:\Condor\bin\condor_starter.exe
3/1 05:27:21 ** $CondorVersion: 6.6.10 Jun 22 2005 $
3/1 05:27:21 ** $CondorPlatform: INTEL-WINNT50 $
3/1 05:27:21 ** PID = 800
3/1 05:27:21 ******************************************************
3/1 05:27:21 Using config file: c:\condor\condor_config
3/1 05:27:21 Using local config files: C:\Condor/condor_config.local 
3/1 05:27:21 DaemonCore: Command Socket at <130.116.147.60:9931> 
3/1 05:27:21 Setting resource limits not implemented! 
3/1 05:27:21 Starter communicating with condor_shadow <130.155.67.83:9149> 
3/1 05:27:21 Submitting machine is "student3-lu.minerals.csiro.au" 
3/1 05:27:36 File transfer completed successfully. 
3/1 05:27:37 Starting a VANILLA universe job with ID: 72.0 
3/1 05:27:37 IWD: C:\Condor/execute\dir_800 
3/1 05:27:37 Output file: C:\Condor/execute\dir_800\EA+mrAD.log
3/1 05:27:37 Renice expr "10" evaluated to 10
3/1 05:27:37 About to exec C:\Condor\execute\dir_800\condor_exec.exe EA+mrAD.egs 
3/1 05:27:37 Create_Process succeeded, pid=3536 
3/1 05:57:16 Got SIGQUIT.  Performing fast shutdown. 
3/1 05:57:16 ShutdownFast all jobs. 
3/1 05:57:16 Process exited, pid=3536, status=0 
3/1 05:57:17 Last process exited, now Starter is exiting 
3/1 05:57:17 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0

START LOG FROM EXECUTE MACHINE

3/1 05:27:16 DaemonCore: Command received via UDP from host <130.116.131.60:9593> 
3/1 05:27:16 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info) 
3/1 05:27:16 match_info called 
3/1 05:27:16 Received match <130.116.147.60:9836>#2100392750 
3/1 05:27:16 State change: match notification protocol successful 
3/1 05:27:16 Changing state: Unclaimed -> Matched 
3/1 05:27:16 DaemonCore: Command received via TCP from host <130.155.67.83:9600> 
3/1 05:27:16 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim) 
3/1 05:27:16 Request accepted. 
3/1 05:27:16 Remote owner is odw010@xxxxxxxx 
3/1 05:27:16 State change: claiming protocol successful 
3/1 05:27:16 Changing state: Matched -> Claimed 
3/1 05:27:20 DaemonCore: Command received via TCP from host <130.155.67.83:9540> 
3/1 05:27:20 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim) 
3/1 05:27:20 Got activate_claim request from shadow (<130.155.67.83:9540>) 
3/1 05:27:20 Remote job ID is 72.0 
3/1 05:27:21 Got universe "VANILLA" (5) from request classad 
3/1 05:27:21 State change: claim-activation protocol successful 
3/1 05:27:21 Changing activity: Idle -> Busy 
3/1 05:57:16 State change: claim timed out (condor_schedd gone?) 
3/1 05:57:16 Changing state and activity: Claimed/Busy -> Preempting/Killing 
3/1 05:57:17 DaemonCore: Command received via TCP from host <130.155.67.83:9584> 
3/1 05:57:17 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler) 
3/1 05:57:17 Got KILL_FRGN_JOB while in Preempting state, ignoring. 
3/1 05:57:17 DaemonCore: Command received via UDP from host <130.116.147.60:9482> 
3/1 05:57:17 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand()) 
3/1 05:57:17 Starter pid 800 exited with status 0 
3/1 05:57:17 State change: starter exited 
3/1 05:57:17 State change: No preempting claim, returning to owner 
3/1 05:57:17 Changing state and activity: Preempting/Killing -> Owner/Idle 
3/1 05:57:17 State change: IS_OWNER is false 
3/1 05:57:17 Changing state: Owner -> Unclaimed
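
To put numbers on the timing in the execute-machine logs above, here is a minimal Python sketch that subtracts the relevant timestamps. It assumes all four events fall on the same day (3/1) and that the starter and startd logs share one clock, since they come from the same machine. The function name is illustrative only.

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> int:
    """Elapsed seconds between two same-day HH:MM:SS log timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Timestamps copied from the startd and starter logs on the execute machine.
CLAIMED = "05:27:16"          # Changing state: Matched -> Claimed
JOB_STARTED = "05:27:37"      # Create_Process succeeded, pid=3536
CLAIM_TIMED_OUT = "05:57:16"  # State change: claim timed out (condor_schedd gone?)
GOT_SIGQUIT = "05:57:16"      # Got SIGQUIT.  Performing fast shutdown.

claim_lifetime = seconds_between(CLAIMED, CLAIM_TIMED_OUT)
job_runtime = seconds_between(JOB_STARTED, GOT_SIGQUIT)

if __name__ == "__main__":
    print("claim -> timeout:", claim_lifetime, "s")    # 1800 s
    print("job start -> kill:", job_runtime, "s")      # 1779 s
    print("shortfall:", claim_lifetime - job_runtime, "s")
```

For this instance the claim timed out exactly 1800 seconds after it was granted, while the job itself ran for 1779 seconds, which matches the observed "about 20 seconds under 30 minutes".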

-----------------------------------------------------------------------
Greg Hitchen
greg.hitchen@xxxxxxxx
CSIRO Exploration and Mining                    phone: +61 8 6436 8663
Australian Resources Research Centre (ARRC)     fax:   +61 8 6436 8555
Postal address:                                 mob:   0407 952 748
PO Box 1130, Bentley WA 6102, Australia
Street Address:
26 Dick Perry Avenue, Kensington WA 6151
-----------------------------------------------------------------------