
[Condor-users] Condor Eviction Problems



Hi All

We have implemented a "time of day" policy as shown in section
3.6.9.3, "Time of Day Policy", in the online manual for version 6.6.10.

It is stated there that:

WorkHours = ( (ClockMin >= 480 && ClockMin < 1020) && \
              (ClockDay > 0 && ClockDay < 6) ) 
AfterHours = ( (ClockMin < 480 || ClockMin >= 1020) || \
               (ClockDay == 0 || ClockDay == 6) )

START = $(AfterHours) && $(CPUIdle) && KeyboardIdle > $(StartIdleTime)

MachineBusy = ( $(WorkHours) || $(CPUBusy) || $(KeyboardBusy) )

By default, the MachineBusy macro is used to define the SUSPEND and
PREEMPT expressions. If you have changed these expressions at your site,
you will need to add $(WorkHours) to your SUSPEND and PREEMPT
expressions as appropriate.

Depending on your site, you might also want to avoid suspending jobs
during work hours, so that in the morning, if a job is running, it will
be immediately preempted, instead of being suspended for some length
of time: 

WANT_SUSPEND = $(AfterHours)
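
As a quick sanity check, the two time-window macros quoted above can be evaluated outside Condor. The sketch below is only an illustration: the function names are made up for this example, and it assumes ClockMin is minutes since midnight and ClockDay runs from 0 (Sunday) to 6 (Saturday).

```python
# Illustrative re-implementation of the WorkHours / AfterHours macros.
# Assumptions: clock_min = minutes since midnight (ClockMin),
#              clock_day = 0 (Sunday) .. 6 (Saturday) (ClockDay).
# The function names are hypothetical; they are not part of Condor.

def work_hours(clock_min: int, clock_day: int) -> bool:
    """08:00 <= time < 17:00 on Monday..Friday."""
    return (480 <= clock_min < 1020) and (0 < clock_day < 6)


def after_hours(clock_min: int, clock_day: int) -> bool:
    """Before 08:00, from 17:00 onwards, or any time on a weekend."""
    return (clock_min < 480 or clock_min >= 1020) or (clock_day in (0, 6))


def to_clock_min(hour: int, minute: int) -> int:
    """Convert a wall-clock time to minutes since midnight."""
    return hour * 60 + minute


if __name__ == "__main__":
    # 22:50 on a weekday falls outside work hours.
    print(after_hours(to_clock_min(22, 50), 3))
    # 11:50 on a weekday falls inside work hours.
    print(work_hours(to_clock_min(11, 50), 3))
```

The two macros are exact complements, so for any time and day exactly one of them is true.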

We seem to have MANY jobs being evicted after about 30 minutes. See the
log file excerpt at the end of this email. Could our config be the problem?

Here is our current configuration:

CONDOR_CONFIG FILE

***********************************************************************

MachineBusy = ( $(WorkHours) || $(CPUBusy) || $(KeyboardBusy) ) 

WorkHours = ( (ClockMin >= 480 && ClockMin < 1020) && \
              (ClockDay > 0 && ClockDay < 6) ) 
AfterHours = ( (ClockMin < 480 || ClockMin >= 1020) || \
               (ClockDay == 0 || ClockDay == 6) )

##  The RANK expression controls which jobs this machine prefers to
##  run over others.  Some examples from the manual include:
##    RANK = TARGET.ImageSize
##    RANK = (Owner == "coltrane") + (Owner == "tyner") \
##                  + ((Owner == "garrison") * 10) + (Owner == "jones")
##  By default, RANK is always 0, meaning that all jobs have an equal
##  ranking.
#RANK			= 0


#####################################################################
##  This where you choose the configuration that you would like to
##  use.  It has no defaults so it must be defined.  We start this
##  file off with the UWCS_* policy.
######################################################################

##  Also here is what is referred to as the TESTINGMODE_*, which is
##  a quick hardwired way to test Condor.
##  Replace UWCS_* with TESTINGMODE_* if you wish to do testing mode.
##  For example:
##  WANT_SUSPEND 		= $(UWCS_WANT_SUSPEND)
##  becomes
##  WANT_SUSPEND 		= $(TESTINGMODE_WANT_SUSPEND)

WANT_SUSPEND 		= $(UWCS_WANT_SUSPEND)
#WANT_SUSPEND		= $(CSIRO_WANT_SUSPEND)
#WANT_VACATE		= $(UWCS_WANT_VACATE)
WANT_VACATE		= $(CSIRO_WANT_VACATE)
#START			= $(UWCS_START)
START			= $(CSIRO_START)
SUSPEND			= $(UWCS_SUSPEND)
#SUSPEND		= $(CSIRO_SUSPEND)
CONTINUE		= $(UWCS_CONTINUE)
#CONTINUE		= $(CSIRO_CONTINUE)
PREEMPT			= $(UWCS_PREEMPT)
#PREEMPT		= $(CSIRO_PREEMPT)
KILL			= $(UWCS_KILL)
#KILL			= $(CSIRO_KILL)
PERIODIC_CHECKPOINT	= $(UWCS_PERIODIC_CHECKPOINT)
PREEMPTION_REQUIREMENTS	= $(UWCS_PREEMPTION_REQUIREMENTS)
PREEMPTION_RANK		= $(UWCS_PREEMPTION_RANK)
NEGOTIATOR_PRE_JOB_RANK = $(UWCS_NEGOTIATOR_PRE_JOB_RANK)
NEGOTIATOR_POST_JOB_RANK = $(UWCS_NEGOTIATOR_POST_JOB_RANK)

#####################################################################
## This is the default CSIRO configuration.
#####################################################################

CSIRO_WANT_SUSPEND	= False
CSIRO_WANT_VACATE	= False
CSIRO_START		= $(AfterHours) && $(CPUIdle) && KeyboardIdle > $(StartIdleTime)
CSIRO_SUSPEND		= False
CSIRO_CONTINUE		= True
CSIRO_PREEMPT		= False
CSIRO_KILL		= False

CSIRO_NUM_CPUS		= 1

CSIRO_JOB_RENICE_INCREMENT	= 10

***********************************************************************
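
Because the file mixes UWCS_* and CSIRO_* settings, it may help to list which policy family each active expression points at. The sketch below does this for the uncommented assignments copied from the configuration above; the helper names are hypothetical, and it ignores anything that condor_config.local might override.

```python
import re

# Active (uncommented) policy assignments, copied from the config above.
ACTIVE = {
    "WANT_SUSPEND": "$(UWCS_WANT_SUSPEND)",
    "WANT_VACATE": "$(CSIRO_WANT_VACATE)",
    "START": "$(CSIRO_START)",
    "SUSPEND": "$(UWCS_SUSPEND)",
    "CONTINUE": "$(UWCS_CONTINUE)",
    "PREEMPT": "$(UWCS_PREEMPT)",
    "KILL": "$(UWCS_KILL)",
}


def policy_source(value):
    """Return the prefix (e.g. 'UWCS') of a '$(PREFIX_NAME)' reference."""
    match = re.fullmatch(r"\$\(([A-Z]+)_\w+\)", value.strip())
    return match.group(1) if match else None


def group_by_source(assignments):
    """Group expression names by the policy family they reference."""
    groups = {}
    for name, value in assignments.items():
        groups.setdefault(policy_source(value), []).append(name)
    return groups


if __name__ == "__main__":
    for source, names in group_by_source(ACTIVE).items():
        print(source, "->", ", ".join(names))
```

Read this way, only START and WANT_VACATE use the CSIRO_* values; SUSPEND, CONTINUE, PREEMPT, KILL and WANT_SUSPEND still point at the UWCS_* defaults, so CSIRO_PREEMPT = False and CSIRO_KILL = False are defined but not referenced by the active settings.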

EXCERPT FROM EXECUTING MACHINE'S SHADOW LOG

2/11 22:50:10 ******************************************************
2/11 22:50:10 ** condor_starter (CONDOR_STARTER) STARTING UP
2/11 22:50:10 ** C:\Condor\bin\condor_starter.exe
2/11 22:50:10 ** $CondorVersion: 6.6.10 Jun 22 2005 $
2/11 22:50:10 ** $CondorPlatform: INTEL-WINNT50 $
2/11 22:50:10 ** PID = 3296
2/11 22:50:10 ******************************************************
2/11 22:50:10 Using config file: C:\Condor\condor_config
2/11 22:50:10 Using local config files: C:\Condor/condor_config.local
2/11 22:50:10 DaemonCore: Command Socket at <138.194.10.128:9655>
2/11 22:50:10 Setting resource limits not implemented!
2/11 22:50:10 Starter communicating with condor_shadow <130.155.67.83:9805>
2/11 22:50:10 Submitting machine is "student3-lu.minerals.CSIRO.AU"
2/11 22:50:16 File transfer completed successfully.
2/11 22:50:16 Starting a VANILLA universe job with ID: 5.0
2/11 22:50:16 IWD: C:\Condor/execute\dir_3296
2/11 22:50:16 Output file: C:\Condor/execute\dir_3296\D7EG9AD.log
2/11 22:50:16 Renice expr "10" evaluated to 10
2/11 22:50:16 About to exec C:\Condor\execute\dir_3296\condor_exec.exe D7EG9AD.egs
2/11 22:50:16 Create_Process succeeded, pid=1772
2/11 23:19:41 Got SIGQUIT.  Performing fast shutdown.
2/11 23:19:41 ShutdownFast all jobs.
2/11 23:19:41 Process exited, pid=1772, status=-1073741510
2/11 23:19:41 Last process exited, now Starter is exiting
2/11 23:19:41 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
2/12 11:50:33 ******************************************************
2/12 11:50:33 ** condor_starter (CONDOR_STARTER) STARTING UP
2/12 11:50:33 ** C:\Condor\bin\condor_starter.exe
2/12 11:50:33 ** $CondorVersion: 6.6.10 Jun 22 2005 $
2/12 11:50:33 ** $CondorPlatform: INTEL-WINNT50 $
2/12 11:50:33 ** PID = 1584
2/12 11:50:33 ******************************************************
2/12 11:50:33 Using config file: C:\Condor\condor_config
2/12 11:50:33 Using local config files: C:\Condor/condor_config.local
2/12 11:50:33 DaemonCore: Command Socket at <138.194.10.128:9230>
2/12 11:50:33 Setting resource limits not implemented!
2/12 11:50:33 Starter communicating with condor_shadow <130.155.67.83:9733>
2/12 11:50:33 Submitting machine is "student3-lu.minerals.CSIRO.AU"
2/12 11:50:40 File transfer completed successfully.
2/12 11:50:40 Starting a VANILLA universe job with ID: 6.0
2/12 11:50:40 IWD: C:\Condor/execute\dir_1584
2/12 11:50:40 Output file: C:\Condor/execute\dir_1584\D7EG9AE.log
2/12 11:50:40 Renice expr "10" evaluated to 10
2/12 11:50:40 About to exec C:\Condor\execute\dir_1584\condor_exec.exe D7EG9AE.egs
2/12 11:50:40 Create_Process succeeded, pid=2260
2/12 12:20:06 Got SIGQUIT.  Performing fast shutdown.
2/12 12:20:06 ShutdownFast all jobs.
2/12 12:20:06 Process exited, pid=2260, status=-1073741510
2/12 12:20:07 Last process exited, now Starter is exiting
2/12 12:20:07 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
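
To put a number on "about 30 minutes", the run time of each job can be read off the log lines above. The sketch below is a rough helper written for this excerpt only: the function names are hypothetical, and it assumes each job starts and is shut down on the same calendar day, which holds for both runs shown.

```python
# Selected lines copied from the log excerpt above.
LOG_LINES = [
    "2/11 22:50:16 Create_Process succeeded, pid=1772",
    "2/11 23:19:41 Got SIGQUIT.  Performing fast shutdown.",
    "2/11 23:19:41 Process exited, pid=1772, status=-1073741510",
    "2/12 11:50:40 Create_Process succeeded, pid=2260",
    "2/12 12:20:06 Got SIGQUIT.  Performing fast shutdown.",
    "2/12 12:20:06 Process exited, pid=2260, status=-1073741510",
]


def seconds_of_day(line):
    """Seconds since midnight for a 'M/D HH:MM:SS ...' log line."""
    clock = line.split()[1]
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def run_times(lines):
    """Seconds between each Create_Process and the following SIGQUIT.

    Assumes start and shutdown happen on the same calendar day.
    """
    durations, started = [], None
    for line in lines:
        if "Create_Process succeeded" in line:
            started = seconds_of_day(line)
        elif "Got SIGQUIT" in line and started is not None:
            durations.append(seconds_of_day(line) - started)
            started = None
    return durations


def as_unsigned_hex(status):
    """Show a signed 32-bit exit status as unsigned hexadecimal."""
    return f"0x{status & 0xFFFFFFFF:08X}"


if __name__ == "__main__":
    for secs in run_times(LOG_LINES):
        print(f"{secs // 60} min {secs % 60} s")
    print(as_unsigned_hex(-1073741510))
```

Both jobs ran for just under 30 minutes (1765 and 1766 seconds) before the starter received SIGQUIT. The exit status -1073741510 is 0xC000013A when viewed as an unsigned 32-bit value, which on Windows is STATUS_CONTROL_C_EXIT, so the process was terminated from outside rather than crashing on its own.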


-----------------------------------------------------------------------
Greg Hitchen
greg.hitchen@xxxxxxxx
CSIRO Exploration and Mining                    phone: +61 8 6436 8663
Australian Resources Research Centre (ARRC)     fax:   +61 8 6436 8555
Postal address:                                 mob:   0407 952 748
PO Box 1130, Bentley WA 6102, Australia
Street Address:
26 Dick Perry Avenue, Kensington WA 6151
-----------------------------------------------------------------------