
Re: [HTCondor-users] Stop Vanilla jobs from eviction/restart



Hi Todd, thank you for your response.

I compared our settings against the manual section you linked, and to my surprise they already match what it says is needed to disable preemption.

Even so, many jobs have been restarted several times, so users are waiting a long time for their results. Below are the job counts and config values I see; do you notice anything unusual? All running jobs have equal Rank as well, although not equal user priority.

$ condor_q -constraint '(JobRunCount > 2)' | wc -l
54
$ condor_q -constraint '(JobRunCount > 4)' | wc -l
32
$ condor_q -constraint '(JobRunCount > 5)' | wc -l
32
$ condor_q -constraint '(JobRunCount > 3)' | wc -l
33
$ condor_q -constraint '(RANK == 0)'  |wc -l
65
$ condor_q -constraint '(RANK == 0.0)'  |wc -l
65
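
The threshold queries above can be collapsed into a single pass that shows how many jobs sit at each JobRunCount value. The sketch below assumes this condor_q supports -autoformat; the function name run_count_histogram is made up for illustration. Note also that wc -l counts any header and summary lines condor_q prints, so the numbers above may slightly overstate the job counts.

```shell
# Hypothetical helper: tally how many jobs have each JobRunCount value.
# Reads one value per line on stdin; non-numeric lines (for example
# "undefined" for jobs that have never run) are skipped.
run_count_histogram() {
    awk '/^[0-9]+$/ { n[$1]++ }
         END { for (k in n) print k, n[k] }' | sort -n
}

# Assumed usage on a submit host (requires condor_q with -autoformat):
#   condor_q -autoformat JobRunCount | run_count_histogram
```

Each output line is a run count followed by the number of jobs with that count, sorted by run count.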


$ condor_config_val -dump | grep -e ^CLAIM -e NEGOTIATOR_CONSIDER -e PREEMPT -e START
CLAIM_WORKLIFE = 540
HOSTALLOW_READ_STARTD = $(HOSTALLOW_READ), $(FLOCK_FROM)
HOSTALLOW_WRITE_STARTD = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
MAX_STARTD_LOG = 10000000
MAX_STARTER_LOG = 10000000
NEGOTIATOR_CONSIDER_PREEMPTION = False
PREEMPT = False
PREEMPTION_RANK = 0
PREEMPTION_REQUIREMENTS = False
START = (True) && ( ($(START_SINGLE_CORE_JOB)) || ($(START_WHOLE_MACHINE_JOB)) )
START_SINGLE_CORE_JOB = TARGET.RequiresWholeMachine =!= True && MY.CAN_RUN_WHOLE_MACHINE == False && $(WHOLE_MACHINE_SLOT_STATE) =!= "Claimed"
START_WHOLE_MACHINE_JOB = TARGET.RequiresWholeMachine =?= True && MY.CAN_RUN_WHOLE_MACHINE
STARTD = $(SBIN)/condor_startd
STARTD_ADDRESS_FILE = $(LOG)/.startd_address
STARTD_ATTRS = COLLECTOR_HOST_STRING
STARTD_DEBUG =
STARTD_EXPRS =  CAN_RUN_WHOLE_MACHINE
STARTD_JOB_EXPRS = ImageSize, ExecutableSize, JobUniverse, NiceUser
STARTD_LOG = $(LOG)/StartLog
STARTD_SLOT_EXPRS =  State
STARTER = $(SBIN)/condor_starter
STARTER_DEBUG = D_NODATE
STARTER_LIST = STARTER, STARTER_STANDARD
STARTER_LOCAL = $(SBIN)/condor_starter
STARTER_LOG = $(LOG)/StarterLog
STARTER_STANDARD = $(SBIN)/condor_starter.std
STARTIDLETIME = 15 * $(MINUTE)
TESTINGMODE_PREEMPT = False
TESTINGMODE_PREEMPTION_RANK = 0
TESTINGMODE_PREEMPTION_REQUIREMENTS = False
TESTINGMODE_START = True
UWCS_PREEMPT = ( ((Activity == "Suspended") && ($(ActivityTimer) > $(MaxSuspendTime))) || (SUSPEND && (WANT_SUSPEND == False)) )
UWCS_PREEMPTION_RANK = (RemoteUserPrio * 1000000) - TARGET.ImageSize
UWCS_PREEMPTION_REQUIREMENTS = ( $(StateTimer) > (1 * $(HOUR)) && RemoteUserPrio > SubmittorPrio * 1.2 ) || (MY.NiceUser == True)
UWCS_START = ( (KeyboardIdle > $(StartIdleTime)) && ( $(CPUIdle) || (State != "Unclaimed" && State != "Owner")) )
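
Because most of the grep patterns above are unanchored substrings, the output includes unrelated macros such as STARTD_LOG and omits any macro whose name contains none of the patterns, for example the startd RANK expression. A tighter filter is sketched below. It assumes the dump prints one "NAME = value" pair per line, as in the output above; the function name show_policy_knobs and the particular list of macro names are illustrative choices, not an authoritative list.

```shell
# Hypothetical helper: keep only whole-name matches for a short list of
# policy macros. Reads "NAME = value" lines on stdin.
# The name list is an assumption; adjust it for your pool.
show_policy_knobs() {
    grep -iE '^(RANK|START|SUSPEND|CONTINUE|PREEMPT|KILL|WANT_SUSPEND|WANT_VACATE|PREEMPTION_REQUIREMENTS|PREEMPTION_RANK|NEGOTIATOR_CONSIDER_PREEMPTION|CLAIM_WORKLIFE|MAXJOBRETIREMENTTIME)[[:space:]]*='
}

# Assumed usage:
#   condor_config_val -dump | show_policy_knobs
```

Matching is case-insensitive, so a name written as MaxJobRetirementTime is also picked up.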




On Thu, Jun 20, 2013 at 1:13 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 6/20/2013 12:46 PM, Prem Kumar wrote:
I meant that this link is where I got the macros from:
https://lists.cs.wisc.edu/archive/htcondor-users/2007-November/msg00091.shtml



There is a section in the Manual that talks all about disabling preemption/eviction.  See

http://research.cs.wisc.edu/htcondor/manual/v8.0/3_5Policy_Configuration.html#SECTION00459500000000000000


Todd



On Thu, Jun 20, 2013 at 12:42 PM, Prem Kumar <prem.it.kumar@xxxxxxxxx> wrote:

    Dear All,

    I have the following macros defined for our Condor pool, but I still
    cannot stop vanilla jobs from being evicted. I referred to this link:
    https://lists.cs.wisc.edu/archive/htcondor-users/pre-2004-June/msg00061.shtml

    SUSPEND = False
    PREEMPT = False
    CONTINUE = True
    WANT_SUSPEND = False
    WANT_VACATE = False

    These are dedicated rack-based execution hosts that we manage
    centrally, so no individual owns these resources.
    The issue is that our users run vanilla jobs that cannot be
    checkpointed, or rather, we don't know how to checkpoint SAS and
    R jobs. Any thoughts on how we can get around this?

    best!







--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685