Re: [Condor-users] Suspend and resume jobs by on demand
- Date: Thu, 19 Jul 2007 15:48:14 -0500
- From: Dan Bradley <dan@xxxxxxxxxxxx>
- Subject: Re: [Condor-users] Suspend and resume jobs by on demand
Here is an example of a configuration that suspends jobs on one batch
slot while the other batch slot is busy. It is based on a working
configuration, but it is untested in its specific form below. It is
designed to work on a one-cpu system, using the new 6.9.3 "slot"
terminology in place of the old "vm" terminology. It can be extended to
SMP machines in a fairly straightforward way, and it can be translated
into the old "vm" syntax for 6.8 Condor easily enough.
# You may want to advertise double the amount of system memory
# if you have enough virtual memory to allow the foreground job
# to consume all of memory while the suspended job gets pushed
# into swap memory. There is currently no convenient way to
# tell Condor you want to oversubscribe your memory, so you
# have to hard-code the amount of memory you want to advertise
# by uncommenting and filling in the following:
# Memory = TWICE_YOUR_SYSTEM_MEMORY
NUM_CPUS = 2
# So that the suspension slot can see the state
# of the other slot, we need to have some things
# advertised about each slot in the ClassAds of
# all the other slots on the same machine:
STARTD_SLOT_EXPRS = State, RemoteUser, CurrentRank
# For informational purposes, put IsSuspensionSlot
# in the startd ClassAd:
STARTD_ATTRS = IsSuspensionSlot
# Slot 1 is the "normal" batch slot
SLOT1_IsSuspensionSlot = False
# Slot 2 suspends its jobs, rather than preempting them
SLOT2_IsSuspensionSlot = True
START = ($(SLOT1_START)) || ($(SLOT2_START))
CONTINUE = ($(SLOT1_CONTINUE)) || ($(SLOT2_CONTINUE))
PREEMPT = ($(SLOT1_PREEMPT)) || ($(SLOT2_PREEMPT))
SUSPEND = ($(SLOT1_SUSPEND)) || ($(SLOT2_SUSPEND))
# The purpose of the following expression is to prevent a
# job from starting on slot 1 if it has less priority to run
# than the job already running on slot 2, because once we let
# a job run on slot 1, the slot 2 job will be suspended.
# This expression refers to attributes that are only defined
# when requirements are being evaluated by the Negotiator:
# SubmittorPrio [sic] and RemoteUserPrio
SLOT1_HAS_PRIO = SubmittorPrio =?= UNDEFINED || \
  slot2_RemoteUserPrio =?= UNDEFINED || \
  SubmittorPrio < 1.2 * slot2_RemoteUserPrio || \
  slot2_CurrentRank =?= UNDEFINED || \
  MY.Rank > slot2_CurrentRank
# Slot 1 is a normal execution slot
SLOT1_START = SlotID == 1 && TARGET.IsSuspensionJob =!= true && \
  ($(SLOT1_HAS_PRIO))
SLOT1_CONTINUE = SlotID == 1 && ($(TESTINGMODE_CONTINUE))
SLOT1_PREEMPT = SlotID == 1 && ($(TESTINGMODE_PREEMPT))
SLOT1_SUSPEND = SlotID == 1 && ($(TESTINGMODE_SUSPEND))
# Slot 2 is for jobs that get suspended while slot 1 is busy
SLOT2_START = SlotID == 2 && TARGET.IsSuspensionJob =?= true
SLOT2_CONTINUE = SlotID == 2 && (slot1_State =?= "Unclaimed" || \
  slot1_State =?= "Owner")
SLOT2_PREEMPT = FALSE
SLOT2_SUSPEND = SlotID == 2 && slot1_State =?= "Claimed"
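As a rough way to sanity-check the policy above, the slot 1 priority test and the slot 2 suspend/continue rules can be modeled in a few lines of Python. This is only a sketch, not Condor code: the function names are made up for illustration, None stands in for UNDEFINED, and the "Matched" and "Preempting" states in the demo loop are included only as examples of states that match neither slot 2 rule.

```python
from typing import Optional


def slot1_has_prio(submittor_prio: Optional[float],
                   slot2_user_prio: Optional[float],
                   my_rank: float,
                   slot2_current_rank: Optional[float]) -> bool:
    """Hypothetical model of SLOT1_HAS_PRIO; None plays the role of UNDEFINED."""
    return (
        submittor_prio is None                      # SubmittorPrio =?= UNDEFINED
        or slot2_user_prio is None                  # RemoteUserPrio of slot 2 =?= UNDEFINED
        or submittor_prio < 1.2 * slot2_user_prio   # value below 1.2x that of the slot 2 user
        or slot2_current_rank is None               # CurrentRank of slot 2 =?= UNDEFINED
        or my_rank > slot2_current_rank             # machine ranks the new job higher
    )


def slot2_action(slot1_state: Optional[str]) -> str:
    """Hypothetical model of SLOT2_SUSPEND and SLOT2_CONTINUE for a job on slot 2."""
    if slot1_state == "Claimed":                    # SLOT2_SUSPEND
        return "suspend"
    if slot1_state in ("Unclaimed", "Owner"):       # SLOT2_CONTINUE
        return "continue"
    return "unchanged"                              # neither expression is true


if __name__ == "__main__":
    for state in ("Owner", "Unclaimed", "Matched", "Claimed", "Preempting", None):
        print(state, "->", slot2_action(state))
    # A new job whose user has priority value 20 against a slot 2 user at 10:
    print(slot1_has_prio(20.0, 10.0, my_rank=0.0, slot2_current_rank=0.0))
```

In this model, a slot 2 job keeps whatever state it is in while slot 1 is in any state other than the three named in the configuration.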
To submit a suspension job, you could put something like the following
in your submit file:
+IsSuspensionJob = True
requirements = TARGET.IsSuspensionSlot
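To see how these two submit-file lines interact with the START expressions in the configuration, here is a small Python sketch of the matching, with job and slot ads as plain dictionaries. It is an illustration only: the helper names are hypothetical, missing keys stand in for UNDEFINED, and the SLOT1_HAS_PRIO term is reduced to a boolean parameter.

```python
def slot_start(slot_id: int, job: dict, has_prio: bool = True) -> bool:
    """Hypothetical model of SLOT1_START / SLOT2_START."""
    is_suspension_job = job.get("IsSuspensionJob") is True   # =?= true
    if slot_id == 1:
        return (not is_suspension_job) and has_prio           # =!= true, plus priority test
    if slot_id == 2:
        return is_suspension_job
    return False


def job_requirements(job: dict, slot: dict) -> bool:
    """In this sketch only suspension jobs require TARGET.IsSuspensionSlot."""
    if job.get("IsSuspensionJob") is True:
        return slot.get("IsSuspensionSlot") is True
    return True


# The two slots from the example configuration.
SLOTS = {1: {"IsSuspensionSlot": False}, 2: {"IsSuspensionSlot": True}}


def matching_slots(job: dict) -> list:
    """Return the slot IDs where both the slot and the job accept the match."""
    return [slot_id for slot_id, ad in SLOTS.items()
            if slot_start(slot_id, job) and job_requirements(job, ad)]


if __name__ == "__main__":
    print(matching_slots({"IsSuspensionJob": True}))
    print(matching_slots({}))
```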
The example policy above does not prevent preemption of suspension jobs
by other suspension jobs. If you want to prevent that, you could do
something like this:
# Do not preempt suspension jobs (for up to 24 hours)
MaxJobRetirementTime = (MY.IsSuspensionSlot =?= True) * 3600 * 24
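The arithmetic in that expression works out as follows, shown here as a Python sketch. It assumes, as the expression itself implies, that a true comparison counts as 1 and anything else as 0; the function name is hypothetical.

```python
from typing import Optional


def max_job_retirement_time(is_suspension_slot: Optional[bool]) -> int:
    """Seconds of retirement time: 24 hours on the suspension slot, 0 elsewhere."""
    # (MY.IsSuspensionSlot =?= True) is modeled as an identity check against True,
    # so False and None (UNDEFINED) both contribute 0.
    return int(is_suspension_slot is True) * 3600 * 24


if __name__ == "__main__":
    print(max_job_retirement_time(True))    # suspension slot
    print(max_job_retirement_time(False))   # normal slot
```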
Hope that helps.
--Dan
Rick Lan wrote:
Hi all
I was wondering if someone has experience with or suggestions for the
following setup. We have Windows machines, so checkpointing is not
supported. Preemption is off because we don't want to lose running
progress. Is there a way to suspend running jobs (which usually take days)
to run newly submitted jobs (which usually take minutes or hours) and to
resume the suspended jobs once the short jobs finish?
I was thinking that I could set NUM_CPUS to double the actual number
of CPUs and set the STARTD policy so that when half of the CPUs are
running jobs, the other half can't match a job. When short jobs come in,
identified either by accounting group or by a config variable, suspend
the running jobs and run the short jobs on the other half of the CPUs.
Is this configuration feasible?
Thanks
Rick