
Re: [HTCondor-users] how to change requested memory (cpus) for running job



Hello Todd,

many thanks for your suggestions. 
I have to think about whether this could be a possible solution for us.
Most of our jobs are submitted with an assumed infinite run time and finish
normally after around 14 days, up to 110 days.
The long-running jobs are mostly small (1-5 GB), but around 30% are big (up to
150 GB and even more).
The short-running jobs (from a couple of hours up to 14 days) are nearly
always small, but they are much rarer (around 10% of all jobs).

The cluster resources are always a problem (memory more than CPUs).
Therefore the users should request more memory, but if everyone asks for a
factor of 2 more, then half of the jobs would have to wait for several weeks.
The needed memory can be estimated roughly, but the estimates are sometimes
wrong.

Best regards
Harald



On Wednesday, 25 January 2017 15:00:54 CET Todd Tannenbaum wrote:
> On 1/25/2017 12:20 PM, Harald van Pee wrote:
> > Dear all,
> > 
> > does no answer mean that there is no expert around these days, or
> > is it just not possible with HTCondor to change any ClassAds for
> > a running job?
> 
> It means that no obvious way comes to mind to do what you originally
> asked for, which is to edit the amount of memory allocated to a startd
> slot that is currently running a job.
> 
> Having said that, thanks to ClassAds and policy expressions, HTCondor is
> pretty flexible, so maybe we can do some creative brainstorming and get
> something close enough to meet your needs.
> 
> First off, it seems that if jobs cannot accurately state the resources
> (like memory) they need up front, you have two choices:
>    1. be conservative and reserve more resources than the job asked for,
> at the cost of lower machine utilization, or
>    2. be optimistic and hope you can give more resources on-the-fly to a
> job that uses more than it requested, and then start killing jobs if you
> exhaust a resource.
> 
> With your suggestion below, what do you want to happen if there is no
> more available memory on the machine?  Let the long-running job use
> whatever it wants and kill off short-running jobs?  If so, that is an
> instance of choice #2 above.
> 
> So perhaps we can get something close to choice #2. One could likely
> craft a startd policy that prioritizes running long-running jobs, which
> will request their worst-case memory usage, and then "backfills" the
> machine with short-running jobs that can use any leftover memory not
> being utilized by the long-running jobs.  The idea is to
>    1. tell the startd it has double the resources (RAM/CPUs) that the
> machine really has, then
>    2. configure slot1 as a partitionable slot that will only accept
> long-running jobs, which will never be preempted, and then
>    3. configure slots 2 and above as "backfill" static slots that only
> accept short jobs if there are free resources left over from the
> long-running jobs, and will also preempt short jobs if the memory usage
> of the long-running jobs increases.
> 
> So the idea is that a server with 16 cores and 256 GB of RAM would have
> one partitionable slot with 16 cores/256 GB RAM reserved for
> non-preemptable (long-running) jobs as slot 1, and then slots 2 - 17
> would be static slots, each with 1 core/16 GB RAM (or 2 cores/32 GB,
> whatever), that would backfill however many cores and however much
> memory is left unused by the non-preemptable slot 1 jobs.
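> 
> Here is a very rough, untested sketch of what I mean for such a 16
> core / 256 GB machine (the IsLongJob attribute is just something I made
> up that the submit files would have to set via "+IsLongJob = True", and
> the preempt-backfill-jobs-on-memory-pressure part of step 3 is not
> shown):
> 
>   # Advertise double the physical 16 cores / 256 GB RAM (MEMORY is in MB)
>   NUM_CPUS = 32
>   MEMORY   = 524288
> 
>   # Slot type 1: one partitionable slot with the real resources,
>   # reserved for long-running jobs that are never preempted
>   SLOT_TYPE_1               = cpus=16, memory=262144
>   SLOT_TYPE_1_PARTITIONABLE = TRUE
>   NUM_SLOTS_TYPE_1          = 1
> 
>   # Slot type 2: sixteen static 1 core / 16 GB "backfill" slots
>   SLOT_TYPE_2      = cpus=1, memory=16384
>   NUM_SLOTS_TYPE_2 = 16
> 
>   # Long jobs go only to the partitionable slot, short jobs only to
>   # the backfill slots
>   START = ( SlotTypeID == 1 && TARGET.IsLongJob =?= True ) || \
>           ( SlotTypeID != 1 && TARGET.IsLongJob =!= True )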
> 
> If I took the time to write up a startd policy like the above, is it
> something you think you'd want to use?
> 
> regards,
> Todd
> 
> > The idea is just to change the reserved memory in such a way that the
> > available memory decreases, so that no other job with a big memory
> > request can start which could crash the machine or a long-running job.
> > The available memory should not go to 0 if there is enough memory
> > available, and the available memory should just increase again when the
> > job finishes.
> > Therefore a reread of the reserved-memory ClassAd by the startd, without
> > killing any job, seems to be perfect, if possible.
> > 
> > We are working on checkpointing our jobs, but for some of them it does
> > not seem to be possible.
> > 
> > Any ideas would be welcome
> > 
> > Harald
> > 
> > On Monday 23 January 2017 16:21:00 Harald van Pee wrote:
> >> Hi Jason,
> >> 
> >> yes, it's condor_qedit, not qalter. qalter works for PBS/Torque even
> >> for a running job; condor_qedit just changes RequestMemory but does
> >> not change any reservation for a running job.
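> >> For example, for a running job with the (made-up) id 1234.0, as
> >> administrator I can run
> >> 
> >>   condor_qedit 1234.0 RequestMemory 122880
> >> 
> >> (RequestMemory is in MB, so 122880 = 120 GB) and the job ad shows the
> >> new value, but the slot the job is already running in keeps its old
> >> memory reservation.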
> >> 
> >> Harald
> >> 
> >> On Monday 23 January 2017 16:11:21 Jason Patton wrote:
> >>> Oh, I just noticed the disclaimer about *running* jobs. Not sure about
> >>> changing the ClassAd of running jobs.
> >>> 
> >>> Jason Patton
> >>> 
> >>> On Mon, Jan 23, 2017 at 9:09 AM, Jason Patton <jpatton@xxxxxxxxxxx> wrote:
> >>>> Harald,
> >>>> 
> >>>> Yes! Check out condor_qedit: http://research.cs.wisc.edu/htcondor/manual/v8.4/condor_qedit.html
> >>>> 
> >>>> Jason Patton
> >>>> 
> >>>> On Mon, Jan 23, 2017 at 9:04 AM, Harald van Pee <pee@xxxxxxxxxxxxxxxxx> wrote:
> >>>>> Hi all,
> >>>>> 
> >>>>> is it possible to change the reserved memory for a running job?
> >>>>> 
> >>>>> The problem is, we have a cluster with very long-running jobs (8
> >>>>> weeks on average) in the vanilla universe. We never kill any job
> >>>>> automatically.
> >>>>> 
> >>>>> Now it can happen that a user reserves 60 GB for his job and finds
> >>>>> out after one week of running that it will need 120 GB. Most often
> >>>>> this will be no problem because there is enough memory available.
> >>>>> But it would be a problem if another job started and requested
> >>>>> another 60 GB. This we could avoid if at least the administrator
> >>>>> could just change RequestMemory to 120 GB.
> >>>>> With qalter this is possible for an idle job in the queue, but what
> >>>>> can I do for a running job?
> >>>>> 
> >>>>> Any suggestions?
> >>>>> 
> >>>>> We use condor 8.4.10.
> >>>>> 
> >>>>> Best regards
> >>>>> Harald