
Re: [HTCondor-users] how to change requested memory (cpus) for running job



On 1/25/2017 12:20 PM, Harald van Pee wrote:
Dear all,

does no answer mean that there is no expert around these days or
is it just not possible with HTCondor to change any ClassAds for
a running job?


It means that no obvious way comes to mind to do what you originally asked for, which is to edit the amount of memory allocated to a startd slot that is currently running a job.

Having said that, thanks to ClassAds and policy expressions, HTCondor is pretty flexible, so maybe we can do some creative brainstorming and get something close enough to meet your needs.

First off, it seems that if jobs cannot accurately state the resources (like memory) they need up front, you have two choices: 1. be conservative and reserve more resources than the job asked for, at the cost of lower machine utilization, or 2. be optimistic and hope you can give more resources on-the-fly to a job that uses more than it requested, and then start killing jobs if you exhaust a resource.

With your suggestion below, what do you want to happen if there isn't any more available memory on the machine? Let the long-running job use whatever it wants and kill off short-running jobs? If so, that is an instance of choice #2 above.

So perhaps we can get something close to choice #2. One could likely craft a startd policy that prioritizes long-running jobs, which request their worst-case memory usage up front, and then "backfills" the machine with short-running jobs that can use any memory left over that is not being utilized by the long-running jobs. The idea is to 1. tell the startd it has double the resources (RAM/CPUs) that the machine really has, then 2. configure slot1 as a partitionable slot that will only accept long-running jobs, which will never be preempted, and then 3. configure slots 2 and above as "backfill" static slots that only accept short jobs if there are free resources left over from the long-running jobs, and that will also preempt short jobs if the memory usage of the long-running jobs increases.

So the idea is: a server with 16 cores and 256 GB of RAM would have one partitionable slot (slot 1) with 16 cores/256 GB RAM reserved for non-preemptable (long-running) jobs, and then slots 2 - 17 would be static slots, each with 1 core/16 GB RAM (or 2 cores/32 GB, whatever), that would backfill however many cores and however much memory is left unused by the non-preemptable slot 1 jobs.
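
To make the idea concrete, here is a very rough and untested sketch of what the startd config on such a machine might look like; the numbers, and the LongJob attribute that long jobs would set in their submit files, are only illustrative assumptions, and the expression that preempts backfill jobs under real memory pressure is not worked out here:

  # Advertise twice the real resources so the backfill slots can
  # overlap with the partitionable slot (memory values in MBytes):
  NUM_CPUS = 32
  MEMORY   = 524288

  # Slot 1: partitionable, the real 16 cores / 256 GB, long jobs only
  SLOT_TYPE_1               = cpus=16, memory=262144
  SLOT_TYPE_1_PARTITIONABLE = True
  NUM_SLOTS_TYPE_1          = 1

  # Slots 2-17: static backfill slots, 1 core / 16 GB each, short jobs only
  SLOT_TYPE_2      = cpus=1, memory=16384
  NUM_SLOTS_TYPE_2 = 16

  # Long jobs would put  +LongJob = True  in their submit file (an
  # attribute name made up for this sketch):
  START = ( SlotTypeID == 1 && TARGET.LongJob =?= True ) || \
          ( SlotTypeID == 2 && TARGET.LongJob =!= True )

  # Slot 1 jobs are never preempted; the expression that would evict
  # backfill jobs when the long jobs' memory usage grows (e.g. fed by a
  # startd cron probe) is omitted here.
  PREEMPT = False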

If I took the time to write up a startd policy like the above, is it something you think you'd want to use?

regards,
Todd


The idea is just to change the reserved memory in such a way that the
available memory decreases, so that no other job with a big memory request
can start and crash the machine or a long-running job. The available memory
should not go to 0 if there is enough memory available, and the available
memory should just increase again when the job finishes.
Therefore a re-read of the reserved memory ClassAd on the startd machine,
without killing any job,
seems to be perfect, if possible.

We are working on checkpointing of our jobs, but for some of them it
does not seem possible.

Any ideas would be welcome.

Harald

On Monday 23 January 2017 16:21:00 Harald van Pee wrote:
Hi Jason,

yes, it's condor_qedit, not qalter. qalter works for PBS/Torque even
for a running job; condor_qedit just changes RequestMemory but does not
change any reservation for a running job.
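
For example (job id and value made up), something like

  condor_qedit 1234.0 RequestMemory 122880

does update RequestMemory in the job ClassAd in the queue, but the slot
the job is already running in keeps its original allocation.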

Harald

On Monday 23 January 2017 16:11:21 Jason Patton wrote:
Oh, I just noticed the disclaimer about *running* jobs. Not sure about
changing the ClassAd of running jobs.

Jason Patton

On Mon, Jan 23, 2017 at 9:09 AM, Jason Patton <jpatton@xxxxxxxxxxx> wrote:
Harald,

Yes! Check out condor_qedit:
http://research.cs.wisc.edu/htcondor/manual/v8.4/condor_qedit.html

Jason Patton

On Mon, Jan 23, 2017 at 9:04 AM, Harald van Pee <pee@xxxxxxxxxxxxxxxxx>

wrote:
Hi all,

is it possible to change the reserved memory for a running job?

The problem is, we have a cluster with very long running jobs (8 weeks
on average) in a vanilla universe. We never kill any job automatically.

Now it can happen that a user reserves 60GB for his job and finds out
after one week of running that it will need 120GB. Most often this will
be no problem because there is enough memory available.
But it would be a problem if another job starts and requests another
60GB. This we could avoid if at least the administrator could just
change the RequestMemory to 120GB.
With qalter this is possible for an idle job in the queue, but what can
I do for a running job?
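
For illustration (numbers made up), the submit file says something like

  request_memory = 60 GB

and after a week of running we would like the effective reservation on
the execute machine to become 120 GB, without restarting the job.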

Any suggestions?

We use condor 8.4.10.

Best regards
Harald


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685