
Re: [Condor-users] Dynamic memory for SMP



> I have added some rules to make Condor behave better on an SMP
> machine (8 slots of 1 core, 1G RAM):
> STARTD_JOB_EXPRS        = $(STARTD_JOB_EXPRS), ImageSize
> TotalMemoryUsed               = ( 0 + slot1_ImageSize +
> slot2_ImageSize + slot3_ImageSize + slot4_ImageSize + slot5_ImageSize
> + slot6_ImageSize + slot7_ImageSize + slot8_ImageSize )
> START = $(START) && TotalMemoryUsed < TotalMemory
> 
> We need this as sometimes we have jobs that need more than 1G 
> of RAM, and if we let them fill the computer, it will thrash too much.
> 
> What the rules do is that if the current jobs use more than 
> the TotalMemory of the computer, it won't start new jobs 
> even if slots are available. This limits the thrashing on the server.
> 
> But I have one problem: if one such job gets killed, it won't 
> restart, as the requirement "((Memory * 1024) >= ImageSize)" is 
> false. This requirement is not in the submit file, so I 
> suppose Condor adds it, as it does some others. What I would like 
> is to replace it with
> "(((TotalMemory-TotalMemoryUsed) * 1024) >= ImageSize)", so 
> that those jobs that are killed can be restarted.
> 
> Is there a way to do it?

Cool idea. Let us know how it works out in practice for you.

Condor needs a reference to TARGET.Memory to appear in the requirements
expression; otherwise it inserts its own rule. I define my own image size
estimate in my job submissions, called AlteraImageSize, which is static
and doesn't get updated by Condor at job runtime, and then simply
require that the target machine's memory be greater than or equal to that
value instead of ImageSize. The submit ticket looks like this:

+AlteraImageSize = 10000
requirements = (TARGET.Memory >= AlteraImageSize) 

I do the same for disk space requirements as well.
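
A minimal sketch of what the disk version could look like. The attribute
name AlteraDiskUsage and its value are made up for illustration; the value
is assumed to be in KB, the unit of the machine's Disk attribute. It also
assumes that condor_submit only appends its own disk rule when the
requirements expression does not already mention Disk:

```
# Static, user-supplied estimates that Condor does not update at runtime.
+AlteraImageSize = 10000
# Hypothetical attribute name; value assumed to be in KB.
+AlteraDiskUsage = 500000
# Referencing TARGET.Memory and TARGET.Disk here should keep condor_submit
# from adding its default ImageSize and DiskUsage clauses.
requirements = (TARGET.Memory >= AlteraImageSize) && (TARGET.Disk >= AlteraDiskUsage)
```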

This ensures that a preempted job, regardless of how long it's been
running, uses the same disk and memory estimates when re-negotiating and
not something extremely low (and completely false) because it had run for
only a small amount of time. In your case this expression might need to be
more complicated so that a job that gets booted (do they get booted?) while
using more than its original ImageSize estimate doesn't re-negotiate
with a lower-than-seen requirement.
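
One possible shape for such an expression, as an untested sketch: keep the
static estimate as a floor and also honour the measured ImageSize once it
has grown past that floor. This assumes AlteraImageSize is in MB (the unit
of Memory) and relies on ImageSize being in KB, hence the factor of 1024:

```
# Static user estimate, assumed to be in MB for this sketch.
+AlteraImageSize = 10000
# First clause: the user estimate always applies, however briefly the job ran.
# Second clause: the Condor-measured ImageSize (KB) also applies, so the
# effective requirement is whichever of the two is larger.
requirements = (TARGET.Memory >= AlteraImageSize) && ((TARGET.Memory * 1024) >= ImageSize)
```

With your rules, the second clause could instead compare against
(TARGET.TotalMemory - TARGET.TotalMemoryUsed) * 1024, assuming
TotalMemoryUsed is actually advertised in the machine ad. The first clause
still references TARGET.Memory, so Condor should not insert its own rule.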

> P.S. I know the rules I added have a problem. If the server 
> is empty and the user doesn't specify an ImageSize, we will 
> start 8 jobs. To have it work correctly when the server is 
> empty we must have the user estimate the ImageSize needed.

I personally think this is the correct way: user estimates stay and
should not be changed by Condor. You can then use the user estimate
together with the Condor-calculated value in ImageSize to make better
scheduling decisions should a job get vacated with an ImageSize that
possibly doesn't reflect a good upper-bound estimate for the job.

- Ian

