
Re: [Condor-users] Questions/Comments on dynamic slot for SMP computer



On 12/10/2009 12:28 PM, Frédéric Bastien wrote:
> Hi,
> 
> I only recently found out about dynamic slots in Condor. This is
> something I have wanted for some time, and I'm happy to find it. Also,
> the current limitation, that it has a high probability of starving jobs
> that ask for more resources, can easily be avoided if you make only a
> part of the pool partitionable. You can find more information at
> http://www.cs.wisc.edu/condor/manual/v7.4/3_13Setting_Up.html#SECTION004139900000000000000

Starvation isn't really a high-probability outcome, though it is always possible. Avoiding it depends on your workload and how well you know it.


> My questions first, then my comments:
> 1) After a job has run in a dynamic slot for 15 minutes, the SIZE column of
> condor_q gets updated to the size used by the job. I have this in my
> configuration file:
> STARTER_UPDATE_INTERVAL=60
> TOUCH_LOG_INTERVAL         = 60
> MASTER_UPDATE_INTERVAL     = 60
> UPDATE_INTERVAL            = 60
> SCHEDD_INTERVAL            = 60
> 
> What else do I need to set to have it updated more frequently, such as
> every minute or every 5 minutes?

You might need to set SHADOW_QUEUE_UPDATE_INTERVAL, which is discussed here:

http://spinningmatt.wordpress.com/2009/04/11/publishing-rates-in-a-condor-pool/


> 2) I sent a job to a dynamic slot. After 15 minutes, condor_q -l gives
> ImageSize_RAW = 216592
> ImageSize = 220000
> 
> but top gives 100M used, 105M virtual. Why is there such a big
> difference? In another case I have:
> ImageSize_RAW = 626192
> ImageSize = 700000
> 
> but top gives 500M used, 505M virtual. So there seems to be a difference
> of around 100M. Where could this come from? Can I do something about it?

This actually has nothing to do with partitionable or dynamic slots. It's just memory accounting in general.

It might be that your job has children that are also using memory. You can often find those children with pstree.
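
To make the "children also using memory" point concrete, here is a minimal Python sketch that adds up resident memory over a process and all of its descendants on Linux. It is only an illustration of the idea, not how Condor computes ImageSize. It assumes a Linux /proc filesystem, and the function names (read_status, snapshot, tree_rss_kib) are made up for this example. Summing VmRSS counts shared pages once per process, so treat the result as an upper bound.

```python
import os
from collections import defaultdict


def read_status(pid, proc_root="/proc"):
    """Return (ppid, rss_kib) for one process, or None if it cannot be read."""
    fields = {}
    try:
        with open(os.path.join(proc_root, str(pid), "status")) as fh:
            for line in fh:
                key, _, value = line.partition(":")
                fields[key] = value.strip()
    except OSError:
        return None  # process exited or is not readable
    if "PPid" not in fields:
        return None
    ppid = int(fields["PPid"])
    # Kernel threads have no VmRSS line; count them as 0 KiB.
    rss_kib = int(fields.get("VmRSS", "0 kB").split()[0])
    return ppid, rss_kib


def snapshot(proc_root="/proc"):
    """Build {pid: (ppid, rss_kib)} for every process visible in /proc."""
    table = {}
    for name in os.listdir(proc_root):
        if name.isdigit():
            info = read_status(int(name), proc_root)
            if info is not None:
                table[int(name)] = info
    return table


def tree_rss_kib(root_pid, table):
    """Sum rss_kib over root_pid and all of its descendants in table."""
    children = defaultdict(list)
    for pid, (ppid, _) in table.items():
        children[ppid].append(pid)
    total, stack, seen = 0, [root_pid], set()
    while stack:
        pid = stack.pop()
        if pid in seen or pid not in table:
            continue
        seen.add(pid)
        total += table[pid][1]
        stack.extend(children[pid])
    return total
```

On the execute node, calling tree_rss_kib(<pid of the job>, snapshot()) and comparing the result with what top shows for the parent alone indicates how much the children contribute.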


> Here are a few additions to make the doc at the given link more useful.
> All of this can be found by experimentation, but that takes time, and it
> would save time for other people who would like to use it too.
> 1) Give the unit of request_memory (megabytes)

http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1050


> 2) Describe DynamicSlot and PartitionableSlot. This can link to their
> definitions elsewhere.

http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1051


> 3) Say that request_memory won't affect non-partitionable slots (so we
> need to put it in requirements too). Could this be done automatically, so
> that we only need to set request_* and not change the requirements?

Is your expectation that, if you specify request_memory, your job should never run on a slot with less than that amount of memory, be it partitionable or not?


> 4) What if request_cpus is not set? Does it default to 1?

Yes, defaults to 1.

http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1053


> 5) Say that if a job uses more memory than was requested, only the
> requested amount is removed from the partitionable slot. It would be
> better to remove the max of the two, since users and bugs can cause
> swapping; this would make it less troublesome, as a swapping compute host
> won't start new jobs.

So the memory given to a dynamic slot does not change after its creation, as you have noticed. If your job is going over its request and you want to kick the job into a larger slot, you can use Startd policy to kick the job, together with a request_memory that is smart enough to notice that the ImageSize has grown larger than the original request, e.g. request_memory = max(<your initial request>, ImageSize/1024.0). The default request_memory expression actually does something similar.
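
As a worked example of that expression, the Python sketch below evaluates max(initial request, ImageSize/1024.0) using the ImageSize_RAW values quoted earlier in this thread. ImageSize is reported in KiB and request_memory is in megabytes, hence the division by 1024. The function name and the initial requests of 256 and 500 are hypothetical values chosen only for illustration.

```python
def effective_request_mb(initial_request_mb, image_size_kib):
    """Mirror max(<initial request>, ImageSize/1024.0) from the text.

    initial_request_mb: what the user originally asked for, in MB.
    image_size_kib: the job's ImageSize attribute, in KiB.
    """
    return max(initial_request_mb, image_size_kib / 1024.0)


# The job stays under a (hypothetical) 256 MB request: the request is unchanged.
small = effective_request_mb(256, 216592)

# The job grew past a (hypothetical) 500 MB request: the next match asks
# for roughly 611.5 MB instead.
grown = effective_request_mb(500, 626192)

print(small, grown)
```

The effect is that a job kicked for exceeding its slot re-enters the queue asking for at least as much memory as it was last seen using.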


> Thanks for all your work for this feature. I still have to upgrade my
> main condor pool to be able to use it. But I should do this shortly.

Glad you like it. Thank you for the feedback.


Best,


matt