
Re: [Condor-users] Questions/Comments on dynamic slot for SMP computer



Hi,

On Fri, Dec 11, 2009 at 6:57 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
> What else do I need to have it updated more frequently, such as every
> minute or every 5 minutes?

You might need to set SHADOW_QUEUE_UPDATE_INTERVAL, discussed here:

http://spinningmatt.wordpress.com/2009/04/11/publishing-rates-in-a-condor-pool/
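
For example, something along these lines in the condor_config should do it (if I remember right the value is in seconds, so 300 here would be roughly 5 minutes; adjust to taste):

SHADOW_QUEUE_UPDATE_INTERVAL = 300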

Thanks, it works. I subscribed to your blog too.
 


> 2) I send a job to a dynamic slot. After 15 minutes, condor_q -l gives
> ImageSize_RAW = 216592
> ImageSize = 220000
>
> but top gives 100M used, 105M virtual. Why is there such a big
> difference? In another case I have:
> ImageSize_RAW = 626192
> ImageSize = 700000
>
> but top gives 500M used, 505M virtual. So there seems to be around a ~100M
> difference. Where could this come from? Can I do something about this?

This actually has nothing to do with partitionable or dynamic slots. It's just memory accounting in general.

It might be that your job has children that are also using memory; you can often find those children with pstree.

This is the output from pstree:

     |               `-condor_startd-+-condor_procd
     |                               `-condor_starter---condor_exec.exe---launch.sh2.sh---memhog
 
My processes are launch.sh2.sh and memhog. launch.sh2.sh is very small and memhog takes the amount it was told to (105M or 505M). The additional ~100M seems to come from condor_starter and condor_exec.exe (maybe also condor_startd). Personally, I think we should not count them, as together they take up space that is mostly in the virtual memory space but not needed in RAM. Just my personal view. Thanks for the explanation.


> 3) It tells that request_memory won't affect non-partitionable slots (so we
> need to put it in the requirements too). Could this be done automatically,
> so that we only need to set request_* and not change the requirements?

Your expectation is that if you specify request_memory, your job should never run on a slot with less than that amount of memory, be it partitionable or not?

Yes. As a workaround to my interpretation, I can put my requirement in both request_memory and Requirements to make it work. I think it would be better to change the behavior, or to mention it in the docs, so that people won't be surprised. For reference, the workaround looks roughly like the snippet below.
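
Something like this in the submit file (the 500 MB is just what this particular example asks for):

request_memory = 500
Requirements   = (Memory >= 500) && ...the rest of my requirements...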
 

> 5) It tells that if a job uses more memory than what was requested, we will
> only remove the amount requested from the partitionable slot. It would be
> better to remove the max of the two, as users and bugs can cause swapping,
> and this would make it less troublesome since a swapping compute host
> won't start new jobs.

So the memory given to a dynamic slot does not change after its creation, as you have noticed. If your job is going over its request and you want to kick the job onto a larger slot, you can use startd policy to kick the job, plus a request_memory that is smart enough to notice that the ImageSize has grown larger than the original request, e.g. request_memory = max(<your initial request>, ImageSize/1024.0). The default request_memory expression actually does something similar.
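
Just to sketch the idea (the numbers and the exact expressions below are only an illustration, not a tested policy): in the submit file something like

request_memory = max(500, ImageSize/1024.0)

and on the execute machines a startd policy along the lines of

# kick the job if its image has grown past the memory of the slot it is in
PREEMPT = ($(PREEMPT)) || (ImageSize/1024 > Memory)

ImageSize is in KB and Memory in MB, hence the /1024.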

I don't like killing jobs when it is not needed. In my ideal world, if there are resources available in the partitionable slot, we would simply give more resources to the current dynamic slot. In the case of a memory leak in a library that we use (it has happened to us many times), having a job killed after a few days is not fun. So we don't automatically kill jobs. We checkpoint jobs ourselves sometimes, but we don't do it often (1-2 times a day) so as not to overload the file server. So losing on average 6 or 12 hours for hundreds of jobs is not fun when we have a deadline. But I will try the max in case the job gets killed for some other reason. Thanks for the comments.

One other thing I found is probably a bug somewhere and not an undocumented feature. For requesting multiple CPUs, as for memory, I must do something to make a submit file work with both partitionable slots and normal slots. So I do something like:

Requirements = ... && (target.Cpus >= 2) && (Memory >= 500) && ...
request_cpus = 2
request_memory = max(500, ImageSize/1024.0)

That works for the memory part (I have not tried the max yet), but for the CPU part it doesn't work. The reason is that in the partitionable slot, I always end up with Cpus=1. Here are some configs I tried:

NUM_SLOTS=1
NUM_SLOTS_TYPE_1  =  1
SLOT_TYPE_1_PARTITIONABLE = True

or

SLOTS_TYPE_1  =  100%
NUM_SLOTS_TYPE_1  =  1
SLOT_TYPE_1_PARTITIONABLE = True

or

SLOTS_TYPE_1  =  cpu=$(TotalCpus),100%
NUM_SLOTS_TYPE_1  =  1
SLOT_TYPE_1_PARTITIONABLE = True

or

SLOTS_TYPE_1  =  cpus=4,100%
NUM_SLOTS_TYPE_1  =  1
SLOT_TYPE_1_PARTITIONABLE = True

to have one partitionable slot with 100% of each resource. condor_status -l | grep -i cpus always gives
...
Cpus = 1
TotalCpus = 4
...

Do you know what I can do about this?

Also, I think one common case will be to make a full machine partitionable. Why not add a config like

ALL_PARTITIONABLE = True

to do that? It is in the same spirit as NUM_SLOTS, which makes some common configurations easy.
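
In my head it would just be a shorthand for the kind of config I tried above, something like (if I have the macro name right):

SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = True
NUM_SLOTS_TYPE_1 = 1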


Thanks for your answers and your time.

Frédéric Bastien