
Re: [Condor-users] Condor and GPUs



On Thu, Jul 3, 2008 at 5:32 AM, Steffen Grunewald
<steffen.grunewald@xxxxxxxxxx> wrote:
> On Wed, Jul 02, 2008 at 10:33:46AM -0400, Frédéric Bastien wrote:
...

>> I had trouble doing this. The problem is that slotX_ImageSize is not
>> an up-to-date copy of the value in each slot. In one case, when I ran
>> "condor_status -l hostname | grep ImageSize", I got:
>>
>> [for slot1]
>> ImageSize = 1588508
>> MonitorSelfImageSize = 30104.000000
>> slot1_ImageSize = 1588508
>> slot2_ImageSize = 1588508
>> [for slot2]
>> ImageSize = 29500
>> MonitorSelfImageSize = 30104.000000
>> slot1_ImageSize = 0
>> slot2_ImageSize = 29500
>>
>> If condor_status returns the same values as the ones seen during
>> negotiation, this will not work correctly, since the slots don't all
>> see the same version. If I wait 5 minutes, the values are correct.
>
> Yes, it takes another negotiation cycle for the master to notice. If
> more jobs are matched before this cycle ends, your changed rules won't
> be followed.
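
The 5-minute lag described above matches the startd's default update interval: cross-advertised slot attributes are only republished to the collector every UPDATE_INTERVAL seconds, 300 by default. A minimal configuration sketch (the knob names are from the Condor manual; the values here are illustrative, not a recommendation):

```
# Cross-publish each slot's ImageSize into every other slot's ad,
# producing the slot1_ImageSize, slot2_ImageSize, ... attributes.
STARTD_SLOT_ATTRS = ImageSize

# The startd only sends fresh ads to the collector this often
# (in seconds); the default of 300 explains the ~5-minute staleness.
UPDATE_INTERVAL = 300
```

Lowering UPDATE_INTERVAL narrows the window of stale values but increases collector traffic, and still does not guarantee that matches made between updates see current data.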

I'm not sure I understand. When the master makes a match, does it have
the correct value? If so, we could use it even if condor_status doesn't
show the correct value yet. I think Condor can match some jobs between
negotiation cycles; when a job finishes, it matches a new one right
away. But I don't know whether it uses the correct information at that
point. Does anyone know?

>
>> In an ideal world, each SMP machine would have a pool of available
>> resources. When a job has all the resources it needs, a slot is
>> created and the job executes there. One way of doing this without too
>> many modifications to Condor is to generate the slots in advance
>> (e.g. one per CPU) and allow execution in a slot only when the
>> resource pool has what the job needs.
>
> Exactly. Pre-defined static slots have to be replaced by something that's
> aware of the whole picture (sees the *machine* that hosts the slots which
> can be dynamically reconfigured, or even created on demand).
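
For reference, the mechanism described above is roughly what later HTCondor releases implemented as "partitionable slots": one slot advertises the whole machine, and child slots are carved out on demand to fit each matched job. A hedged sketch, using syntax from those later releases (not available at the time of this thread):

```
# One partitionable slot that owns all the machine's resources;
# dynamic child slots are created on demand as jobs match.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%, memory=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
```

On the submit side, jobs then declare their needs explicitly (e.g. request_cpus and request_memory in the submit file), and the slot is sized to those requests.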
>
>> These are the two current limitations that I see. The first one is
>> the priority for me. Or maybe you could add variables like
>> pool_TotalMemoryUsed that use the up-to-date values, or you could
>> hard-code TotalMemoryUsed. If you hard-code it, that won't solve the
>> issue with custom resources.
>
> Speaking of resources:
> Since I have found that users rarely use the memory requirements they
> advertised in their submit file, I have added about 20% of the available
> swap space to the "real memory" to allow for memory overcommit. Up to now
> this has proven safe enough (we don't have forced job termination in
> place); not all virtual memory of the applications run on our pool would
> be accessed all the time. We've seen resident/virtual ratios of up to
> 80%, often a lot smaller.
> I'd like not to lose this opportunity (negative RESERVED_MEMORY)...
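
A sketch of the overcommit trick described above: RESERVED_MEMORY is a real knob whose value is subtracted from the detected physical memory before advertising, so a negative value inflates the advertised total. The numbers here are illustrative:

```
# With e.g. 10 GB of swap, advertise an extra 20% of it (~2 GB) as
# "real memory" by reserving a negative amount. Risky if jobs actually
# touch all their virtual memory at once; see the caveats above.
RESERVED_MEMORY = -2048
```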
>
> Of course, with dynamic slot creation, another problem comes along:
> If a machine is already partially taken, how to define a ranking among
> machines to allow for maximum flexibility in the future?
> Imagine a pool consisting of 100 2-core machines, for simplicity.
> If user A submits 100 jobs, each requiring half the total RAM of a
> machine (as in "old style" default slots), and we match them against the
> first slot of each machine, then user B, who submits some "big" jobs
> (taking almost all the RAM), wouldn't get matched, and the pool would
> run at 50% efficiency.
> This is not as bad as it sounds since currently the only way to get B's
> jobs run without risking (total) memory overcommit would be to set aside
> a number of 1-slot machines (wasting a CPU core). (Otherwise user B
> would have to lie about her memory requirements... and risk heavy swapping.)
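
One place to express such a machine-packing policy is the negotiator's ranking of candidate slots. NEGOTIATOR_PRE_JOB_RANK is a real configuration knob; the expression below is only one possible best-fit heuristic, not a tested policy:

```
# Best-fit packing sketch: among slots that already satisfy the job's
# requirements, prefer the one with the least memory to spare, so
# large-memory slots stay free for user B's "big" jobs.
NEGOTIATOR_PRE_JOB_RANK = 0 - Memory
```

With user A's half-RAM jobs in the example above, this would tend to pair them two per machine rather than spreading one per machine, leaving whole machines free for big jobs.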

What happens is that I see two steps in implementing good handling of
SMP machines: 1) allow them to be used correctly, 2) (as you say)
optimize their usage. I think that once step 1 is done, we will be able
to use the current rule-making method for step 2. At least initially we
could use it with not-too-bad results.


Frédéric Bastien