
Re: [HTCondor-users] Tracking available memory on a compute host



Hi all,

maybe a bit blunt, but would it be possible to update the memory limits of
Condor's main cgroup slice, e.g.
  for f in memory.limit_in_bytes memory.memsw.limit_in_bytes; do
    echo "${NEWLIMIT}" > "/sys/fs/cgroup/memory/system.slice/condor.service/${f}"
  done
(this would happen outside of the Condor context, and I have no idea how/if
Condor would take note of changed cgroup values...??)

Cheers,
  Thomas


On 2018-01-31 22:37, Steve Huston wrote:
> On Wed, Jan 31, 2018 at 12:43 PM, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
>> What you can't do is tell HTCondor that it can have all of the memory and also let some other scheduler use all the memory
>> and expect HTCondor to dynamically adjust its allocations to account for non-HTCondor memory usage.
> 
> I thought that was the whole point?
> 
> "All machines in the HTCondor pool advertise their resource
> properties, both static and dynamic, such as *available RAM memory*,
> CPU type, CPU speed, virtual memory size, physical location, and
> current load average, in a resource offer ad." --
> http://research.cs.wisc.edu/htcondor/manual/current/1_2HTCondor_s_Power.html
> (emphasis mine)
> 
> Of course I could restrict the memory allowed for Condor, and I could
> probably with the right settings restrict the available memory for
> console (owner) usage to something so that Condor jobs always have
> resources.  But just like a CPU core can be used by the owner and then
> free for Condor usage later, I would think RAM should be as well.
> 
> On Wed, Jan 31, 2018 at 3:29 PM, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
>> You could probably do something using a startd cron script to push a value
>> into the slot ads that represents the amount of non-HTCondor memory usage,
>> and then have the START expression refer to that value in order to prevent
>> matches.   There will be some delay between when the startd sees the updated
>> value for non-HTCondor usage and when the Negotiator and Schedd see that
>> value, so you will still probably get some jobs starting that then just get
>> OOM killed a little while later, but it won't *keep* happening.
> 
> I suppose that's the route I'll have to take, if this becomes
> problematic enough.  So far it hasn't before, and I've been running a
> Condor scheduler here for just over 13 years, so it might not be worth
> the hassle.  I was just confused that it didn't already exist, and
> figured I was overlooking something simple.
> 
> Thanks all.
> 
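
For reference, the startd cron approach John describes above could be sketched
roughly as below. This is only a minimal sketch, not something from the thread:
the attribute name NonCondorMemoryMB, the function names, and the use of the
cgroup v1 file memory.usage_in_bytes under the condor.service path mentioned
earlier are all assumptions. The estimate is also rough, since cgroup usage
includes page cache while MemTotal - MemAvailable does not. The script prints
one "Attribute = value" line on stdout, which is the form a startd cron job is
expected to emit; it would be registered through the STARTD_CRON_* configuration
knobs.

```python
#!/usr/bin/env python3
"""Hypothetical startd cron probe: estimate memory used outside HTCondor."""

MEMINFO = "/proc/meminfo"
# Assumed cgroup v1 location, based on the path quoted earlier in this thread.
CONDOR_USAGE = ("/sys/fs/cgroup/memory/system.slice/condor.service/"
                "memory.usage_in_bytes")


def parse_meminfo(text):
    """Return a dict mapping /proc/meminfo field names to integer values."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def non_condor_mb(meminfo_text, condor_usage_bytes):
    """Rough estimate (MB) of memory in use by everything except HTCondor.

    Returns None if the required meminfo fields are missing.
    """
    fields = parse_meminfo(meminfo_text)
    if "MemTotal" not in fields or "MemAvailable" not in fields:
        return None
    used_kb = fields["MemTotal"] - fields["MemAvailable"]
    other_kb = used_kb - condor_usage_bytes // 1024
    # Clamp at zero: the two measurements are not taken atomically.
    return max(other_kb, 0) // 1024


def read_text(path, default=""):
    """Read a small text file, returning `default` if it cannot be read."""
    try:
        with open(path) as fh:
            return fh.read()
    except OSError:
        return default


def main():
    usage = read_text(CONDOR_USAGE, "0").strip()
    usage_bytes = int(usage) if usage.isdigit() else 0
    value = non_condor_mb(read_text(MEMINFO), usage_bytes)
    if value is not None:
        # One ClassAd attribute per line on stdout.
        print("NonCondorMemoryMB = %d" % value)


if __name__ == "__main__":
    main()
```

The START expression could then compare NonCondorMemoryMB against a
site-specific threshold. As John notes, the value only propagates with some
delay, so this reduces repeated bad matches rather than preventing every one.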
