
Re: [HTCondor-users] Tracking available memory on a compute host



So, perhaps the problem here is that my other hat is HPC sysadmin for
a number of Slurm-scheduled clusters.  Yes, I know there are things
one could do in the kernel, but the point is that if a host with 64GB
of RAM has 62GB used, there doesn't appear to be a way for Condor to
say "I can't fit a 4GB job, so a job asking for 4GB shouldn't be
scheduled here."  Thus the job goes there, allocates its memory, at
some point gets killed because Condor processes sit lower on
oom_killer's priority list, and then another 4GB job goes there;
wash, rinse, repeat.  I literally had this happen on a machine with
16GB of memory where the local user had used about 15GB with one or
two cores, leaving at least two cores free.  Two 2GB jobs flocked
there, started running, got about 1.5GB allocated before oom_killer
axed them, and then two more jobs flocked there a minute later.

If the answer is no, that there's no way for Condor to include in its
machine ClassAd how much memory is currently free on a host, then
that's all there is and I'll tell the users as much.  But I've got a
couple of them rightfully wondering why a scheduler would let a job
run on a host that lacks enough free memory at the moment, only to
have it get killed.  I thought "TotalVirtualMemory" would be the
answer, but it seems to track (available_swap + total_RAM) rather
than (available_swap + available_RAM), and from my look through the
ClassAds and documentation I don't see any attribute close to
"available memory" on a slot that I could have the negotiator check
when matching jobs to slots.
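
One workaround I've seen sketched (untested here, so treat it as an
assumption): have a STARTD_CRON job publish the machine's currently
available RAM as a custom ClassAd attribute, and gate START on it.
STARTD_CRON itself is standard HTCondor; the attribute name
FreeMemoryMB, the script, and its path are my own invention.

```shell
#!/bin/sh
# Hypothetical STARTD_CRON script (name and path invented): publish the
# host's currently available RAM as a custom machine ClassAd attribute,
# FreeMemoryMB.  Reads /proc/meminfo by default; an alternate file can
# be passed as $1 for testing.
meminfo="${1:-/proc/meminfo}"

# MemAvailable is reported in kB; convert to whole MB for ClassAd use.
free_mb=$(awk '/^MemAvailable:/ { printf "%d", $2 / 1024 }' "$meminfo")

echo "FreeMemoryMB = $free_mb"
echo "-"   # startd cron end-of-ad marker
```

The matching condor_config wiring would look roughly like this (again,
the job name and path are illustrative):

  STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) FREEMEM
  STARTD_CRON_FREEMEM_EXECUTABLE = /usr/local/libexec/freemem.sh
  STARTD_CRON_FREEMEM_PERIOD = 60s
  START = $(START) && (TARGET.RequestMemory <= My.FreeMemoryMB)

Caveat: this only helps at match time; memory can still disappear
between the match and the job actually starting.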

On Tue, Jan 30, 2018 at 3:11 PM, Dimitri Maziuk <dmaziuk@xxxxxxxxxxxxx> wrote:
> On 01/30/2018 10:58 AM, Steve Huston wrote:
>> Is there no way to have condor daemons monitor the actual available
>> memory on a host and allow classads to be matched against it to ensure
>> jobs don't flock to a host without enough free RAM?
>
> The deferred allocation model has traditionally been one of unix's big
> wins. If you don't like it, the linux kernel lets you set the allocation
> model to immediate, and then if a process requests more RAM than is
> available now, the kernel won't start it.
>
> Of course this completely ignores the issue of swapping/thrashing, and
> the kernel's inability to always track the memory correctly, and it only
> works if every process tells the truth about its memory requirements up
> front, but you can do it. There's no need to involve condor daemons.
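>
> [Editorial note: the "immediate" model above corresponds to Linux's
> strict overcommit accounting; the sysctls are real, the ratio value
> below is only an example, not a recommendation.]

```shell
# Linux strict overcommit: commit limit = swap + overcommit_ratio% of RAM.
# With mode 2, an allocation that would exceed the limit fails up front
# (malloc returns NULL) instead of succeeding and later drawing oom_killer.
sysctl vm.overcommit_memory=2
sysctl vm.overcommit_ratio=80   # example value, site-specific
```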
>
> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
>
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/



-- 
Steve Huston - W2SRH - Unix Sysadmin, PICSciE/CSES & Astrophysical Sci
  Princeton University  |    ICBM Address: 40.346344   -74.652242
    345 Lewis Library   |"On my ship, the Rocinante, wheeling through
  Princeton, NJ   08544 | the galaxies; headed for the heart of Cygnus,
    (267) 793-0852      | headlong into mystery."  -Rush, 'Cygnus X-1'