
Re: [HTCondor-users] Tracking available memory on a compute host



You could probably do something using a startd cron script to push a value into the slot ads that represents the amount of non-HTCondor memory usage, and then have the START expression refer to that value in order to prevent matches. There will be some delay between when the startd sees the updated value for non-HTCondor usage and when the Negotiator and Schedd see that value, so you will probably still get some jobs that start and then get OOM-killed a little while later, but it won't *keep* happening.
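
A minimal sketch of such a startd cron script, assuming a Linux host with /proc/meminfo. For simplicity it publishes the memory the kernel currently reports as available rather than computing non-HTCondor usage directly. The attribute name HostMemAvailableMB, the job name MEMAVAIL, the script path, and the START clause in the comments are all illustrative assumptions, not something taken from this thread.

```python
#!/usr/bin/env python3
"""Startd cron sketch: publish the host's currently available memory.

Assumed configuration on the execute node (names and path are hypothetical):

    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) MEMAVAIL
    STARTD_CRON_MEMAVAIL_EXECUTABLE = /usr/local/libexec/mem_avail.py
    STARTD_CRON_MEMAVAIL_PERIOD = 60s
    START = ($(START)) && (TARGET.RequestMemory <= MY.HostMemAvailableMB)
"""
import sys

ATTR_NAME = "HostMemAvailableMB"  # hypothetical slot attribute name


def parse_meminfo(text):
    """Return {field: value in kB} parsed from /proc/meminfo-style text."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_mb(fields):
    """Estimate available memory in MB from parsed meminfo fields."""
    kb = fields.get("MemAvailable")
    if kb is None:
        # Older kernels lack MemAvailable; this fallback is only a rough guess.
        kb = fields.get("MemFree", 0) + fields.get("Cached", 0)
    return kb // 1024


def main():
    try:
        with open("/proc/meminfo") as f:
            fields = parse_meminfo(f.read())
    except OSError as err:
        # Publish nothing if the file cannot be read.
        print(f"cannot read /proc/meminfo: {err}", file=sys.stderr)
        return
    # The startd reads "Attribute = value" lines from stdout.
    print(f"{ATTR_NAME} = {available_mb(fields)}")


if __name__ == "__main__":
    main()
```

Because the value is only refreshed once per period and then has to propagate to the collector, the delay described above still applies; a shorter period narrows the window but does not close it.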

 

-tj

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Ivo
Sent: Wednesday, January 31, 2018 1:46 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Tracking available memory on a compute host

 

Well, I'll give it a try. Maybe you can do some magic using the START expression on nodes?

I fiddled with it some time ago, to make sure some nodes never got a job if they had less than some amount of RAM (free memory could change dynamically). I know it's a different story, but I think it can be adapted to suit your needs.

I'm not at work now, but I'll check it tomorrow, maybe it helps.

Ivo

 

On Wed, Jan 31, 2018 at 15:43, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

It's fairly trivial to set up HTCondor on a node and tell it that it has access to only a fraction of the memory on the machine.

You absolutely can configure HTCondor to only run 4GB jobs on a 64 GB machine; just add this line to your configuration:

MEMORY = 4096

Or, for a 64 GB machine, you can do this instead:

RESERVED_MEMORY = 60000

Of course you have to change your configuration *before you start HTCondor* for this to have any effect.
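
As a rough illustration of how the two settings above relate, here is a simplified model, not HTCondor's actual code. It assumes values are in MB, that MEMORY overrides the detected amount, and that RESERVED_MEMORY is subtracted from whichever base applies. The function name is hypothetical.

```python
def advertised_memory_mb(detected_mb, memory=None, reserved_memory=0):
    """Simplified model of the memory the startd would have to hand out.

    detected_mb:     physical memory found on the machine
    memory:          value of MEMORY, if set (overrides detection)
    reserved_memory: value of RESERVED_MEMORY (held back from the total)
    """
    base = detected_mb if memory is None else memory
    return max(base - reserved_memory, 0)


if __name__ == "__main__":
    detected = 64 * 1024  # a 64 GB machine
    print(advertised_memory_mb(detected, memory=4096))
    print(advertised_memory_mb(detected, reserved_memory=60000))
```

Under this model the two examples are not identical: MEMORY = 4096 leaves exactly 4096 MB, while RESERVED_MEMORY = 60000 on a machine with 65536 MB leaves 5536 MB.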

What you can't do is tell HTCondor that it can have all of the memory and also let some other scheduler use all the memory
and expect HTCondor to dynamically adjust its allocations to account for non-HTCondor memory usage.

The key word here is *dynamic*

You will have to change the HTCondor configuration and then restart the STARTD in order for HTCondor to notice that it is
now managing less memory.

-tj

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Steve Huston
Sent: Wednesday, January 31, 2018 9:37 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Tracking available memory on a compute host

So, perhaps the problem here is that my other hat is HPC sysadmin for
a number of Slurm scheduled clusters.  Yes, I know that there's things
one could do in the kernel, but the point is that if a host with 64GB
of RAM has 62GB used, there doesn't appear to be a way for Condor to
say "I can't fit a 4GB job, so that job asking for 4GB shouldn't be
scheduled here."  Thus, the job goes there, allocates its memory, at
some point gets killed because Condor processes have a lower priority
in oom_killer's list, and then another 4GB job goes there, wash rinse
repeat.  I literally had this happen on a machine with 16GB of memory
where the local user had used about 15GB with one or two cores,
leaving at least two cores free.  Two 2GB jobs flocked there, started
running, got about 1.5GB allocated before oom_killer axed them, and
then two more jobs flocked there a minute later.

If the answer is no, that there's no way for Condor to make part of
its host classad how much free memory is currently available on a
machine, then that's all there is and I tell the users such.  But I've
got a couple rightfully wondering why a scheduler would let a job run
on a host without enough free memory at the moment only to have it get
killed.  I thought "TotalVirtualMemory" would be the answer but it
seems to only track (available_swap + total_RAM) and not
(available_swap + available_RAM) and from my looks through the
classads and documentation I don't see any value which is near to
available memory on a slot that I could have the negotiator check when
matching jobs to slots.

On Tue, Jan 30, 2018 at 3:11 PM, Dimitri Maziuk <dmaziuk@xxxxxxxxxxxxx> wrote:
> On 01/30/2018 10:58 AM, Steve Huston wrote:
>> Is there no way to have condor daemons monitor the actual available
>> memory on a host and allow classads to be matched against it to ensure
>> jobs don't flock to a host without enough free RAM?
>
> The deferred allocation model has traditionally been one of Unix's big
> wins. If you don't like it, the Linux kernel lets you set the allocation
> model to immediate, and then if a process requests more RAM than is
> available now, the kernel won't start it.
>
> Of course this completely ignores the issue of swapping/thrashing, and
> the kernel's inability to always track the memory correctly, and it only
> works if every process tells the truth about its memory requirements up
> front, but you can do it. There's no need to involve condor daemons.
>
> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
>
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
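
For anyone who wants to check which allocation model a host is using, the setting Dimitri describes in the quoted message is exposed on Linux as the vm.overcommit_memory sysctl. The sketch below only reads and describes it; the function names are hypothetical. Note that in strict mode it is the allocation request that fails, so a process can start and still be refused memory later.

```python
from pathlib import Path

OVERCOMMIT_PATH = Path("/proc/sys/vm/overcommit_memory")

# Meanings of the three values the Linux kernel defines for this sysctl.
MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit",
    2: "strict accounting: allocations beyond the commit limit fail",
}


def describe_mode(value):
    """Return a short description of a vm.overcommit_memory value."""
    return MODES.get(value, f"unknown mode {value}")


def current_mode(path=OVERCOMMIT_PATH):
    """Read the current mode, or return None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    mode = current_mode()
    if mode is None:
        print(f"could not read {OVERCOMMIT_PATH}")
    else:
        print(f"vm.overcommit_memory = {mode}: {describe_mode(mode)}")
```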



--
Steve Huston - W2SRH - Unix Sysadmin, PICSciE/CSES & Astrophysical Sci
  Princeton University  |    ICBM Address: 40.346344   -74.652242
    345 Lewis Library   |"On my ship, the Rocinante, wheeling through
  Princeton, NJ   08544 | the galaxies; headed for the heart of Cygnus,
    (267) 793-0852      | headlong into mystery."  -Rush, 'Cygnus X-1'