
Re: [Condor-users] Condor and processor affinity



Matt,

Is there some way for Condor to detect that a process is going to use a
job object itself? My understanding is that if Condor spawns a job in a
job object, then that job cannot use a job object for something it spawns.

The multiple code paths would have to be in Condor, and it would either
have to not use job objects and provide a base set of features or use
job objects and reject any jobs that also try to use job objects.

Any idea how many of your programs Condor wouldn't be able to run
because they use job objects?
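
On the detection question above, a partial answer may be the Win32 call `IsProcessInJob`, which reports whether a process is already inside a job object. It cannot predict whether a job will later try to create one of its own, but a job could use it to notice that it has been placed in a job and skip its own job-object setup. The following is a minimal sketch using Python's ctypes; the function name `current_process_in_job` is made up for illustration, and on non-Windows systems it simply returns None.

```python
import ctypes
import sys


def current_process_in_job():
    """Return True/False on Windows, or None where job objects do not exist."""
    if sys.platform != "win32":
        return None  # job objects are a Windows-only concept

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentProcess.restype = ctypes.c_void_p
    kernel32.IsProcessInJob.argtypes = [
        ctypes.c_void_p,               # ProcessHandle
        ctypes.c_void_p,               # JobHandle; NULL means "any job"
        ctypes.POINTER(ctypes.c_int),  # Result (BOOL*)
    ]
    kernel32.IsProcessInJob.restype = ctypes.c_int

    result = ctypes.c_int(0)
    ok = kernel32.IsProcessInJob(
        kernel32.GetCurrentProcess(), None, ctypes.byref(result)
    )
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
    return bool(result.value)


if __name__ == "__main__":
    print("already in a job object:", current_process_in_job())
```

A job that gets True back could fall back to whatever process control it has without job objects, rather than failing when it tries to attach itself to a second job.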

Best,


matt

Matt Hope wrote:
> Sorry for the late reply; I have been off list for a while.
> 
> I suspect the number of people using job objects on Windows is
> relatively low, and any who are will be sufficiently savvy to spot
> issues with Condor conflicting with theirs.
> 
> Having Condor default to using job objects on Windows to control
> everything (optionally enforcing the memory and CPU limiting) would
> make the common code path much simpler (and more robust). You get:
> 
> * The hassle of spotting/counting multiple processes is completely gone
> * Free killing of everything associated with the slot
> * Memory/CPU counting for total effort
> * PriorityClass/SchedulingClass handles the renice
> * Memory can be limited up front rather than the job being killed
>   after exceeding it
> * CPU affinity
> * A fork bomb style problem can be prevented
> 
> However because some people may want total control over the job
> objects you will end up needing two code paths, reducing the
> simplicity considerably (and making testing much harder).
> 
> I'd argue that it is worth doing if there are few/no users making use
> of job objects currently who would not have all their needs served by
> simply using the integrated condor functionality.
> 
> One option is to eliminate the need for two code paths by making the
> job object path the only supported way to achieve significant control
> and have the other route rely on simple mechanisms like the per slot
> user to function. Thus people willing to take control themselves on a
> per job basis can do as they wish without significantly impacting the
> code complexity.
> 
> It's a balancing act between your users and your developers. From my
> point of view, having it optionally integrated into Condor would be
> excellent, since I could let the configuration of slot-based CPU
> affinity and memory be centrally controlled rather than each job
> having to 'work it out on the fly' as it currently does.
> 
> Also it would allow the possibility of things like:
> 
> Dynamic controls whereby a minimum number of cores is always
> available but, if the machine as a whole is partly unused, those cores
> can be dynamically assigned to the existing slots and then taken away
> again.
> 
> Ganging together multiple slots so that you could take ownership of
> (say) 2 and get access to all their resources, with an eviction of one
> allowing either a reduction in your resources or an eviction from all
> of them.
> 
> In relation to the above, using NUMA-specific information to group
> slots/cores intelligently, so ganged jobs would be cleanly migrated
> onto a shared node where possible (moving the other jobs about as
> needed to achieve this).
> 
> IO-related aspects are not currently present but are reserved for
> future use; this could be a route to IO throttling or disk usage
> limits (very speculative).
> 
> All that said, if you have existing users using jobs in alternate
> ways, you may just want to accept the simple "use what already exists
> and works" route.
> 
> Matt
> 
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Matthew Farrellee
> Sent: 13 July 2009 14:48
> To: Ben Burnett
> Cc: Condor-Users Mail List
> Subject: Re: [Condor-users] Condor and processor affinity
> 
> (added condor-users back, it had been dropped)
> 
> This is all good to know. Maybe Matt Hope has some other thoughts on 
> managing processes on Windows.
> 
> Best,
> 
> 
> matt
> 
> Ben Burnett wrote:
>> Yes, we have considered using Windows "Job" objects, but, as Matt
>> pointed out, a process can only be a member of one Windows Job
>> object. Thus, if Condor used Job objects, no Condor job could run
>> processes that themselves use Windows Job objects, as they would
>> fail to execute (being unable to add themselves to another
>> Windows Job object).
>> 
>> My understanding is that we have also spoken to MS about this in
>> the past, and that while they have agreed that nested Job objects,
>> or processes in more than one Job, would be cool, there has been
>> little call for or interest in it. Maybe it is time to try
>> talking to them again, given the large rise in multi-core and
>> multi-CPU machines? (Since it seems to be a way to manage a group
>> of processes on one CPU [or core].)
>> 
>> Assuming child processes don't inherit processor affinity, we could
>> easily fake the inheritance by receiving callbacks each time a new
>> child process is created. If its parent has its processor affinity
>> set, and we recognize it (process family code), then we set the
>> child's affinity as well.
>> 
>> Regards, -B
>> 
>> On Thu, July 9, 2009 3:10 pm, Matthew Farrellee wrote:
>>> Todd, Ben,
>>> 
>>> Have you considered such an interface? Does it have the nesting
>>> problem of Job objects within Job objects?
>>> 
>>> Best,
>>> 
>>> 
>>> matt
>>> 
>>> Matt Hope wrote:
>>>> As an aside, on Windows are you using Job objects to control
>>>> this or simply setting the affinity of the first launched
>>>> process?
>>>> 
>>>> Using job objects sorts the child processes problem but might
>>>> not play nicely with people's usage of this (a process can only
>>>> be a member of one job and cannot stop being a member of a job
>>>> once attached to it), so it would be nice to know how this is
>>>> enforced.
>>>> 
>>>> Here's some .Net code to do this yourself (as well as control
>>>> your memory) in case anyone else finds it useful. Translating to
>>>> raw Win32 is pretty trivial.
>>>> 
>>>> You could extend this on Windows to do many of the
>>>> complex/error-prone things Condor currently has to do by hand,
>>>> such as killing all child processes within a job, determining CPU
>>>> time of the job as a whole, the peak memory usage etc...

[snip]
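
Since the .Net code is snipped here, the following is a rough sketch of the same idea in Python with ctypes. It is not the code from the original post. It fills in the Win32 `JOBOBJECT_EXTENDED_LIMIT_INFORMATION` structure with a CPU affinity mask, a job-wide memory limit, a process-count cap, and kill-on-close, then hands it to `SetInformationJobObject`. The helper names (`affinity_mask`, `build_limits`, `create_slot_job`) and the example numbers are made up for illustration; the structure layouts and flag values are the ones from winnt.h.

```python
import ctypes
import sys

# Limit flags and information class as defined in winnt.h.
JOB_OBJECT_LIMIT_ACTIVE_PROCESS = 0x0008
JOB_OBJECT_LIMIT_AFFINITY = 0x0010
JOB_OBJECT_LIMIT_JOB_MEMORY = 0x0200
JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE = 0x2000
JOB_OBJECT_EXTENDED_LIMIT_INFORMATION = 9  # JOBOBJECTINFOCLASS value


class BasicLimits(ctypes.Structure):  # JOBOBJECT_BASIC_LIMIT_INFORMATION
    _fields_ = [
        ("PerProcessUserTimeLimit", ctypes.c_int64),
        ("PerJobUserTimeLimit", ctypes.c_int64),
        ("LimitFlags", ctypes.c_uint32),
        ("MinimumWorkingSetSize", ctypes.c_size_t),
        ("MaximumWorkingSetSize", ctypes.c_size_t),
        ("ActiveProcessLimit", ctypes.c_uint32),
        ("Affinity", ctypes.c_size_t),
        ("PriorityClass", ctypes.c_uint32),
        ("SchedulingClass", ctypes.c_uint32),
    ]


class IoCounters(ctypes.Structure):  # IO_COUNTERS (reserved within job limits)
    _fields_ = [(name, ctypes.c_uint64) for name in (
        "ReadOperationCount", "WriteOperationCount", "OtherOperationCount",
        "ReadTransferCount", "WriteTransferCount", "OtherTransferCount")]


class ExtendedLimits(ctypes.Structure):  # JOBOBJECT_EXTENDED_LIMIT_INFORMATION
    _fields_ = [
        ("BasicLimitInformation", BasicLimits),
        ("IoInfo", IoCounters),
        ("ProcessMemoryLimit", ctypes.c_size_t),
        ("JobMemoryLimit", ctypes.c_size_t),
        ("PeakProcessMemoryUsed", ctypes.c_size_t),
        ("PeakJobMemoryUsed", ctypes.c_size_t),
    ]


def affinity_mask(cores):
    """Turn an iterable of zero-based core indices into a bitmask."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


def build_limits(cores, job_memory_bytes, max_processes):
    """Fill in the limit structure; pure data, so it works on any platform."""
    info = ExtendedLimits()
    info.BasicLimitInformation.LimitFlags = (
        JOB_OBJECT_LIMIT_AFFINITY
        | JOB_OBJECT_LIMIT_JOB_MEMORY
        | JOB_OBJECT_LIMIT_ACTIVE_PROCESS      # caps process count (fork bombs)
        | JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE   # closing the handle kills all
    )
    info.BasicLimitInformation.Affinity = affinity_mask(cores)
    info.BasicLimitInformation.ActiveProcessLimit = max_processes
    info.JobMemoryLimit = job_memory_bytes
    return info


def create_slot_job(cores, job_memory_bytes, max_processes):
    """Create an anonymous job object with the limits applied (Windows only)."""
    if sys.platform != "win32":
        raise OSError("job objects are only available on Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.CreateJobObjectW.argtypes = [ctypes.c_void_p, ctypes.c_wchar_p]
    kernel32.CreateJobObjectW.restype = ctypes.c_void_p
    kernel32.SetInformationJobObject.argtypes = [
        ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p, ctypes.c_uint32]
    kernel32.SetInformationJobObject.restype = ctypes.c_int

    job = kernel32.CreateJobObjectW(None, None)
    if not job:
        raise ctypes.WinError(ctypes.get_last_error())
    info = build_limits(cores, job_memory_bytes, max_processes)
    if not kernel32.SetInformationJobObject(
            job, JOB_OBJECT_EXTENDED_LIMIT_INFORMATION,
            ctypes.byref(info), ctypes.sizeof(info)):
        raise ctypes.WinError(ctypes.get_last_error())
    return job


if __name__ == "__main__":
    limits = build_limits(cores=[2, 3], job_memory_bytes=512 * 1024 * 1024,
                          max_processes=32)
    print(hex(limits.BasicLimitInformation.Affinity))
```

The sketch stops at creating the job. A launcher would still need to start the process suspended, call `AssignProcessToJobObject`, and then resume it so that no child escapes the job. The affinity mask also has to be a subset of the system affinity mask, otherwise `SetInformationJobObject` is expected to fail.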