Re: [Condor-users] Condor and processor affinity



To avoid confusion over terminology, I am calling all Condor jobs 'tasks' in this email.

Almost all (well over 99%) of the tasks on our farm use job objects to limit memory (and now CPU affinity) to the slot they own.
This is done via a shared convention for translating the slot ID into an affinity mask, and by limiting everyone to just 4GB (fortunately a common value across all our current machines).
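
As a rough illustration of the kind of convention described above, here is a minimal Python sketch that maps a slot ID to an affinity mask. The function name, the 1-based slot numbering and the assumption that each slot owns a contiguous block of cores are all hypothetical; the actual convention used on our farm is not spelled out in this email.

```python
# Hypothetical convention: slot N (numbered from 1) owns a contiguous
# block of `cores_per_slot` cores, starting at core (N - 1) * cores_per_slot.

MEMORY_LIMIT_BYTES = 4 * 1024 ** 3  # the common 4GB per-task cap


def slot_affinity_mask(slot_id, cores_per_slot=1):
    """Return a bitmask with one bit set for each core the slot owns."""
    if slot_id < 1:
        raise ValueError("slot_id must be >= 1")
    if cores_per_slot < 1:
        raise ValueError("cores_per_slot must be >= 1")
    first_core = (slot_id - 1) * cores_per_slot
    block = (1 << cores_per_slot) - 1  # cores_per_slot low bits set
    return block << first_core


if __name__ == "__main__":
    for slot in (1, 2, 3, 4):
        print(slot, bin(slot_affinity_mask(slot)))
```

The resulting integer is the sort of value a task would hand to its job object's affinity limit, alongside the memory cap.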

If Condor were to do that for us based on MEMORY and a per-slot config setting such as, say, SLOT1_AFFINITY_MASK, then I'd happily switch off our code.
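
To make the per-slot idea concrete, here is a small Python sketch of how such settings might be read and sanity-checked. SLOT<n>_AFFINITY_MASK is only the knob name suggested above, not an existing Condor setting, and the `NAME = value` line format and both helper functions are assumptions made purely for illustration.

```python
import re

# Matches lines such as "SLOT1_AFFINITY_MASK = 0x3" (hypothetical knob).
_KNOB = re.compile(r"^\s*SLOT(\d+)_AFFINITY_MASK\s*=\s*(\S+)\s*$")


def parse_slot_masks(config_text):
    """Return {slot_number: mask} for every SLOT<n>_AFFINITY_MASK line."""
    masks = {}
    for line in config_text.splitlines():
        match = _KNOB.match(line)
        if match:
            # Base 0 accepts decimal, 0x... hex and 0b... binary literals.
            masks[int(match.group(1))] = int(match.group(2), 0)
    return masks


def overlapping_slots(masks):
    """Return slots whose mask shares a core with a lower-numbered slot."""
    claimed = 0
    clashes = []
    for slot in sorted(masks):
        if masks[slot] & claimed:
            clashes.append(slot)
        claimed |= masks[slot]
    return clashes


if __name__ == "__main__":
    text = "SLOT1_AFFINITY_MASK = 0x3\nSLOT2_AFFINITY_MASK = 0xC\n"
    masks = parse_slot_masks(text)
    print(masks, overlapping_slots(masks))
```

A check like overlapping_slots is the sort of thing central configuration would make possible, since today each task works out its own mask on the fly.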

Simply marking tasks that use job objects would likely be a case of the submit ClassAd including either a

JobWantsOwnWindowsJobObject = true

value, or allowing it to be inferred from the requirements by something like

Requirements = blah && (ALLOWS_OWN_WINDOWS_JOB_OBJECTS)

where ALLOWS_OWN_WINDOWS_JOB_OBJECTS can then be a config setting on the execute machines (so if someone really wants to lock things down to be safer, they can make sure that it is mandatory).

For Condor to do the job objects properly, it would have to create a separate child process, have that process create the job object, and then have that process spawn the real child process (the task). Attaching after the event would be a security hole. If the starter is spawned afresh for each task then it could do this itself (but it would then be part of the job, which might make shutdown a bit tricky, not to mention the risk of it running out of memory!)

Matt

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Matthew Farrellee
Sent: 14 July 2009 14:09
To: Condor-Users Mail List
Subject: Re: [Condor-users] Condor and processor affinity

Matt,

Is there some way for Condor to detect that a process is going to use a
job object itself? My understanding is that if Condor spawns a job in a
job object, then that job cannot use a job object for something it spawns.

The multiple code paths would have to be in Condor, and it would either
have to not use job objects and provide a base set of features or use
job objects and reject any jobs that also try to use job objects.

Any idea how many of your programs Condor wouldn't be able to run
because they use job objects?

Best,


matt

Matt Hope wrote:
> Sorry for the late reply; I have been off list for a while.
> 
> I suspect the number of people using job objects on windows is
> relatively low and any that are are sufficiently savvy to spot issues
> with condor conflicting with theirs.
> 
> Having condor default to use job objects on windows to control
> everything (optionally enforcing the memory and cpu limiting) would
> make the common code path much simpler (and more robust). You get:
> * Hassle of spotting/counting multiple processes is completely gone
> * Free killing of everything associated with the slot
> * memory/cpu counting for total effort
> * PriorityClass/SchedulingClass handles the renice
> * Memory can be limited rather than killed after exceeding
> * CPU affinity
> * a fork bomb style problem can be prevented
> 
> However because some people may want total control over the job
> objects you will end up needing two code paths, reducing the
> simplicity considerably (and making testing much harder).
> 
> I'd argue that it is worth doing if there are few/no users making use
> of job objects currently who would not have all their needs served by
> simply using the integrated condor functionality.
> 
> One option is to eliminate the need for two code paths by making the
> job object path the only supported way to achieve significant control
> and have the other route rely on simple mechanisms like the per slot
> user to function. Thus people willing to take control themselves on a
> per job basis can do as they wish without significantly impacting the
> code complexity.
> 
> It's a balancing act between your users and your developers, from my
> point of view having it optionally integrated into condor would be
> excellent since I could let the configuration of slot based cpu
> affinity and memory be centrally controlled rather than each job
> having to 'work it out on the fly' as it currently does.
> 
> Also it would allow the possibility of things like:
> 
> Dynamic controls whereby a minimum number of cores are always
> available but, if the machine as a whole is partly unused, those cores
> can be dynamically assigned to the existing slots and then taken away
> again.
> 
> Ganging together multiple slots so that you could take ownership of
> (say) 2 and get access to all their resources with an eviction of one
> allowing either a reduction in your resources or an evict from all of
> them.
> 
> In relation to the above, using NUMA-specific information to group
> slots/cores intelligently, so ganged jobs would be cleanly migrated
> onto a shared node where possible (moving the other jobs about as
> needed to achieve this).
> 
> IO related aspects are not currently present but are reserved for
> future use, this could be a route to IO throttling or disk usage
> limits (very speculative)
> 
> All that said, if you have existing users using jobs in alternate ways
> you may just want to accept the simple "use what already exists and
> works" route.
> 
> Matt
> 
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Matthew Farrellee
> Sent: 13 July 2009 14:48
> To: Ben Burnett
> Cc: Condor-Users Mail List
> Subject: Re: [Condor-users] Condor and processor affinity
> 
> (added condor-users back, it had been dropped)
> 
> This is all good to know. Maybe Matt Hope has some other thoughts on 
> managing processes on Windows.
> 
> Best,
> 
> 
> matt
> 
> Ben Burnett wrote:
>> Yes, we have considered using Windows "Job" objects, but, as Matt
>> pointed out, a process can only be a member of one Windows Job
>> object. Thus, if Condor used Job objects, no Condor jobs could
>> use processes that used Windows Job objects, as the Condor jobs
>> would fail to execute (being unable to add themselves to another
>> Windows Job object).
>> 
>> My understanding is that we have also spoken to MS about this in
>> the past, and that while they have agreed that nested Job objects,
>> or processes in more than one Job, would be cool, there has
>> been little call for or interest in it.  Maybe it is time to try
>> talking to them again, given the large rise in multi-core and
>> multi-cpu machines? (Since it seems to be a way to manage a group
>> of processes on one cpu [or core].)
>> 
>> Assuming child processes don't inherit processor affinity, we could
>> easily fake the inheritance by receiving callbacks each time a new
>> child process is created.  If its parent has its processor affinity
>> set, and we recognize it (process family code), then set the
>> child's affinity as well.
>> 
>> Regards, -B
>> 
>> On Thu, July 9, 2009 3:10 pm, Matthew Farrellee wrote:
>>> Todd, Ben,
>>> 
>>> Have you considered such an interface? Does it have the nesting
>>> problem of Job objects within Job objects?
>>> 
>>> Best,
>>> 
>>> 
>>> matt
>>> 
>>> Matt Hope wrote:
>>>> As an aside, on Windows are you using Job objects to control
>>>> this, or simply setting the affinity of the first launched
>>>> process?
>>>> 
>>>> Using job objects sorts the child processes problem but might
>>>> not play nicely with people's usage of this (a process can only
>>>> be a member of one job and cannot stop being a member of a job
>>>> once attached to it), so it would be nice to know how this is
>>>> enforced.
>>>> 
>>>> Here's some .Net code to do this yourself (as well as control
>>>> your memory) in case anyone else finds it useful. Translating to
>>>> raw win32 is pretty trivial.
>>>> 
>>>> You could extend this on Windows to do many of the
>>>> complex/error-prone things Condor currently has to do by hand,
>>>> such as killing all child processes within a job, determining the
>>>> cpu time of the job as a whole, the peak memory usage etc...

[snip]