
Re: [Condor-users] Condor and GPUs



On Fri, Jul 4, 2008 at 2:32 PM, Ian Chesal <ICHESAL@xxxxxxxxxx> wrote:
>> Another question that will help me form a response to your
>> very thoughtful and useful input. What are your thoughts
>> about time limits on the execution? In other words, if a job
>> says "I need 4 cores and 2.5 G of memory and 7.3 M of
>> network bandwidth" will it also say "for 2 hours"? If it does,
>> what are you going to do if the job uses more or less than 2 hours?
>
> For all the same reasons Steffen mentioned I don't think I'd want my
> users to have to try to guess how long their jobs will run for. Either
> wall clock or cycles.

Agreed. Too hard to get right. (I'd love to write something that
parsed the arguments to our jobs and did some heuristic/probabilistic
analysis of running times as a function of the MIPS/FLOPS/network
throughput of the machines. That is well down on my 'never-never'
list, sadly.)

Dynamic multi-core support is definitely becoming a necessity. The
last machines we got were almost all dual-socket, single-core. Last
time I looked at what would be the best bang for buck if we replaced a
rack's worth of older kit, it would have been two-socket,
four-cores-per-socket x86 machines with 16-32 GB RAM and the very
biggest single disks (non-RAIDed) that would fit in the blades (power
and rack space trump almost all other cost/benefit issues).

A brain dump of some of the things that I feel are the most pressing
concerns/issues for us.

What would be nice is real limits on the amount of resources a job
uses, principally:

* CPU
Number of threads/processes, and also which cores they allow
themselves to run on (on NUMA-aware operating systems with
HyperTransport-based CPUs there is a considerable benefit to forcing
jobs to stay on particular cores; see the sketch after this list).
* Disk
This is a right pain and there aren't any easy solutions. User-based
quota systems don't work if multiple jobs from the same user run on
the same machine. Local disk can seriously speed up some operations,
so just dumping everything onto a massive SAN isn't a good solution.
* Memory
Again NUMA makes a big difference. Frankly, now that everything we do
is 64-bit capable this is a must (most jobs would take all the
machine's memory and more if they could).

"Run Away" Jobs
Jobs that, for one reason or another, seem to totally screw the
machine up. For example we seem to encounter some annoying .Net
framework class loading/Side by Side versioning issue where a job can
hang on startup but in doing so it screws all .Net framework based
apps on the machine. It remains in the state till someone spots it and
kills it.
This wastes a lot of time since the resulting jobs end up needing to
be re-queued. I'm looking at was to automate spotting these but it is
quite complex
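
To give a flavour of that automation (a sketch only, not something we
actually run - the PID argument, the one-minute poll and the ten-poll
threshold are all made up for illustration): flag a process that is
still alive but whose CPU time has stopped advancing.

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

/* Total kernel + user CPU time of a process, in 100 ns units. */
static ULONGLONG cpuTime(HANDLE proc)
{
    FILETIME createT, exitT, kernelT, userT;
    ULARGE_INTEGER k, u;
    if (!GetProcessTimes(proc, &createT, &exitT, &kernelT, &userT))
        return 0;
    k.LowPart = kernelT.dwLowDateTime; k.HighPart = kernelT.dwHighDateTime;
    u.LowPart = userT.dwLowDateTime;   u.HighPart = userT.dwHighDateTime;
    return k.QuadPart + u.QuadPart;
}

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    DWORD pid = (DWORD)atoi(argv[1]);   /* PID of the suspect job */
    HANDLE proc = OpenProcess(PROCESS_QUERY_INFORMATION | SYNCHRONIZE,
                              FALSE, pid);
    if (!proc) return 1;

    ULONGLONG last = cpuTime(proc);
    int stalledPolls = 0;
    for (;;) {
        Sleep(60 * 1000);                          /* poll once a minute */
        if (WaitForSingleObject(proc, 0) != WAIT_TIMEOUT)
            break;                                 /* process has exited */
        ULONGLONG now = cpuTime(proc);
        stalledPolls = (now == last) ? stalledPolls + 1 : 0;
        last = now;
        if (stalledPolls >= 10) {                  /* ~10 min with no CPU */
            printf("PID %lu looks hung - kill and requeue it\n",
                   (unsigned long)pid);
            break;
        }
    }
    CloseHandle(proc);
    return 0;
}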

A clean way of preventing/limiting new jobs starting on slots of a
machine marked to be restarted for some reason. This is currently a
pain for our systems people to deal with: some jobs could run for
days/weeks, so none of those should start on machines which need a
restart, but short-running jobs are happy to 'run the risk' of being
hit by a restart and so can run on the remaining slots, with the
understanding that they might get kicked without warning.

We get round these by (in approximate order of usefulness):
1) Ensuring people are well behaved (and telling them off if they
aren't). This works best for threads/processes since our entry points
are consistent (just locking yourself to the core with the same number
as the slot you are running on works for us at the moment, much as in
the affinity sketch above).

2) Windows provides job objects
(http://msdn.microsoft.com/en-us/library/ms684161(VS.85).aspx). These
are fantastic for limiting memory usage - just building this into all
the high-memory-usage jobs totally solves the memory thrashing problem
(a rough sketch of the core of it is after this list).
3) Exploit any and all domain-specific differences in jobs to
pre-determine certain policies (knowing which jobs will be CPU- rather
than disk-constrained, which jobs will be measured in minutes/hours
rather than days, etc.). This enables:
3.1) Domain-specific monitoring tools, so we can easily look at the
state of the pool at any moment, plus historical logging to spot
trends and abnormal behaviour.
3.2) Over-constraining ourselves - for example restricting certain
jobs to only run on a particular slot number (it would be so much
nicer to at least be able to say "only run n of these jobs per actual
machine" or similar).
This works OK in a strongly homogeneous environment, but as things go
multi-core it isn't going to be as simple.

4) Buy bigger disks - this is a right pain in blades; it's not like
you can just add to an array. Sadly this is the most effective
solution to the disk problem, if not the most scalable, since the
desire for more and more data in our arena never goes away.
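
For anyone interested, the core of the job object approach in point 2
is only a few lines. This is a rough sketch rather than what we
actually build into our jobs, and the 2 GB cap is just an example
figure:

#include <stdio.h>
#include <windows.h>

int main(void)
{
    /* Create an anonymous job object and cap per-process memory use. */
    HANDLE job = CreateJobObject(NULL, NULL);
    if (!job) return 1;

    JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits;
    ZeroMemory(&limits, sizeof(limits));
    limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
    limits.ProcessMemoryLimit = (SIZE_T)2048 * 1024 * 1024;  /* 2 GB, example */

    /* Apply the limit and put ourselves (and any children we spawn)
       into the job. */
    if (!SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                                 &limits, sizeof(limits)) ||
        !AssignProcessToJobObject(job, GetCurrentProcess())) {
        fprintf(stderr, "could not apply job object limits\n");
        return 1;
    }

    /* ... do the real, memory-hungry work here: allocations past the
       cap now fail cleanly instead of thrashing the whole machine ... */
    return 0;
}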

There is no easy way to use the existing configuration to deal with
multiple cores, because the negotiation cycle doesn't change its view
of the state of a machine after allocating a job to one of its slots
until that job has started and the refresh back to the collector has
happened. At least some level of update would be good (so job rank and
requirements expressions can reference "this machine has other jobs
running on it" in a safe manner, even if they can't use general things
like load/utilized memory, etc.).

Matt