
Re: [Condor-users] use free machines first but overload cpu's



Hi Steffen,

You are right, we do not really parallelise but split a job.
This is done on a per-file basis, and each job processes about 100 files.

The idea is the following: we have similar machines which can all process 4 jobs without running into resource problems. Most of the time a user can use one CPU for all of his jobs. But sometimes the cluster is overloaded, and for these cases I want to still have resources left for the 5-minute jobs. If such a user has to wait 20 minutes in some rare cases, that would be no problem. If he has to wait a whole day even once, I will run into trouble and Condor will maybe not be accepted.

I have trouble understanding the ranking.
RANK = (7 - VirtualMachineID)
seems to be a good idea.
Where do I have to put this rank? In the local config file?
I don't understand the ClassAd method.

Harald



Steffen Grunewald wrote:

On Thu, Jan 19, 2006 at 06:26:14PM +0100, van Pee wrote:
Hi all,

My problem is the following: all users should have the same priority and can use all machines. The intent is to give all users maximum throughput. If there are small jobs which can be parallelised, they should always run!

Harald,

I'm a bit puzzled: first you're talking about vanilla (which is fine for
a lot of applications), then you want to parallelise. Condor vanilla is
meant for *serialised* tasks. If you want parallel execution, you will
need the MPI universe.

Let me assume that you meant "split a task into n subtasks which can run
independently of each other" - then it can happen that the same CPU (or
virtual machine, in Condor-speak) will process all n jobs if no other
resources are free. Remember that a maximum-throughput solution may be
unfair to individual users! It's the overall throughput that counts -
there's no guarantee that your individual job batch will be finished
within a given time range.
(Of course there are means to tweak the configuration to favor certain
classes of tasks, but that's not what you'd like to have at the very
beginning of your Condor experience.)

If I use just as many CPUs as there are (6 at the moment), then I can run just 6 jobs at once. If a user wants to run a job split across 6 CPUs (on a per-file basis) which takes 5 minutes in total, it could happen that
he has to wait hours or days for this job, which is not acceptable.

If you have n CPUs and don't redefine virtual machines, there will be a
one-to-one mapping of CPUs to VMs, correct. Each of those VMs will get
negotiated (by the master) and matched with a job, and once it has finished
its work it will receive the next chunk of work. In our setup, a VM
negotiated for a certain user will stay assigned to that user until it
runs out of work - so if you manage to grab at least one CPU, odds are good
that you'll finish the whole batch in limited time.

With NUM_CPUS = ,
I can change this, but it seems that Condor first uses all 6 (virtual,
of course) CPUs of the first machine and then starts with the next one!
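(For reference: NUM_CPUS is a startd configuration macro, so it belongs in the machine's local config file. The actual value was elided in the mail above; the 6 below is only a hypothetical illustration.)

```
# condor_config.local sketch - the value 6 is hypothetical.
# NUM_CPUS overrides the detected CPU count, so
NUM_CPUS = 6
# makes the startd advertise 6 virtual machines (vm1 ... vm6)
# on this host, regardless of the real number of CPUs.
```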

That depends on the negotiator cycles and will randomize over time.
You may prioritise using a RANK=(7-VirtualMachineID).
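To answer the "where do I put this rank?" question: for a per-job preference it can go into the submit description file. A minimal sketch of a vanilla-universe submit file (the executable name is hypothetical; 7 assumes at most 6 VMs per machine):

```
# vanilla-universe submit file sketch (executable name hypothetical)
universe   = vanilla
executable = process_files
# Prefer lower-numbered VMs; since every machine has a vm1,
# jobs tend to spread across machines before stacking up on one host.
rank       = (7 - VirtualMachineID)
queue 6
```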

What I want to have is:
I allow a maximum of 4 jobs per real CPU. We have 2 types (later 3 or 4 types) of CPUs: fast and faster.
Condor should use
1. all faster CPUs with one job each
2. all fast CPUs with one job each

Use RANKing to prefer faster CPUs, based on the ClassAd attributes related to speed (Mips or the like). To prefer slow machines, use
100000 - Mips :-)
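A minimal sketch of such a speed-based rank in the submit file (Mips is a standard machine-ad benchmark attribute; the constant 100000 is arbitrary):

```
# prefer faster machines:
rank = Mips
# or, to prefer the slower ones, invert it:
# rank = 100000 - Mips
```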

Are you sure you want to run 4 jobs on a single CPU? What about real
and virtual memory? If the machine starts swapping, your execution times may explode.

If there are 6 jobs, each real CPU should run one of them.
If there are 12 jobs, each real CPU should run two of them.
And so on!

What's the point? If every real CPU has a single job to run, it will do
so 100% of the time, and finish after time T. If the same real CPU (split
into 2 VMs) has to run 2 jobs, it will run every job at 50% at most,
and finish both after 2*T (or later, if swapping has to be accounted for).
In both cases, 2 jobs will be done after 2*T - but the one-to-one solution
is far more predictable.

For me the Condor configuration is too sophisticated, and I can't find the
correct settings for the task above. It would therefore be very helpful if someone could point me in the right direction.

Don't try to do everything at the same time. Serialisation is a good
thing (unless you're using MPI). If you need dependencies, DAGMan will be your friend...
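For illustration, a minimal DAGMan input file expressing one dependency (the submit-file names are hypothetical):

```
# run job A first, then job B once A has finished successfully
JOB A stage1.sub
JOB B stage2.sub
PARENT A CHILD B
```

Such a file is handed to condor_submit_dag, which then submits the jobs in dependency order.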

Cheers,
Steffen