[Condor-users] New to Condor - Difficult (I think) problem...
- Date: Mon, 15 May 2006 09:59:01 -0700
- From: "Bob Mortensen" <condor@xxxxxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] New to Condor - Difficult (I think) problem...
I'm new to Condor and distributed computing, so the problem I'm trying to
solve may be trivial, difficult, or impossible. Briefly, here is what I
need to do.
We have a pool of multi-CPU (actually dual-CPU) Windows machines on which we
would like to maximize CPU utilization. We have three types of jobs
to run, with the following requirements for each job type:
1. Single-CPU (about 80% of jobs). These jobs require only one CPU and
thus can run concurrently on the same multi-CPU machine up to the number
of CPUs on the machine. This seems easy enough and should work "straight
out of the box".
2. Multi-CPU (about 15% of jobs). These jobs require all the CPUs on the
machine, with no other job running on the machine. The application will take
care of starting its own processes/threads to make full use of all CPUs.
3. Multi-CPU, Multi-Machine (about 5% of jobs). These jobs require
multiple multi-CPU machines: one master and one or more "slaves". Each
machine will be dedicated to this job (i.e. no other jobs on these
machines). The application, running on the "master" machine, will take care
of starting its own processes/threads (local and remote) to fully utilize
the machines assigned to the job. In addition, the "master" machine needs
to get a list of all the "slave" machines. (It may be sufficient to limit
this to one slave.)
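To make the three cases concrete, the requirements above could be modeled
roughly as follows. This is only an illustrative Python sketch, not Condor
configuration: the names JobType, JOB_TYPES and pick_hosts are hypothetical,
and type 3 is assumed to use one master plus one slave.

```python
from dataclasses import dataclass

CPUS_PER_MACHINE = 2  # dual-CPU machines, as described above


@dataclass(frozen=True)
class JobType:
    name: str
    machines: int    # number of machines the job needs
    exclusive: bool  # True if the job must have each machine to itself


# Hypothetical table of the three job types from the list above.
JOB_TYPES = {
    1: JobType("single-CPU", machines=1, exclusive=False),
    2: JobType("multi-CPU", machines=1, exclusive=True),
    # Assumption: one master plus one slave, i.e. two machines.
    3: JobType("multi-CPU, multi-machine", machines=2, exclusive=True),
}


def pick_hosts(job, free_cpus):
    """Return the hosts a job could start on, or None if it must wait.

    free_cpus maps hostname -> number of idle CPUs on that host.
    An exclusive job needs a completely idle machine; a single-CPU
    job needs just one idle CPU.
    """
    needed = CPUS_PER_MACHINE if job.exclusive else 1
    candidates = [h for h, free in sorted(free_cpus.items()) if free >= needed]
    if len(candidates) < job.machines:
        return None
    return candidates[:job.machines]
```

For example, with one idle machine and one half-busy machine, a type 1 or
type 2 job could start, but a type 3 job would have to wait until a second
machine is completely idle.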
Once started, each job must complete before another is started. If it
helps, we may be able to identify two machines to handle the "Multi-CPU,
Multi-Machine" case, as long as they can also run type 1 and 2 jobs when
type 3 jobs are not in the queue. Writing scripts around the application
to gather information to pass to the application is also a possible
solution (we have MKS and Perl available on all machines).
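As a rough idea of what such a wrapper might look like, here is a minimal
sketch. It is written in Python purely for illustration (the same logic
would port to Perl), and everything specific in it is an assumption: the
one-hostname-per-line file format, the function names, and the --master
and --slaves options, which stand in for whatever the real application
expects.

```python
def read_hosts(path):
    """Read one hostname per line, skipping blank lines.

    Assumption: something (the scheduler or another script) has written
    the machines assigned to the job into this file, master first.
    """
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


def build_command(app, hosts):
    """Build the argument list used to start the application on the master.

    The first host is treated as the master and the rest as slaves.
    The --master and --slaves options are hypothetical placeholders.
    """
    if not hosts:
        raise ValueError("at least one host (the master) is required")
    master, slaves = hosts[0], hosts[1:]
    cmd = [app, "--master", master]
    if slaves:
        cmd += ["--slaves", ",".join(slaves)]
    return cmd
```

The resulting list could then be passed to subprocess.run() on the master
machine. For type 1 and type 2 jobs the host list would contain only the
local machine, so no slave option is added.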
If this is fairly straightforward, please say so, but also point me in the
direction of some documentation and, preferably, examples.
Any pointers and/or advice will be greatly appreciated.