
Re: [condor-users] Some questions regarding Condor, Clusters and The Grid



Heinz, Chris, and other readers:

Thanks for looking into Condor.

The terminology of computing is heavily overloaded, so I feel your pain
regarding the meaning of "cluster" and the "grid."  We aren't too picky
about etymology, so whatever terms you want to use are fine with us!
In general, we use those terms as rough synonyms for "a bunch of computers."

If you are currently evaluating how Condor can be used, there are two key
points that I would like you to come away with:
	1 - Condor can manage many different computing configurations.
	2 - The Condor Project encompasses much more than just CPUs.

*** Condor can manage many different computing configurations.

Condor is not "merely" a "cluster" management system. A single Condor pool
can encompass many different kinds of computing structures.  For example,
the Condor pool at UW-Madison includes several clusters in machine rooms,
several large SMPs, all of our office workstations, and machines in teaching
labs.  From Condor's perspective, it's just a bunch of computers.  Users
select what resources they want to use based on concrete properties such as
available memory, CPU type, and so forth.  For us, the notion of "cluster"
is just a purchasing matter and has little effect on how users identify and
harness resources.
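
To make this concrete, here is a minimal sketch of a submit description
that selects resources by property rather than by name.  The executable
and the thresholds are made up for illustration; Memory, Arch, and
OpSys are standard machine ClassAd attributes:

	# Hypothetical submit file: ask for machines by property,
	# not by location.
	universe     = vanilla
	executable   = my_analysis
	requirements = (Memory >= 512) && (Arch == "INTEL") && (OpSys == "LINUX")
	output       = my_analysis.out
	error        = my_analysis.err
	log          = my_analysis.log
	queue

You can preview what would match before submitting:

	condor_status -constraint '(Memory >= 512) && (Arch == "INTEL")'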

The key problem is that all of these computing configurations suffer,
to a greater or lesser degree, from the twin problems of autonomy and
failure.
Personal workstations and "grid" environments experience frequent failures
as networks go down and independent owners and users enter and leave the
system.  Clusters are measurably more reliable, but still experience
failures.  For a short time, we were losing one cluster node a day due to
defective capacitors on the motherboards.  Our users didn't notice. Condor
is designed from the ground up to deal with these sorts of events: every
participant is free to back out, fail, or give up at any time without
compromising the operation of the system.  If you have Condor, you won't
care what hours the machine room is open.
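
As one hedged illustration of how a job survives such events: in
Condor's "standard" universe, a program relinked with condor_compile
is checkpointed periodically, so when a node dies the job resumes
elsewhere from its last checkpoint instead of starting over.  A
sketch, with a made-up program name:

	# Relink against the Condor checkpointing library:
	condor_compile gcc -o my_analysis my_analysis.c

	# Then submit it in the standard universe:
	universe   = standard
	executable = my_analysis
	log        = my_analysis.log
	queue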

One of our book chapters describes the high-level structure of Condor,
and explains how it relates to "clusters" and "grids" in the way that they
are usually defined:
	http://media.wiley.com/product_data/excerpt/90/04708531/0470853190.pdf

*** Condor encompasses more than just CPUs.

In 1985, the Condor project was conceived as a cycle-stealing system. Things
have changed since then.  The Condor project is heavily involved in data
access, scheduling, networking, and more.  Just to give you a flavor of all
the things we do, consider:

	The NeST storage appliance exports file storage as an allocable,
	auditable, secure resource to be consumed by remote users.
		http://www.cs.wisc.edu/condor/nest

	The Stork data scheduler makes data transfer a "job" that can
	be scheduled, logged, managed, and controlled in the same way
	as a job that executes a program on a CPU (see the sketch after
	this list).
		http://www.cs.wisc.edu/condor/stork

	The Hawkeye monitoring system can be used to oversee a computing
	cluster, allowing an administrator to probe system properties,
	schedule periodic tasks, and mine for unusual events.
		http://www.cs.wisc.edu/condor/hawkeye

	The Parrot virtual file system attaches ordinary unmodified
	applications to new storage devices such as NeST and GSI-FTP.
		http://www.cs.wisc.edu/condor/parrot

	And more, such as the DAGMan meta-scheduler for managing
	dependencies between jobs (also sketched below):
		http://www.cs.wisc.edu/condor/dagman/
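
Two of these are easy to sketch.  A Stork job is a small description
of a data placement, in the same spirit as a compute job.  This is
based on the project's published examples, so treat the exact
attribute names and URLs as assumptions:

	[
		dap_type = "transfer";
		src_url  = "file:/tmp/input.dat";
		dest_url = "gsiftp://storage.example.edu/data/input.dat";
	]

A DAGMan input file declares jobs and the dependencies between them;
condor_submit_dag then runs them in the right order, and failed
pieces can be retried or resumed from a rescue DAG.  The submit-file
names here are hypothetical:

	# diamond.dag: run A first, then B and C in parallel, then D.
	JOB A a.submit
	JOB B b.submit
	JOB C c.submit
	JOB D d.submit
	PARENT A CHILD B C
	PARENT B C CHILD D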

Please let us know if you have any more questions or ideas...

Cheers,
Doug