
Re: [Condor-users] Network traffic and data storage recommendations [SEC=UNCLASSIFIED]



On Thu, 13 Jan 2011, Lenou, Peter (Contractor) wrote:

UNCLASSIFIED

Hi,



I am new to Condor and was wondering what schemes people in the Condor
community use to manage the amount of data and network traffic their
jobs produce.


The key thing you haven't told us here is how big your condor pool
is and how many of these 10,000,000 jobs would be running
simultaneously. In general you would not want to use the default
condor file transfer mechanism to move 10 million 1GB files. If I
remember correctly, the new development version of condor is going
to allow a pluggable file transfer mechanism, so you can specify
which mechanism is used to transfer the files, e.g. gridftp, hadoop,
http, etc.
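
To make that concrete, here is a rough sketch of what a submit file
using URL-based input transfer might look like once that plugin
mechanism is released; the hostname and file names are made up for
illustration, and the exact syntax may change:

  # sketch only -- URL-based input transfer via a pluggable
  # transfer mechanism; hostname and paths are hypothetical
  universe                = vanilla
  executable              = MyExecutable.exe
  arguments               = -in input_$(Process).dat -out output_$(Process).dat
  transfer_input_files    = http://data.example.org/inputs/input_$(Process).dat
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  queue 1000

Each URL would be fetched into the job's scratch directory under its
base name, so the job sees input_$(Process).dat locally.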



For example, my Condor requirements are that I submit a job
specifying an input and output file via command line arguments, e.g.
MyExecutable.exe -in inputFile -out outputFile. In extreme cases,
each batch may contain 10 million jobs, each creating an output file
1GB in size.

I would want the output files to be transferred back to the submit
machine as the jobs complete in order to limit network traffic. The
submit machines won't have a large amount of direct attached storage,
so each execution machine in the pool would have direct attached
storage for the output files. I will then have a daemon to bring all
the output files together in a central location for analysis.


It would have to be some daemon...that is a tall order.


Does this sound like a feasible solution?



Is there a better solution, and how would it be implemented, e.g.
network architecture, ClassAds, etc.?



Since you say all your execution machines have direct attached
storage, you might seriously consider using Hadoop and HDFS.
That's a cheap way to let commodity computers share large amounts
of block data, and if you go that step, you might also consider
MapReduce or some other product like it to do the gathering and
concatenation of the files. Every execute node would be serving
some of your data, and thus you have a many-to-many network problem
instead of a many-to-one.
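
As a sketch (paths and file names made up): each job, or a thin
wrapper around it, could push its output into HDFS as it finishes,
and the gathering step then becomes a single command instead of a
custom daemon:

  # on each execute node, after the job writes its output:
  hadoop fs -put output_1234.dat /condor-out/

  # later, merge everything in that directory into one local file:
  hadoop fs -getmerge /condor-out combined.dat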

Also, I doubt very much that you could put 10 million jobs into
a single condor cluster. DAGMan is your friend here, and maybe
some of the higher-level technologies that use it, such as Pegasus
or Swift.
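
A sketch of what that might look like (file names are made up):
generate a DAG with one node per task and let DAGMan throttle how
many are in the condor queue at once:

  # jobs.dag -- generated by a script, one node per task
  JOB  task0  myjob.sub
  VARS task0  idx="0"
  JOB  task1  myjob.sub
  VARS task1  idx="1"
  # ... and so on up to the last task ...

where myjob.sub picks up the variable:

  arguments = -in input_$(idx).dat -out output_$(idx).dat

and you submit with something like

  condor_submit_dag -maxjobs 5000 jobs.dag

so that only 5000 DAG nodes are submitted to the queue at any one
time.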

Steve Timm




How do other users in the Condor community deal with large data files
and network traffic?



PL



IMPORTANT: This email remains the property of the Department of Defence
and is subject to the jurisdiction of section 70 of the Crimes Act 1914.
If you have received this email in error, you are requested to contact
the sender and delete the email.




--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Group Leader.
Lead of FermiCloud project.