
Re: [condor-users] Scaling to hundreds, then thousands of nodes




1) We're using a single machine as the central manager and the only submit machine. Is this inappropriate?

In general, I like to see multiple submit points and a distinct central manager if you have thousands of jobs. It's more likely to scale easily for you.


We recently helped set up a Condor pool with about 1,000 CPUs and 10,000 queued jobs that used only a single (Linux) submit point, and it works well. But if you experience problems scaling to 4,000 running jobs, you may need multiple submit machines.

Keep in mind that each running job (as opposed to a queued, idle job) has its own process associated with it on the submit machine.
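To make the scaling concern concrete, here is a rough sketch of how the per-job process count on the busiest submit machine drops as you add submit points. It assumes running jobs are spread evenly across submit machines, which is an idealization; the function name and the host counts are made up for illustration and are not part of Condor.

```python
import math


def processes_on_busiest_submit_host(running_jobs: int, submit_hosts: int) -> int:
    """Per-job processes on the most loaded submit host.

    Assumes one process per running job (queued, idle jobs do not count)
    and an even spread of running jobs across submit hosts.
    """
    if submit_hosts < 1:
        raise ValueError("need at least one submit host")
    if running_jobs < 0:
        raise ValueError("running_jobs cannot be negative")
    return math.ceil(running_jobs / submit_hosts)


if __name__ == "__main__":
    # 4,000 running jobs is the figure mentioned above; the host counts
    # are hypothetical.
    for hosts in (1, 2, 4, 8):
        count = processes_on_busiest_submit_host(4000, hosts)
        print(f"{hosts} submit host(s): about {count} per-job processes each")
```

How many such processes one machine can comfortably carry depends on its memory and load, so treat the result as a planning number, not a limit.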

2) Can I use the CondorView module to create HTML grid-statistics pages under Windows?

I don't know what you mean by that. Can you see this web page okay on Windows?


http://pumori.cs.wisc.edu/condor-view-applet/

Or are you asking whether it can be hosted on a Windows web server? I don't know about that. It's a set of simple scripts, but currently they require the Bourne shell to run. They may be easy to run under Cygwin or to port, but I don't know.

4) We're doing molecular simulations on grid nodes, and the required bandwidth is pretty intensive. To start a simulation on a grid node requires downloading just over 3MB of data.

How similar is the data for each job? Does every job start with the same set of data, but different parameters? If so, there may be good ways to prestage the data, either manually or with some clever Condor technology.


Getting simulation results requires uploading about 20MB of data. Restarting simulations requires uploading, then downloading anywhere from 3-20 MB of data. We want to run thousands of sims simultaneously, all of which could be preempted during the course of a typical school day. How can we best mitigate the exploding bandwidth requirements? Our central manager has a direct connection to a fiberoptic backbone connecting many schools, with T1s or T3s into the rest. However, I worry that my central manager may get swamped with returning files. After all, 2000 machines returning 20MB of data is 40GB, which could be problematic to say the least. Suggestions?

The standard universe can choose an "appropriate" checkpoint server, meaning a checkpoint server that you have decided is "close" to the machine the job was running on.


I realize that this doesn't translate to the vanilla universe, but I wonder if we could do something similar. Hmmm...
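To put the 40GB figure from the question in perspective, here is a back-of-the-envelope sketch of how long that much data would take to cross a single link. It assumes decimal megabytes, the nominal T1 and T3 line rates (1.544 and 44.736 Mbit/s), a fully available link, and no protocol overhead, so real transfers would be slower. The function names are made up for illustration.

```python
# Nominal line rates in bits per second.
T1_BPS = 1.544e6
T3_BPS = 44.736e6


def total_bytes(machines: int, mb_per_machine: float) -> float:
    """Total payload in bytes, using decimal megabytes (1 MB = 10**6 bytes)."""
    return machines * mb_per_machine * 10**6


def transfer_seconds(n_bytes: float, link_bps: float) -> float:
    """Ideal transfer time: payload bits divided by the link rate."""
    if link_bps <= 0:
        raise ValueError("link rate must be positive")
    return n_bytes * 8 / link_bps


if __name__ == "__main__":
    # 2,000 machines each returning 20 MB, as in the question.
    payload = total_bytes(2000, 20)
    print(f"payload: {payload / 10**9:.0f} GB")
    for name, rate in (("T1", T1_BPS), ("T3", T3_BPS)):
        hours = transfer_seconds(payload, rate) / 3600
        print(f"{name}: about {hours:.1f} hours if the link is saturated")
```

Under those assumptions the full 40GB is roughly two hours on a T3 and more than two days on a T1, which is why keeping results "close" to where the jobs ran, or staggering the returns, looks worth exploring.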

5) Does Condor have an intrinsic limitation that would prevent running thousands of jobs simultaneously?

No intrinsic limitation.


When a job is running, it is monitored by a process back on the submit machine, not by any centralized process. So as long as your submit host isn't overwhelmed, you can have thousands of jobs running. This is why I say that having multiple submit points may be useful.

Let me ask you: why do you want a single submit point? Do you have a single user submitting jobs? Are you building a web-based portal that lets users submit jobs from their web browser? Or is it just for convenience?

The first two are good reasons for having a single submit point, and I recognize that it's a lot of work to use multiple submit points for them. :)

-alain

