[condor-users] Scaling to hundreds, then thousands of nodes
- Date: Wed, 3 Mar 2004 10:42:21 -0500
- From: "David Vestal" <dvestal@xxxxxxxxxxx>
I'm seeking advice. We're currently incorporating the computers of the local school system into a large, county-wide grid. After the rollout is completed, 4700 computers will be on it. I have a few concerns about this, and I would appreciate any advice from those who have already done this.
1) We're using a single machine as the central manager and the only submit machine. Is this inappropriate?
2) Can I use the CondorView module to create HTML grid-statistics pages under Windows?
3) Running under Windows, we're forced to use the Vanilla universe, and thus we checkpoint manually using checkpoint files written to the local directory. We've had to put in a pretty ugly workaround to fix what may be a flaw in Condor's ability to return files from Windows Vanilla-universe jobs. I'd like any opinions on how I can avoid doing that; see my previous message under the topic "output files not being returned upon preemption."
4) We're running molecular simulations on grid nodes, and the bandwidth requirements are heavy. Starting a simulation on a grid node requires downloading just over 3 MB of data. Retrieving simulation results requires uploading about 20 MB of data. Restarting a simulation requires uploading, then downloading, anywhere from 3 to 20 MB of data. We want to run thousands of simulations simultaneously, all of which could be preempted during the course of a typical school day. How can we best mitigate the exploding bandwidth requirements? Our central manager has a direct connection to a fiber-optic backbone connecting many schools, with T1s or T3s into the rest. However, I worry that my central manager may get swamped with returning files. After all, 2000 machines each returning 20 MB of data is 40 GB, which could be problematic, to say the least. Suggestions?
5) Does Condor have an intrinsic limitation that would prevent running thousands of jobs simultaneously?
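To put rough numbers on question 4, here is a back-of-the-envelope sketch in Python. It only reproduces the 2000 x 20 MB arithmetic and converts it into transfer time at nominal T1 and T3 line rates. The function names are made up for illustration, decimal megabytes are assumed, and protocol overhead and the (unknown) speed of the fiber backbone are ignored, so treat the results as optimistic lower bounds rather than measurements.

```python
# Rough estimate of the data volume in question 4 and how long it would
# take to move through a single link. All helper names are hypothetical.

MB = 10**6  # assumption: decimal megabytes
GB = 10**9

# Nominal line rates in bits per second (protocol overhead ignored).
T1_BPS = 1.544e6
T3_BPS = 44.736e6


def total_bytes(machines, mb_per_machine):
    """Total bytes produced when every machine returns mb_per_machine MB."""
    return machines * mb_per_machine * MB


def transfer_hours(num_bytes, link_bps, efficiency=1.0):
    """Hours needed to push num_bytes through one link.

    efficiency is an assumed fraction of the line rate that is actually
    usable (1.0 means the full nominal rate).
    """
    return num_bytes * 8 / (link_bps * efficiency) / 3600


if __name__ == "__main__":
    volume = total_bytes(2000, 20)
    print(f"Total volume: {volume / GB:.0f} GB")
    print(f"Over one T3:  {transfer_hours(volume, T3_BPS):.1f} hours")
    print(f"Over one T1:  {transfer_hours(volume, T1_BPS):.1f} hours")
```

Under these assumptions, 40 GB funneled through a single T3 takes roughly two hours, and through a single T1 roughly 58 hours. In practice the load would be spread over several school links, so the per-link figures matter mostly for the slowest sites, while the aggregate figure matters for the one machine that receives everything.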
Thanks for taking the time to read my overly long questions.