
[HTCondor-users] Condor Cluster with Multiple NAS

Dear Group,

First-time question, long-time Condor user.

* Background
We have an Ubuntu Condor cluster (12 machines, 102 slots) and one NAS (16 TB) connected via a 20-port gigabit router. The NAS is mounted on every machine.

Models are read- and write-intensive: calculation time is only about 10% of total wall-clock time on a single machine with no contention.

Conceptually, we run 15 high-level models that each consist of O(1000) sub-models (pure parallelism). All models read from the same database for their input (the "input database") but each writes to its own database (its "write database"). Hierarchically, all submodels of a high-level model eventually need to be stored in the same directory, but there is no relation among the high-level models; they can be stored on physically different drives.

All models are submitted by the same user at the same time. That is, when we integrate the models, we will queue ~15,000 jobs. Note that in practice we limit the queue to 1000 and release models into the queue as jobs roll off.

* Problem
One central NAS can't keep up as we scale up the cluster, and we want to find a cheap, easy-to-maintain solution.

* Proposed solution (input welcome)

(1) Buy q (e.g. 10) small NASes and mount them on all machines
(2) Distribute the input database to 1..n (e.g. 10) of those NASes (basically an rsync)

For each high-level model run:
- All submodels are told to read from the "nearest" replicated input database
- All submodels are told to write to the same unique NAS
- Each high-level model is limited to 10 slots on the cluster
We hope to then be able to submit 10 models (100 slots) without slowdown.
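For concreteness, a minimal sketch of how we imagine the routing could work, assuming each execute node advertises a custom machine attribute (NearestInputNAS, the mount points, the script name, and the output path below are all placeholders we invented, not anything we have running):

```
# In each execute node's condor_config, advertise its local replica
# (hypothetical attribute and mount point):
#   NearestInputNAS = "/mnt/nas3"
#   STARTD_ATTRS = $(STARTD_ATTRS) NearestInputNAS

# Submit description for one high-level model; $$() substitutes the
# matched machine's advertised attribute at match time.
executable = run_submodel.sh
arguments  = --input $$(NearestInputNAS)/input-db --output /mnt/nas7/model3
queue 1000
```

If that substitution approach is wrong-headed, we'd be glad to hear of a better mechanism.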

One operational question we can't work out: how do we limit each high-level job to 10 slots while submitting 10 jobs at a time?
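Our best guess so far is HTCondor's concurrency limits, roughly like the sketch below (the limit names are placeholders, one per high-level model), but we're not sure this is the intended tool:

```
# In the central manager's condor_config: cap each model at 10 running jobs
MODEL1_LIMIT = 10
MODEL2_LIMIT = 10

# In the submit description for model 1's ~1000 submodels:
concurrency_limits = model1
```

With one limit name per high-level model, at most 10 of each model's jobs should run concurrently, so 10 submitted models would fill ~100 slots.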

Thanks for your time and thoughts,