Re: [Condor-users] Request for Ideas/Plans: Designing a Large Condor Pool

This is exactly the type of information I am looking for. I knew that other groups were already doing what we need to do. Thank you for Gabi's slides. He was one of the developers with whom I spoke at length about having sample layouts, and he seemed to think that it would be a good idea.

Please keep the diagrams/explanations coming.


Michael Hess wrote:

A good starting point is this PPT:

Administrators Tutorials: Tips for Deploying Large Pools


You might also want to tune your Linux machines for scalability (the submitters and the master):

condor site for Linux Scalability


Condor and High Availability is described here


Here at the University of Plymouth we are currently running around 1400 nodes with Condor, and we aim to scale this up to more than 4000 as soon as it is running stably. We use three submitters of different specs (with 30,000 to 150,000 jobs in the queues) and one central manager, which does not submit anything. We are also about to set up a portal that will handle submission and distribute the jobs to the submitters. We use a shared network drive to store the data and make it accessible to all the submitters (which I think is a good thing in general).

I would really recommend having more than one submitter; it scales much better. Please keep in mind that the condor_schedd (which launches the shadow processes etc.) is a single-threaded program, so it can only use one CPU (the shadow processes use all CPUs). Also, every running job has one shadow process (which consumes around 1 MB of RAM), so 1000 running jobs use a lot of RAM (more than 1 GB for the shadows alone). In general, the submitter machines need a lot of RAM (2 GB is working fine for us).

You might also want to tune the delay_shadow parameter a bit, as starting a shadow every 2 s takes a long time (500 shadows = 1000 s = ~16.67 min); we have decreased it to 0.1 and this is working well for us.

If you want more tips, or if you run into problems (Condor thinks it is running more jobs than it has running shadows, jobs disappear into hyperspace), drop me an email or ask on the list.
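
To put rough numbers on the sizing above, here is a small Python sketch of the same arithmetic. It is only a back-of-the-envelope estimate, not anything Condor provides: the function names are made up for illustration, the figure of about 1 MB per shadow is the one quoted above, and it assumes that shadows are started strictly one after another with a fixed delay given in seconds.

```python
# Back-of-the-envelope estimates for a single submitter machine.
# Assumptions (not taken from Condor itself): ~1 MB of RAM per shadow,
# and shadows started one at a time with a fixed delay in seconds.

def shadow_memory_mb(running_jobs, mb_per_shadow=1.0):
    """Approximate RAM used by shadow processes: one shadow per running job."""
    return running_jobs * mb_per_shadow


def shadow_startup_seconds(shadows, delay_seconds):
    """Approximate time to start all shadows when one starts every delay_seconds."""
    return shadows * delay_seconds


if __name__ == "__main__":
    jobs = 1000
    print(f"{jobs} running jobs: ~{shadow_memory_mb(jobs):.0f} MB for shadows")

    for delay in (2.0, 0.1):
        total = shadow_startup_seconds(500, delay)
        print(f"500 shadows at {delay} s each: {total:.0f} s (~{total / 60:.2f} min)")
```

With a 2 s delay this reproduces the 1000 s figure for 500 shadows; with 0.1 the same 500 shadows take about 50 s under these assumptions.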
Best regards,

Michael Hess
PlymGrid Officer
University of Plymouth
Devon, UK
