Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Request for Ideas/Plans: Designing a Large Condor Pool

Date: Thu, 25 May 2006 09:32:28 -0400
From: Jess Cannata <jac67@xxxxxxxxxxxxxx>
Subject: Re: [Condor-users] Request for Ideas/Plans: Designing a Large Condor Pool

This is exactly the type of information I am looking for. I knew that other groups were already doing what we need to do. Thank you for Gabi's slides. He was one of the developers with whom I spoke at length about having sample layouts, and he seemed to think that this would be a good idea.


Please keep the diagrams/explanations coming.

Jess

Michael Hess wrote:

Hi,

a good starting point is this ppt:

Administrators Tutorials: Tips for Deploying Large Pools

http://www.nesc.ac.uk/talks/438/12th/deploying_large_pools.ppt

You might also want to tune Linux for scalability (on the submitters and the master):

Condor site on Linux scalability

http://www.cs.wisc.edu/condor/condorg/linux_scalability.html

Condor and High Availability are described here:

http://dsl.cs.technion.ac.il/projects/gozal/docs/CondorWeek2005_HA_presentation.pdf

Here at the University of Plymouth we are currently running around 1400 nodes with Condor, and we aim to scale this up to more than 4000 as soon as it is running stably. We are using 3 submitters of different specs (with 30,000 to 150,000 jobs in the queues) and one central manager, which does not submit anything. We are also about to set up a portal, which will handle submission and distribute the jobs to the submitters. We are using a shared network drive to store the data and make it accessible to all the submitters (which I think is a good thing in general).

I would really recommend having more than one submitter, as it scales much better. Please bear in mind that the condor_schedd (which launches the shadow processes etc.) is a single-threaded program, so it can only use one CPU (the shadow processes use all CPUs). Also, for every running job you will have one shadow process running (each consumes around 1 MB of RAM), so 1000 running jobs use a lot of RAM (more than 1 GB for the shadows alone). In general, the submitter machines need a lot of RAM (2 GB is working fine for us).

You might also want to tune the delay_shadow parameter a bit, as starting a shadow every 2 s takes a long time (500 shadows = 1000 s = ~16.7 min). We have decreased it to 0.1, and this is working well for us.

If you want more tips, or if you run into problems (Condor thinks it is running more jobs than it has running shadows, jobs disappear into hyperspace), drop me an email or ask on the list.
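The back-of-the-envelope figures above (one shadow of roughly 1 MB per running job, one shadow started per delay interval) can be sketched as a small calculation. This is only an illustration: the function names are made up for this example and are not part of Condor, and the 1 MB per shadow figure is the approximate value quoted in this message, not a measured constant.

```python
def shadow_memory_mb(running_jobs, mb_per_shadow=1.0):
    """Rough RAM used by shadows on a submitter: one shadow per running job.

    mb_per_shadow defaults to the ~1 MB estimate quoted in the message.
    """
    return running_jobs * mb_per_shadow


def shadow_startup_seconds(shadows, delay_s):
    """Time to launch all shadows if one is started every delay_s seconds."""
    return shadows * delay_s


if __name__ == "__main__":
    # Memory estimate for 1000 running jobs on one submitter.
    print(f"shadow RAM: {shadow_memory_mb(1000):.0f} MB")

    # Startup time for 500 shadows with a 2 s delay and with a 0.1 s delay.
    for delay in (2.0, 0.1):
        secs = shadow_startup_seconds(500, delay)
        print(f"delay {delay} s: {secs:.0f} s ({secs / 60:.1f} min)")
```

With the numbers from the message, a 2 s delay gives 1000 s (about 16.7 minutes) to start 500 shadows, while a 0.1 s delay gives about 50 s. The memory estimate ignores everything else running on the submitter, so it is a lower bound at best.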
Best regards,

Michael Hess
PlymGrid Officer
University of Plymouth
Devon, UK