
Re: [Condor-users] Bulk submission of Jobs



I think Dan Bradley would be the best person to comment on the scalability tests in Condor. From what I know he has made some dramatic improvements to performance in the Schedd in the 6.9 series. He has specifically addressed things like holding large quantities of jobs.

Best,


matt

On Jul 3, 2007, at 12:39 AM, Ian Gregory wrote:


Matt, it would be nice to know what the Condor team has tested.
Can you point me in the right direction?
I can tell you straight away there are many scalability issues with Condor,
and these are being tackled.  Here is what I am doing:

I set up 500+ nodes at Sydney Uni with 1 master and not many
submit machines.
Since the queuing runs a for loop over the jobs (on the master), a large number of jobs soaks up CPU. Submitting lots of jobs, say >10,000,
gave me plenty of problems.

The trick is to have as many submit nodes as possible and smaller job queues. This means the submit machine does the job queue processing and takes the burden off the master (on large grids, don't make the master a submit machine). I don't know the best maximum size for a job queue; does anyone have suggestions? I can tell you 3000 jobs with 7000 on hold is OK on 1 PC (PCs assumed to be
dual-core 3.4GHz with >1 GB RAM).

NOTE: To put the 7000 on hold after submitting 10,000, you need to do this straight away, i.e. have 2 different windows open, one submitting and another with the following already
written, and as soon as the submission has completed hit enter.

condor_hold -constraint "ClusterId==XXXX && ProcId>3000"

NOTE: This line may give you problems if you are not quick enough, since Condor does
have scaling issues.  So try:

condor_hold -constraint "ClusterId==XXXX && ProcId>3000 && ProcId<6000"

and keep increasing the range up to 10,000.
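
If you would rather not type each range by hand, here is a rough sketch of the same batching in Python. The function names, the cluster id 1234, and the batch size of 3000 are placeholders of mine, not anything Condor provides, and I have used half-open ranges (ProcId>=lo && ProcId<hi) so that no job falls between two batches; adjust the bounds if you want the exact ranges from the commands above. It only assumes condor_hold is on the PATH of the submit machine.

```python
import subprocess


def hold_constraints(cluster_id, first_held, total_jobs, step):
    """Yield one constraint string per batch.

    Together the batches cover ProcId first_held .. total_jobs - 1
    (ProcIds in a cluster start at 0).
    """
    for lo in range(first_held, total_jobs, step):
        hi = min(lo + step, total_jobs)
        yield f"ClusterId=={cluster_id} && ProcId>={lo} && ProcId<{hi}"


def hold_in_batches(cluster_id, first_held, total_jobs, step, dry_run=True):
    """Run (or, with dry_run, just print) one condor_hold per batch."""
    for constraint in hold_constraints(cluster_id, first_held, total_jobs, step):
        cmd = ["condor_hold", "-constraint", constraint]
        if dry_run:
            # Show the command that would be run, quoted as typed in a shell.
            print(f'condor_hold -constraint "{constraint}"')
        else:
            # check=True stops at the first batch that condor_hold rejects.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder values: cluster 1234, keep ProcId 0-2999 running,
    # hold the rest of 10,000 jobs in batches of 3000.
    hold_in_batches(1234, first_held=3000, total_jobs=10000, step=3000)
```

Run it with dry_run=True first to check the constraints, then switch to dry_run=False once the cluster id is filled in.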


Ian Gregory
Sydney University


	>-----Original Message-----
	>From: Matthew Farrellee <matt@xxxxxxxxxxx>
	>Sent: 07/03/07 - 14:17
	>To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
	>Subject: Re: [Condor-users] Bulk submission of Jobs
	>
	>Esh,
	>
	>I've never tried to scale up that high. There shouldn't be any
	>inherent limitations on the number of jobs, but 50K is getting up
	>there. I know people on the Condor Team have tested more than that
	>with condor_submit. If you can send me a small program that does your
	>submission I can try running it against a Schedd I have and maybe see
	>what's going on.
	>
	>Are you seeing anything in the Schedd log around the time of the
	>32600th job?
	>
	>Best,
	>
	>
	>matt
	>
	>On Jul 2, 2007, at 8:33 PM, Esh Esh wrote:
	>
	>> Hi,
	>>
	>> I am trying to submit 50000 jobs to a Condor system (Condor 6.8.2) in
	>> a loop using web services. Every time I run this program I get an error
	>> after submitting 32600 jobs.
	>>
	>> Stack Trace gives:
	>>
	>> "java.net.connectionexception : Connection Refused"
	>>
	>> Has anybody faced this problem before?
	>> Is this specific to Condor 6.8.2? Or is there any limit on the
	>> number of jobs that can be submitted?
	>>
	>> Thanks in Advance,
	>> -Esh.
	>>
	>>
	>> _______________________________________________
	>> Condor-users mailing list
	>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
	>> with a
	>> subject: Unsubscribe
	>> You can also unsubscribe by visiting
	>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
	>>
	>> The archives can be found at:
	>> https://lists.cs.wisc.edu/archive/condor-users/