
Re: [HTCondor-users] Very high throughput computing



We have 15 VM submit nodes for our windows users to make use of.

 

They are configured with 4 cores and 16GB RAM, running Windows Server 2008.

They happily handle MAX_JOBS_RUNNING=2000 (we have ~10,000 cores available in total).

All are currently running Condor 7.6.7.

 

A few caveats:

 

We get our users to use a batch file as the executable. This is then responsible for copying the “real” executable (and any auxiliary files required, e.g. with compiled MATLAB or Python) plus the input data files. It then copies the output data files back to the file server. This greatly reduces the load on the submit machine.
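A minimal sketch of such a wrapper (the server, share, and program names are invented for the example, not our real paths):

    @echo off
    rem The submit file passes a run number as the first argument (%1)
    rem Stage the "real" executable and its input from the file server
    copy \\fileserver\apps\myapp.exe .
    copy \\fileserver\data\input_%1.dat input.dat
    rem Do the actual work locally on the execute node
    myapp.exe input.dat output.dat
    rem Copy the results straight back to the file server, bypassing the submit machine
    copy output.dat \\fileserver\data\results\output_%1.dat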

 

We set MAX_CONCURRENT_UPLOADS and MAX_CONCURRENT_DOWNLOADS to 0 (zero = unlimited). This improves overall throughput and reduces delays when transmitting the *.err, *.out, and *.log files.
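In condor_config terms the knobs mentioned above are roughly:

    # schedd settings on each submit node
    MAX_JOBS_RUNNING = 2000
    # 0 = unlimited concurrent file transfers
    MAX_CONCURRENT_UPLOADS = 0
    MAX_CONCURRENT_DOWNLOADS = 0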

 

Turn off email notification.
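That is one line in each submit file:

    notification = Never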

 

Set the desktop heap size for non-interactive desktops to 1240 in HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\SubSystems\Windows
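The 1240 goes into the third field of the SharedSection entry in that key's Windows value; the other two numbers below are the usual Server 2008 defaults and may differ on your systems:

    SharedSection=1024,20480,1240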

 

Users can submit jobs from multiple submit VMs to increase overall throughput.

_________________________________________________________________________

 

We are testing/tweaking a test submit node to see if we can increase the number of concurrently running jobs.

 

It is a Windows Server 2008 VM with 8 cores and 32GB RAM. We are producing graphs over time of job info from condor_q (jobs, idle, running) and condor_status (using the -run option, querying our 5 different pools as well as the condorview server that they all report to), plus the number of condor_shadow processes on the submit node (collected by a scheduled task running the tasklist command). Some interesting info so far, but we are struggling to get better throughput than with MAX_JOBS_RUNNING=2000. Still a work in progress though.
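The shadow count itself is just a one-liner run by the scheduled task, something like:

    tasklist /FI "IMAGENAME eq condor_shadow.exe" | find /C "condor_shadow.exe"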

 

Cheers

 

Greg

 

P.S. Of course, ignore this if you’re in the Linux space and not using Windows! :-)

 

From: htcondor-users-bounces@xxxxxxxxxxx [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John Lambert
Sent: Wednesday, 20 March 2013 4:10 AM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Very high throughput computing

 

Steve-

 

As I understand it, the submit process is also pretty I/O bound. You might try SSDs for the spool directory, and you might even try RAID 0-ing them. (This is on our to-try list, so your mileage will definitely vary.)

 

Given your current specs, though, I think there might be some software options you could change. Are you using TCP updates? Sometimes that can reduce performance. Any other daemons running on the machine and taking up cycles can also adversely affect it. It's always best to have a dedicated submit node.

 

John Lambert

 

On Tue, Mar 19, 2013 at 12:13 PM, Dan Bradley <dan@xxxxxxxxxxxx> wrote:

Steve,

Most scaling issues of this sort can be addressed by adding more submit nodes.  However, in many situations, a single submit node can handle 2000-3000 running jobs without sweating, so some investigation into your case may be worthwhile.

Windows or Linux?  Scaling a submit node under Linux is much better supported.

What version of HTCondor?

Does the machine grind to a halt due to thrashing of the swap device, i.e. is it out of memory?  On a 64-bit machine running HTCondor 7.8, I'd expect each running job to require about 1.5MB on the submit machine.  16GB of RAM should therefore be enough at your scale, but perhaps other things are eating some of the memory.
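(Roughly: 3000 running jobs x 1.5MB per shadow is only about 4.5GB.)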

How long do individual jobs typically take to complete?  Job completion rates > ~20 Hz on a single submit node are possible, but may require some attention to details, such as the ephemeral port range.
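On Linux, for instance, widening the ephemeral port range is a one-line sysctl change (example values only):

    # check the current range, then widen it
    sysctl net.ipv4.ip_local_port_range
    sysctl -w net.ipv4.ip_local_port_range="10000 65535"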

--Dan

 

On 3/19/13 10:35 AM, Rochford, Steve wrote:

We have a user who is submitting a lot of jobs to our condor system. He’s hitting some limits and I want to work out how we can help.

 

He would like to be able to have 2000-3000 jobs running simultaneously – we have enough nodes to cope with this – but actually submitting them is causing problems.

 

Essentially his job runs the same program with slightly different parameters each time, so he has a submit file with (e.g.) queue 500 at the end.
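Something along these lines, say, where $(Process) runs from 0 to 499 (program and file names invented for the example):

    universe   = vanilla
    executable = myprog.exe
    arguments  = --param $(Process)
    output     = out.$(Process).txt
    error      = err.$(Process).txt
    log        = jobs.log
    queue 500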

 

He can submit about 500 jobs simultaneously and everything works, but if he tries to submit more than that his machine grinds to a halt – presumably the overhead of communicating with all the nodes is too much (the machine has 16GB RAM and a reasonably decent CPU).

 

If I give him (say) another 6 machines set up as submit nodes, will this work, or will we hit other bottlenecks? (Or is this too vague a question?)

 

Thanks

 

Steve

 

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
 
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

 

