
Re: [HTCondor-users] Very high throughput computing



Sorry; forgot some key bits of info!

 

This is Windows 7 x64, running Condor v7.8.2.

 

Not sure of exact completion times, but they're on the order of tens of minutes per job.

 

I’ll try to get some more info from him. This is the first time we’ve had a user submit more than one job at a time, so it’s all new (but exciting!) territory.

 

Thanks for all the advice so far.

 

Steve

 

From: htcondor-users-bounces@xxxxxxxxxxx [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
Sent: 19 March 2013 16:13
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Very high throughput computing

 

Steve,

Most scaling issues of this sort can be addressed by adding more submit nodes.  However, in many situations, a single submit node can handle 2000-3000 running jobs without sweating, so some investigation into your case may be worthwhile.
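
One thing worth ruling out first: the schedd itself caps the number of concurrently running jobs. A minimal sketch of how to check and raise that cap (MAX_JOBS_RUNNING, condor_config_val and condor_reconfig are standard; the value shown is just an illustration, not a tuned recommendation):

    # Ask Condor what the current cap on running jobs is:
    condor_config_val MAX_JOBS_RUNNING

    # To raise it, add a line like this to the submit node's
    # configuration and run condor_reconfig:
    MAX_JOBS_RUNNING = 3000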

Windows or Linux?  Scaling a submit node under Linux is much better supported.

What version of HTCondor?

Does the machine grind to a halt due to thrashing of the swap device, i.e., is it out of memory?  On a 64-bit machine running HTCondor 7.8, I'd expect each running job to require about 1.5MB on the submit machine.  16GB of RAM should therefore be enough at your scale, but perhaps other things are eating some of the memory.
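
As a rough back-of-the-envelope check, assuming that 1.5MB-per-job figure holds at your target scale:

    3000 running jobs x 1.5 MB/job = 4500 MB, i.e. roughly 4.5 GB

which is comfortably under 16GB, so shadow memory alone shouldn't be the wall.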

How long do individual jobs typically take to complete?  Job completion rates > ~20 Hz on a single submit node are possible, but may require some attention to details, such as the ephemeral port range.
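
For reference, widening the ephemeral port range is an OS-level change rather than a Condor setting. A sketch for both platforms (the ranges shown are examples only, not tuned recommendations):

    # Linux: widen the local (ephemeral) port range
    sysctl -w net.ipv4.ip_local_port_range="10000 65535"

    # Windows Vista/7 and later: enlarge the dynamic TCP port range
    netsh int ipv4 set dynamicport tcp start=10000 num=55535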

--Dan

On 3/19/13 10:35 AM, Rochford, Steve wrote:

We have a user who is submitting a lot of jobs to our Condor system. He’s hitting some limits, and I want to work out how we can help.

 

He would like to be able to have 2000-3000 jobs running simultaneously – we have enough nodes to cope with this – but actually submitting them is causing problems.

 

Essentially, each job runs the same program with slightly different parameters, so he has a submit file with (e.g.) "queue 500" at the end.
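
For concreteness, the submit file is shaped roughly like this (the file names here are made up; the structure, and the single "queue 500" at the end, are the point):

    universe   = vanilla
    executable = analyse.exe
    arguments  = --param $(Process)
    output     = run$(Process).out
    error      = run$(Process).err
    log        = run.log
    queue 500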

 

He can submit about 500 jobs simultaneously and everything works, but if he tries to submit more than that, his machine grinds to a halt; presumably the overhead of communicating with all the nodes is too much (the machine has 16GB RAM and a reasonably decent CPU).

 

If I give him (say) another six machines set up as submit nodes, will this work, or will we hit other bottlenecks? (Or is this too vague a question?)

 

Thanks

 

Steve



