
Re: [HTCondor-users] Very high throughput computing



Steve-

As I understand it, the submit process is also fairly I/O-bound. You might try putting the spool directory on SSDs, and you might even try striping them with RAID 0. (This is still on our to-try list, so your mileage will definitely vary.)
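For what it's worth, the spool location is just a configuration knob on the submit machine, so experimenting with this is cheap. The path below is only a placeholder:

  # condor_config on the submit node -- illustrative path, not a recommendation
  SPOOL = /mnt/ssd0/condor/spool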

Given your current specs, though, I think there are some software settings you could change. Are you using TCP for daemon communication? Sometimes that can reduce performance. Any other daemons running on the machine and taking up cycles can also hurt things. It's always best to have a dedicated submit node.
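If that refers to the updates the daemons send to the collector, the knob involved looks roughly like the line below; whether TCP or UDP works better is site-specific, so treat it as something to test rather than as advice:

  # condor_config -- illustrative only: send collector updates over UDP rather than TCP
  UPDATE_COLLECTOR_WITH_TCP = False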

John Lambert


On Tue, Mar 19, 2013 at 12:13 PM, Dan Bradley <dan@xxxxxxxxxxxx> wrote:
Steve,

Most scaling issues of this sort can be addressed by adding more submit nodes.  However, in many situations, a single submit node can handle 2000-3000 running jobs without breaking a sweat, so some investigation into your case may be worthwhile.

Windows or Linux?  Scaling a submit node under Linux is much better supported.

What version of HTCondor?

Does the machine grind to a halt due to thrashing of the swap device?  I.e., is it out of memory?  On a 64-bit machine running HTCondor 7.8, I'd expect each running job to require about 1.5MB on the submit machine.  16GB of RAM should therefore be enough at your scale, but perhaps other things are eating some of the memory.
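As a rough back-of-the-envelope check against those numbers:

  3000 running jobs x ~1.5 MB per job ≈ 4.5 GB on the submit machine, well under 16GB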

How long do individual jobs typically take to complete?  Job completion rates > ~20 Hz on a single submit node are possible, but may require some attention to details, such as the ephemeral port range.
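As one illustration of the kind of detail meant here: the kernel's ephemeral port range (net.ipv4.ip_local_port_range on Linux) may need widening at high completion rates, and if you constrain the ports HTCondor itself uses, the range needs to be generous. The values below are placeholders, not a recommendation:

  # condor_config on the submit node -- only relevant if you restrict HTCondor's ports at all;
  # keep the range wide enough for many short-lived connections
  LOWPORT  = 10000
  HIGHPORT = 64999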

--Dan


On 3/19/13 10:35 AM, Rochford, Steve wrote:

We have a user who is submitting a lot of jobs to our condor system. He’s hitting some limits and I want to work out how we can help.

 

He would like to be able to have 2000-3000 jobs running simultaneously – we have enough nodes to cope with this – but actually submitting them is causing problems.

 

Essentially his job runs the same program each time but with slightly different parameters, so he has a submit file with (e.g.) queue 500 at the end.
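A submit file of that general shape might look something like the sketch below; the program name and the way the parameter is varied are made up purely for illustration ($(Process) counts 0-499 here and is one common way to change an argument per job):

  universe    = vanilla
  executable  = my_program              # placeholder name
  arguments   = --param $(Process)      # hypothetical per-job parameter
  output      = out.$(Process)
  error       = err.$(Process)
  log         = jobs.log
  queue 500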

 

He can submit about 500 jobs simultaneously and everything works, but if he tries to submit more than that, his machine grinds to a halt – presumably the overhead of communicating with all the nodes is too much (the machine has 16GB RAM and a reasonably decent CPU).

 

If I give him (say) another six machines set up as submit nodes, will this work, or will we hit other bottlenecks? (Or is this too vague a question?)

 

Thanks

 

Steve



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

