Re: [HTCondor-users] Slow Performance



On 4/27/2014 10:22 AM, Dennis Zheleznyak wrote:
My question may not be connected directly to Condor but I'd like to know if
anyone encountered the same issue as me.

I bought a Dell 720xd server with 2x Intel E5-2660 v2 CPUs, 256GB DDR3
and 40TB of storage in a RAID6 array. With HyperThreading it has 40
cores. It runs Windows Server 2012.

My program isn't built with MPI capabilities; it calculates data from an
input file and outputs to a file once it is done. The program was compiled
with MATLAB.

Normally I have 150 sets of data to be calculated. When I send them to Condor,
40 jobs start and that's great. The problem is that it takes forever to
finish even one simple little job! The CPU is constantly working at 100%,
the memory barely gets to 10% and there is no notable IO on the disks that
I can mention.

Before I bought the server, I had 4 computers with i7 4770K Haswell and
16GB of memory - the jobs literally flew when I sent them to my Condor pool!

I don't know what to check or do - if anyone has any idea I would
appreciate it.

Thank you,
Dennis.


A few random first-thought suggestions:

1. Are you compiling and running with the -singleCompThread command-line argument to MATLAB? From how you have things set up above, you will want -singleCompThread so that MATLAB only uses a single core; otherwise MATLAB will start up and each of your 40 jobs will try to use all 40 cores! Even if this was happening with your old machines, the issue will become much more pronounced on a machine with more cores. See
  https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToRunMatlab
for this and other tips.


2. I would suggest a quick experiment - try running without hyperthreading and see if that improves things. Even if it doesn't, at least you eliminated a possible issue. To do so, in the condor_config.local for that machine set
  COUNT_HYPERTHREAD_CPUS = False
and then restart HTCondor. Specifically, you just need to restart the condor_startd, so you could do
  condor_restart -startd <machine-name>
from your central manager. When HTCondor restarts, you will see fewer slots, as HTCondor will only count physical cores, not hyperthreaded ones. Resubmit your jobs and see what happens.

3. Another suggestion - if it is easy, what happens if you start 40 runs of your job simultaneously outside of HTCondor? We expect things will be equally slow outside of HTCondor, but it would be a nice data point to confirm this.
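For the experiment in point 3, a minimal Python sketch along these lines could start several copies of the job at once and time them. The helper name time_concurrent_runs and the placeholder command are assumptions for illustration only; substitute your real compiled executable and its arguments.

```python
import subprocess
import sys
import time


def time_concurrent_runs(command, copies):
    """Start `copies` instances of `command` at once and wait for all of them.

    Returns a tuple (wall_seconds, exit_codes).
    """
    start = time.perf_counter()
    # Launch every copy before waiting on any, so the runs overlap.
    procs = [subprocess.Popen(command) for _ in range(copies)]
    exit_codes = [p.wait() for p in procs]
    return time.perf_counter() - start, exit_codes


if __name__ == "__main__":
    # Hypothetical placeholder command: replace with the real executable
    # and its input/output arguments.
    command = [sys.executable, "-c", "sum(range(10**6))"]
    # 1 copy as a baseline, 20 to match physical cores, 40 to match
    # the hyperthreaded core count from the original message.
    for copies in (1, 20, 40):
        wall, codes = time_concurrent_runs(command, copies)
        failures = sum(code != 0 for code in codes)
        print(f"{copies:>2} copies: {wall:.2f} s wall, {failures} failed")
```

If one copy finishes quickly but 40 simultaneous copies each take far longer, that would point to contention on the machine itself (threads or hyperthreading) rather than to HTCondor.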

regards,
Todd