
Re: [Condor-users] PROBLEMS WITH HYPER THREADING



> Thanks for your answer! My problem is that the executable 
> must do 1.800 runs and the submit file has queue = 40.

I have to admit: I have no idea what that means! :)

> I tried to run the job with 3 runs instead of 1.800 just to obtain 
> some results faster, and if I specified just vm1 or vm2 the 
> execution time was (approx.) double the execution time when I 
> didn't specify vm1 or vm2.

If you limit your run to one slot or the other, and you're the only user
on the system, you're spreading your jobs across your machines such that,
*if* they fork multiple threads, those threads have mostly free CPUs to
run on. If you don't limit your jobs to just one slot or the other, and
Condor schedules multiple multi-threaded jobs on one machine, your
throughput suffers as the kernel context switches between all the
threads.

For example, let's say you have two machines, each configured with 2 slots
and each having 4 CPUs (either physical or by way of hyperthreading).
And let's say each of your jobs forks four worker threads to do its work:
it spawns the threads and sleeps until they've finished some
computationally intensive task.

--> If you submit 4 jobs but limit them to run only in slot 1, then every
time a job gets assigned to a machine it spawns four threads, and those
threads get nearly exclusive access to the 4 CPUs on the machine, because
Condor won't run more than 1 job on each of the two machines in your
system. 2 jobs run in parallel, one on each machine, and then the next
two jobs run in parallel, one on each machine, when the first two
complete.

--> If you submit the same four jobs but allow them to run anywhere, then
Condor will schedule all four jobs to run in parallel: two on one
machine in slots 1 and 2, two more on the other machine in slots 1 and 2.
Your jobs will each spawn 4 threads, making the number of threads on each
machine twice as high as the number of CPUs on that machine. So now the
kernel has to time slice and share the 4 CPUs between 8 threads. Depending
on what you're doing, context switching can take non-negligible amounts of
time (especially if each thread uses a large amount of RAM relative to
what your machines have). So while you may have 4 jobs running in
parallel now, you won't see a 2x speedup in throughput because of all
that extra load on your machines.
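
To put numbers on the two cases above, here is a rough Python sketch. It
is only back-of-the-envelope arithmetic, not anything Condor itself
provides: the function name `describe` and its parameters are made up
for illustration, and it assumes jobs are spread evenly across machines
and that every thread is CPU-bound.

```python
import math

def describe(jobs, machines, cpus_per_machine, threads_per_job, slots_used):
    """Return (rounds needed, runnable threads per CPU on the busiest machine).

    Hypothetical helper: assumes jobs are spread evenly over the machines
    and that every thread wants a CPU for the whole run.
    """
    # How many jobs can run at the same time across the whole pool.
    concurrent = min(jobs, machines * slots_used)
    # How many rounds it takes to get through all the jobs.
    rounds = math.ceil(jobs / concurrent)
    # Jobs running together on the busiest machine.
    jobs_per_machine = math.ceil(concurrent / machines)
    # Above 1.0 means the kernel has to time slice between threads.
    threads_per_cpu = jobs_per_machine * threads_per_job / cpus_per_machine
    return rounds, threads_per_cpu

# 4 jobs, 2 machines, 4 CPUs each, 4 threads per job.
slot1_only = describe(4, 2, 4, 4, slots_used=1)  # jobs limited to slot 1
any_slot = describe(4, 2, 4, 4, slots_used=2)    # jobs allowed in both slots

print("slot 1 only:", slot1_only)
print("any slot:   ", any_slot)
```

With the numbers from the example, the slot-1-only case needs two rounds
with one thread per CPU, while the run-anywhere case finishes in one
round but with two threads competing for every CPU.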

Does that make sense?

> Now I must obtain results with 1.800 runs, and the times are 
> the same, more or less.  What can happen? Please and Thanks!
> Paula

I'll confess I don't really understand what you're trying to do or what
1.8 vs. 3.0 means (that's not Condor-specific stuff). But all I can say
is: avoid overloading your machines. Make sure you always have enough
CPUs to handle all the threads your jobs spawn.
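
As a quick sanity check on that rule of thumb, something like the
following Python snippet works out how many jobs a machine can take
before its threads outnumber its CPUs. The helper name
`max_jobs_per_machine` is hypothetical, and it assumes every job spawns
the same number of CPU-bound threads.

```python
def max_jobs_per_machine(cpus, threads_per_job):
    """Largest number of jobs whose threads all fit on the CPUs at once.

    Hypothetical helper: assumes each job spawns exactly threads_per_job
    CPU-bound threads.
    """
    if cpus < 0 or threads_per_job < 1:
        raise ValueError("cpus must be >= 0 and threads_per_job >= 1")
    # Integer division: a job only fits if all of its threads get a CPU.
    return cpus // threads_per_job

# The example machines: 4 CPUs, 4 threads per job.
fits = max_jobs_per_machine(4, 4)
print("jobs per machine without oversubscribing:", fits)
```

If the result is smaller than the number of slots configured on the
machine, letting jobs run in every slot will oversubscribe the CPUs.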

- Ian

