
[Condor-users] PROBLEMS WITH HYPER THREADING



Hi Ian, thank you so much! Yes, it makes sense, but I wasn't clear.
I have a cluster of 12 P4 HT 3 GHz machines. I submit a job that has queue
40, and each time it must make 1,800 runs (it's a genetic algorithm). It is
not parallelized.
When I submit my job without constraints it takes 2 hrs; then I say "just
VM1" and it takes the same time.
On the machines I have put (is this config OK?):

# advertise 2 CPUs per node (the two hyperthreaded logical processors of the P4)
NUM_CPUS = 2
# two virtual machines (slots), both of type 1
NUM_VIRTUAL_MACHINES_TYPE_1 = 2
NUM_VIRTUAL_MACHINES = 2
# treat both VMs as connected to the console and keyboard
VIRTUAL_MACHINES_CONNECTED_TO_CONSOLE = 2
VIRTUAL_MACHINES_CONNECTED_TO_KEYBOARD = 2
# count hyperthreaded logical processors when detecting CPUs
COUNT_HYPERTHREAD_CPUS = TRUE
Then I tried to run with queue 40, but with each job doing 3 runs (not
1,800), just to obtain results faster.
In this scenario, if I submit the job without constraints, it takes half
the time that it takes when I put just vm1, for example.
The executable is the same, so I don't know why it is behaving like that.
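
In case it helps, the submit file I use when I restrict the jobs looks
roughly like this (simplified: the executable name is just a placeholder,
and the requirements line is how I understand one pins a job to vm1 --
please correct me if that attribute is wrong):

universe     = vanilla
executable   = genetic_alg
# run only on the first virtual machine (slot) of each node
requirements = (VirtualMachineID == 1)
output       = run_$(Process).out
error        = run_$(Process).err
log          = runs.log
queue 40

For the unconstrained runs I submit the same file without the requirements
line.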


>> Thanks for your answer! My problem is that the executable
>> must do 1.800 runs and the submit file has queue = 40.
>
> I have to admit: I have no idea what that means! :)
>
>> I tried to run the job with 3 runs instead of 1.800, just to obtain
>> some results faster, and if I put just vm1 or vm2 the
>> execution time was roughly double the execution time when I
>> didn't specify vm1 or vm2.
>
> If you limit your run to one slot or the other, and you're the only user
> in the system, you're spreading your jobs across your machines such that,
> *if* they fork multiple threads, those threads have mostly free CPUs to
> run on. If you don't limit your jobs to just one slot or the other, and
> Condor schedules multiple multi-threaded jobs on one machine, your
> throughput suffers as the kernel context-switches between all the
> threads.
>
> For example, let's say you have two machines, each configured with 2 slots
> and each having 4 CPUs (either physical or by way of hyperthreading).
> And let's say each of your jobs forks four worker threads to do its work:
> it spawns the threads and sleeps until they're done with some
> computationally intensive task.
>
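> (Concretely, each of those hypothetical machines might carry a config
> along these lines -- just a sketch of the setup in this example, not a
> recommendation for your pool:
>
> NUM_CPUS = 4
> NUM_VIRTUAL_MACHINES = 2
>
> i.e. the startd sees 4 CPUs but only advertises 2 slots, so each slot
> still has CPUs to spare for a job's extra threads.)
>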
> --> If you submit 4 jobs but limit them to run only in slot 1, then every
> time a job gets assigned to a machine it spawns four threads, and each
> thread gets access, nearly exclusively, to the 4 CPUs on the machine,
> because Condor won't run more than 1 job on each of the two machines in
> your system. 2 jobs run in parallel, one on each machine, and then the
> next two jobs run in parallel, one on each machine, when the first two
> complete.
>
> --> If you submit the same four jobs but allow them to run anywhere,
> Condor will schedule all four jobs to run in parallel: two on one machine
> in slots 1 and 2, and two more on the other machine in slots 1 and 2.
> Your jobs will each spawn 4 threads, making the number of threads spawned
> twice the number of CPUs on the machine. So now the kernel has to
> time-slice the 4 CPUs between 8 threads. Depending on what you're doing,
> context switching can take a non-negligible amount of time (especially if
> each thread uses a large amount of RAM relative to what your machines
> have). So while you may have 4 jobs running in parallel now, you won't
> see a 2x speedup in throughput because of all that extra load on your
> machines.
>
> Does that make sense?
>
>> Now I must obtain results with 1.800 runs, and the times are
>> the same, more or less.  What can happen? Please and thanks!
>> Paula
>
> I'll confess I don't really understand what you're trying to do or what
> the 1.800 vs. 3 runs thing means (but that's not Condor-specific stuff).
> All I can say is: avoid overloading your machines. Make sure you always
> have enough CPUs to handle all the threads your jobs spawn.
>
> - Ian


Ing. Paula Martínez
ITU - Redes y Telecomunicaciones
