
Re: [Condor-users] PROBLEMS WITH HYPER THREADING



> Hi Ian, thank you so much! Yes, it makes sense, but I wasn't clear.
> I have a cluster with 12 P4 HT 3 GHz machines. I submit a job that
> has queue 40, and each time it must make 1,800 runs (it's a
> genetic algorithm). It is not parallelized.

Ahh! That clears things up, thanks.

> When I submit my job it takes 2 hrs without constraints; then
> I say "just VM1" and it takes the same time.
> On the machines I've put (is this config OK?):
> 
> NUM_CPUS = 2
> NUM_VIRTUAL_MACHINES_TYPE_1= 2

Are you defining custom virtual machine types? If you're not creating
custom virtual machine types you can comment out the line above
(NUM_VIRTUAL_MACHINES_TYPE_1).

> NUM_VIRTUAL_MACHINES = 2
> VIRTUAL_MACHINES_CONNECTED_TO_CONSOLE = 2 
> VIRTUAL_MACHINES_CONNECTED_TO_KEYBOARD = 2 
> COUNT_HYPERTHREAD_CPUS = TRUE

It won't matter what you set COUNT_HYPERTHREAD_CPUS to, because you're
overriding Condor's "take a guess at how many CPUs there are" mechanism
when you explicitly set NUM_CPUS and NUM_VIRTUAL_MACHINES.
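
If you'd rather let Condor figure out the CPU count itself, something
like this should do it (just a sketch, assuming you don't actually need
custom virtual machine types):

    # Count the hyperthreaded logical CPUs; on a P4 HT this gives 2
    COUNT_HYPERTHREAD_CPUS = TRUE
    # Don't override the auto-detection:
    # NUM_CPUS = 2
    # NUM_VIRTUAL_MACHINES = 2
    # NUM_VIRTUAL_MACHINES_TYPE_1 = 2

Either way (explicit NUM_CPUS = 2, or auto-detection with
COUNT_HYPERTHREAD_CPUS = TRUE) you should end up with two slots per
machine; just don't expect COUNT_HYPERTHREAD_CPUS to have any effect
while NUM_CPUS is set.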

> Then I tried to run with queue 
> 40, but each job with 3 runs(not 1,800) just to obtain results faster.

Got it. I understand what you're doing now. A job lands on a machine and
does 3 iterations of your algorithm in serial. Right?

> In this scenario, if I submit the job without constraints it
> takes half the time compared to when I put just vm1, for example.

This makes sense. If you have 12 machines, each with two slots, and you
constrain your jobs to only run on slot 1, you'll run 12 instances of
your algorithm in parallel. But if you omit the slot constraint you'll
run 24 instances of your algorithm in parallel, so your jobs will run
roughly twice as fast (assuming all things are equal w.r.t. your jobs).
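
For reference, I'm assuming your "just VM1" constraint looks something
like this in the submit description file (a sketch; the executable name
is made up):

    universe     = vanilla
    executable   = genetic_alg
    # restrict the job to slot 1 ("vm1") on each machine
    requirements = (VirtualMachineID == 1)
    queue 40

Dropping that requirements line lets the jobs land on vm1 and vm2 alike,
which is where the 2x speedup comes from.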

But: this is different from what you wrote in the first paragraph of
this post. There you made it sound like, with or without a slot
constraint, the same number of jobs, running the same number of
algorithm iterations per job, takes the same amount of time. Is that
the problem?

> The executable is the same, so I don't know why it is behaving
> like that.

One possibility might be that only half the slots on your machines are
available to run jobs because the other half are in the Owner state.
What does condor_status show for your pool?
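
For example, something like this should list any slots currently
sitting in the Owner state (a sketch; adjust as needed):

    condor_status -constraint 'State == "Owner"'

If that turns up half your slots, the START expression or the
console/keyboard settings you quoted above may be keeping them out of
the pool.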

- Ian

