
Re: [HTCondor-users] multicore and multinode run



When you specify "machine_count = 2", you are asking for your job to
land on any two slots in your condor pool. These two slots could be on
the same physical machine, which is likely what has happened.

If you want to *test* that your job can land on each machine, you can
set requirements per node in your submit file:

universe = parallel
executable = openmpiscript
arguments = mpihello
log = hellompi.log
output = hellompi.out.$(Node)
error = hellompi.err.$(Node)
request_cpus = 1

# set requirements for first execute node
requirements = Machine == "compute-0-0.local"
machine_count = 1
queue

# set requirements for second execute node
requirements = Machine == "compute-0-1.local"
machine_count = 1
queue

This will get you a two-node parallel universe job where one node is
restricted to running on compute-0-0 and the other node is restricted
to running on compute-0-1.
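Note that collapsing the two queue statements into a single combined
requirement, e.g.

requirements = (Machine == "compute-0-0.local") || (Machine == "compute-0-1.local")
machine_count = 2
queue

would *not* guarantee the spread: both slots could still match on the
same machine. Splitting the nodes into separate queue statements is
what pins one node to each machine.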

Jason

On Fri, Jan 26, 2018 at 10:26 AM, Mahmood Naderan <nt_mahmood@xxxxxxxxx> wrote:
> OK, I understand what you say, but in practice I see something else that I
> cannot figure out. I followed the examples in the manual.
> This time, I requested 2 machines and want to allocate 1 core from each of
> them. But again, only compute-0-1 responds.
>
>
>
> [mahmood@rocks7 mpi]$ cat mpi.ht
> universe = parallel
> executable = openmpiscript
> arguments = mpihello
> log = hellompi.log
> output = hellompi.out.$(Node)
> error = hellompi.err.$(Node)
> machine_count = 2
> request_cpus = 1
> queue
> [mahmood@rocks7 mpi]$ cat hellompi.out.0
> Hello world from processor compute-0-1.local, rank 1 out of 2 processors
> Hello world from processor compute-0-1.local, rank 0 out of 2 processors
> [mahmood@rocks7 mpi]$ cat hellompi.out.1
> [mahmood@rocks7 mpi]$ cat hellompi.err.0
> mkdir: cannot create directory '/var/opt/condor/execute/dir_17113/tmp': File
> exists
> [mahmood@rocks7 mpi]$ cat hellompi.err.1
> mkdir: cannot create directory '/var/opt/condor/execute/dir_17114/tmp': File
> exists
> [mahmood@rocks7 mpi]$
>
>
>
> Regards,
> Mahmood
>
>
> On Friday, January 26, 2018, 7:07:02 PM GMT+3:30, Jason Patton
> <jpatton@xxxxxxxxxxx> wrote:
>
>
> Let's take your example with a condor pool containing:
>
> Machine1 with 1 core
> Machine2 with 2 cores
>
> If you submit a parallel universe job with "machine_count = 3" as the
> only requirement, condor will try to schedule three slots in the pool
> with one core each. This will work **if** your pool is configured to
> have one static slot per core (the default) or if you are using
> partitionable slots. However, if your pool is configured with a single
> static slot on each machine (perhaps with each slot containing all of
> the cores), then your job will not match because you will only have
> two slots -- one on Machine1 with one core, one on Machine2 with two
> cores.
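> For example, the single-static-slot setup described above would come
> from execute-node configuration along these lines (a sketch, not a
> recommendation):
>
> # one static slot containing all of the machine's cores
> NUM_SLOTS = 1
> NUM_SLOTS_TYPE_1 = 1
> SLOT_TYPE_1 = cpus=100%
>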
>
> It's difficult to say whether a specific setup will work without
> knowing exactly how your pool is configured.
>
> Based on the condor_status output you've provided in these threads, it
> seems that you have one static slot per core. This means that you can
> only submit jobs that request a single core per slot (or per
> node/machine in the parallel universe), but you can request as many
> nodes (machine_count) as you want up to the number of slots in your
> pool.
>
> If you want to submit jobs that request more than a single core (e.g.
> request_cpus = 2), then you will need to reconfigure your pool to have
> more than one core per slot or consider using partitionable slots.
> Here's the manual for configuring slots (Section 3.5.10):
> https://research.cs.wisc.edu/htcondor/manual/v8.0/3_5Policy_Configuration.html
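>
> If you go the partitionable route, the execute-node configuration
> would look something like this (a sketch; adjust to your pool):
>
> # one partitionable slot spanning the whole machine; jobs carve off
> # as many cores as they request via request_cpus
> NUM_SLOTS = 1
> NUM_SLOTS_TYPE_1 = 1
> SLOT_TYPE_1 = cpus=100%
> SLOT_TYPE_1_PARTITIONABLE = TRUE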
>
> Jason
>