Re: [HTCondor-users] multicore and multinode run



OK, I understand what you are saying, but in practice I see something different that I cannot figure out. I followed the examples in the manual.
This time, I requested 2 machines, wanting 1 core allocated from each of them. But again, only compute-0-1 responds.



[mahmood@rocks7 mpi]$ cat mpi.ht
universe = parallel
executable = openmpiscript
arguments = mpihello
log = hellompi.log
output = hellompi.out.$(Node)
error = hellompi.err.$(Node)
machine_count = 2
request_cpus = 1
queue
[mahmood@rocks7 mpi]$ cat hellompi.out.0
Hello world from processor compute-0-1.local, rank 1 out of 2 processors
Hello world from processor compute-0-1.local, rank 0 out of 2 processors
[mahmood@rocks7 mpi]$ cat hellompi.out.1
[mahmood@rocks7 mpi]$ cat hellompi.err.0
mkdir: cannot create directory '/var/opt/condor/execute/dir_17113/tmp': File exists
[mahmood@rocks7 mpi]$ cat hellompi.err.1
mkdir: cannot create directory '/var/opt/condor/execute/dir_17114/tmp': File exists
[mahmood@rocks7 mpi]$



Regards,
Mahmood


On Friday, January 26, 2018, 7:07:02 PM GMT+3:30, Jason Patton <jpatton@xxxxxxxxxxx> wrote:


Let's take your example with a condor pool containing:

Machine1 with 1 core
Machine2 with 2 cores

If you submit a parallel universe job with "machine_count = 3" as the
only requirement, condor will try to schedule three slots in the pool
with one core each. This will work **if** your pool is configured to
have one static slot per core (the default) or if you are using
partitionable slots. However, if your pool is configured with a single
static slot on each machine (perhaps with each slot containing all of
the cores), then your job will not match because you will only have
two slots -- one on Machine1 with one core, one on Machine2 with two
cores.

It's difficult to say whether a specific example will work without
knowing exactly how your pool is configured.

Based on the condor_status output you've provided in these threads, it
seems that you have one static slot per core. This means that you can
only submit jobs that request a single core per slot (or per
node/machine in the parallel universe), but you can request as many
nodes (machine_count) as you want up to the number of slots in your
pool.

If you want to submit jobs that request more than a single core (e.g.
request_cpus = 2), then you will need to reconfigure your pool to have
more than one core per slot or consider using partitionable slots.
Here's the manual for configuring slots (Section 3.5.10):
https://research.cs.wisc.edu/htcondor/manual/v8.0/3_5Policy_Configuration.html
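
As a minimal sketch of the partitionable-slot option, assuming each
execute node should expose all of its resources as a single
partitionable slot, the local configuration on each execute node could
look something like the following. The slot type number 1 is an
arbitrary choice, and the knob names should be checked against the
manual for your HTCondor version:

```
# Assumption: one partitionable slot per machine, owning all resources
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
```

With a setup like this, a job with request_cpus = 2 can carve a
two-core dynamic slot out of a machine that has at least two free
cores. Changes to slot types generally need a restart of the startd
rather than just a reconfig.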

Jason