
Re: [HTCondor-users] multicore and multinode run



I am sorry, but this time it puts the job on compute-0-0 only. I understand the logic you explained, but it is really weird.

[mahmood@rocks7 mpi]$ cat mpi.ht
universe = parallel
executable = openmpiscript
arguments = mpihello
log = hellompi.log
output = hellompi.out.$(Node)
error = hellompi.err.$(Node)
request_cpus = 1
# set requirements for first execute node
requirements = Machine == "compute-0-0.local"
machine_count = 1
queue
# set requirements for second execute node
requirements = Machine == "compute-0-1.local"
machine_count = 1
queue
[mahmood@rocks7 mpi]$ condor_submit mpi.ht
Submitting job(s)..
1 job(s) submitted to cluster 31.
[mahmood@rocks7 mpi]$ cat hellompi.out.0
Hello world from processor compute-0-0.local, rank 1 out of 2 processors
Hello world from processor compute-0-0.local, rank 0 out of 2 processors
[mahmood@rocks7 mpi]$ cat hellompi.out.1
[mahmood@rocks7 mpi]$ cat hellompi.err.0
mkdir: cannot create directory '/var/opt/condor/execute/dir_28046/tmp': File exists
[mahmood@rocks7 mpi]$ cat hellompi.err.1
mkdir: cannot create directory '/var/opt/condor/execute/dir_28508/tmp': File exists
[mahmood@rocks7 mpi]$ condor_status -af:h Machine DedicatedScheduler
Machine           DedicatedScheduler                         
compute-0-0.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
compute-0-0.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
[mahmood@rocks7 mpi]$ condor_status
Name                    OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@xxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Idle      0.000 1973  0+00:00:03
slot2@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1973  3+22:04:01
slot1@xxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Idle      0.000  986  0+00:01:13
slot2@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  986  0+03:29:59
slot3@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  986  3+23:02:22
slot4@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  986  3+23:02:22

                     Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain

        X86_64/LINUX     6     0       2         4       0          0        0      0

               Total     6     0       2         4       0          0        0      0
[mahmood@rocks7 mpi]$ condor_q


-- Schedd: rocks7.vbtestcluster.com : <10.0.3.15:9618?... @ 01/26/18 14:31:41
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
[mahmood@rocks7 mpi]$



Regards,
Mahmood


On Friday, January 26, 2018, 11:45:41 AM EST, Jason Patton <jpatton@xxxxxxxxxxx> wrote:


When you specify "machine_count = 2", you are asking for your job to
land on any two slots in your condor pool. These two slots could be on
the same physical machine, which is likely what has happened.
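For comparison, a bare two-node request with no machine constraint (a minimal sketch, assuming the same executable and arguments as in your submit file) would look like this, and HTCondor is free to satisfy it with two slots on the same machine:

universe = parallel
executable = openmpiscript
arguments = mpihello
log = hellompi.log
output = hellompi.out.$(Node)
error = hellompi.err.$(Node)
request_cpus = 1
machine_count = 2
queue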

If you want to *test* that your job can land on each machine, you can
set requirements per node in your submit file:

universe = parallel
executable = openmpiscript
arguments = mpihello
log = hellompi.log
output = hellompi.out.$(Node)
error = hellompi.err.$(Node)
request_cpus = 1

# set requirements for first execute node
requirements = Machine == "compute-0-0.local"
machine_count = 1
queue

# set requirements for second execute node
requirements = Machine == "compute-0-1.local"
machine_count = 1
queue

This will get you a two-node parallel universe job where one node is
restricted to running on compute-0-0 and the other node is restricted
to running on compute-0-1.
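To double-check where each node actually landed (assuming mpihello prints the processor name, as in your hellompi.out files), you can look at condor_q while the job runs and at the per-node output afterwards:

# while the job is running, the HOST(S) column shows the claimed slot(s)
condor_q -run

# after completion, each node's output should name a different machine
grep "Hello world from processor" hellompi.out.*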

Jason