
Re: [HTCondor-users] multicore and multinode run



That is strange. I believe I have the same test program as your
mpihello, so I will see if I can reproduce this behavior.

In the meantime, can you run your test job with machine_count = 5? (Just
a single queue statement, without the per-machine requirements.)
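
Something along these lines, reusing everything else from your submit
file unchanged, is what I have in mind:

universe = parallel
executable = openmpiscript
arguments = mpihello
log = hellompi.log
output = hellompi.out.$(Node)
error = hellompi.err.$(Node)
request_cpus = 1
machine_count = 5
queue

Since neither of your machines has five slots on its own, a five-node
job like this would have to span both of them, which should tell us
whether the dedicated scheduler is willing to spread the nodes across
machines at all.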

Jason

On Fri, Jan 26, 2018 at 1:35 PM, Mahmood Naderan <nt_mahmood@xxxxxxxxx> wrote:
> I am sorry, but this time it puts the job on compute-0-0 only. I understand
> the logic you described, but it is really weird.
>
> [mahmood@rocks7 mpi]$ cat mpi.ht
> universe = parallel
> executable = openmpiscript
> arguments = mpihello
> log = hellompi.log
> output = hellompi.out.$(Node)
> error = hellompi.err.$(Node)
> request_cpus = 1
> # set requirements for first execute node
> requirements = Machine == "compute-0-0.local"
> machine_count = 1
> queue
> # set requirements for second execute node
> requirements = Machine == "compute-0-1.local"
> machine_count = 1
> queue
> [mahmood@rocks7 mpi]$ condor_submit mpi.ht
> Submitting job(s)..
> 1 job(s) submitted to cluster 31.
> [mahmood@rocks7 mpi]$ cat hellompi.out.0
> Hello world from processor compute-0-0.local, rank 1 out of 2 processors
> Hello world from processor compute-0-0.local, rank 0 out of 2 processors
> [mahmood@rocks7 mpi]$ cat hellompi.out.1
> [mahmood@rocks7 mpi]$ cat hellompi.err.0
> mkdir: cannot create directory '/var/opt/condor/execute/dir_28046/tmp': File exists
> [mahmood@rocks7 mpi]$ cat hellompi.err.1
> mkdir: cannot create directory '/var/opt/condor/execute/dir_28508/tmp': File exists
> [mahmood@rocks7 mpi]$ condor_status -af:h Machine DedicatedScheduler
> Machine           DedicatedScheduler
> compute-0-0.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-0.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> [mahmood@rocks7 mpi]$ condor_status
> Name                    OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
>
> slot1@xxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Idle      0.000 1973  0+00:00:03
> slot2@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1973  3+22:04:01
> slot1@xxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Idle      0.000  986  0+00:01:13
> slot2@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  986  0+03:29:59
> slot3@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  986  3+23:02:22
> slot4@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  986  3+23:02:22
>
>                      Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
>
>         X86_64/LINUX     6     0       2         4       0          0        0     0
>
>                Total     6     0       2         4       0          0        0     0
> [mahmood@rocks7 mpi]$ condor_q
>
>
> -- Schedd: rocks7.vbtestcluster.com : <10.0.3.15:9618?... @ 01/26/18 14:31:41
> OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
>
> 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
> [mahmood@rocks7 mpi]$
>
>
>
> Regards,
> Mahmood
>
>
> On Friday, January 26, 2018, 11:45:41 AM EST, Jason Patton <jpatton@xxxxxxxxxxx> wrote:
>
>
> When you specify "machine_count = 2", you are asking for your job to
> land on any two slots in your condor pool. These two slots could be on
> the same physical machine, which is likely what has happened.
>
> If you want to *test* that your job can land on each machine, you can
> set requirements per node in your submit file:
>
> universe = parallel
> executable = openmpiscript
> arguments = mpihello
> log = hellompi.log
> output = hellompi.out.$(Node)
> error = hellompi.err.$(Node)
> request_cpus = 1
>
> # set requirements for first execute node
> requirements = Machine == "compute-0-0.local"
> machine_count = 1
> queue
>
> # set requirements for second execute node
> requirements = Machine == "compute-0-1.local"
> machine_count = 1
> queue
>
> This will get you a two-node parallel universe job where one node is
> restricted to running on compute-0-0 and the other node is restricted
> to running on compute-0-1.
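>
> As a quick sanity check once the job finishes, the hostnames that
> mpihello prints into the per-node output files should show ranks on
> both machines, e.g. something like:
>
>     grep processor hellompi.out.*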
>
> Jason
>