Re: [HTCondor-users] multicore and multinode run



Unlike Slurm or PBS/Torque, HTCondor is not natively supported by Open
MPI as a launcher, so mpirun needs some help to set up the correct
environment. If your version of HTCondor is new enough (v8.6+), there's
a bundled openmpiscript that should work for many cases.

The script should be in your examples directory (just like the
condor_config.local.dedicated_resource config example). On my machine,
this is:
/usr/share/doc/condor-8.7.6/examples/

See http://research.cs.wisc.edu/htcondor/manual/current/2_9Parallel_Applications.html
for instructions on how to use the openmpiscript (MPI Applications
subheading under the Submission Examples heading).
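
As a rough sketch only (adapted from your submit file, not tested on
your pool), a submit description that goes through openmpiscript
instead of calling mpirun directly might look like the one below. It
assumes you have copied openmpiscript from the examples directory into
the submit directory and that mpihello is your compiled MPI binary in
the same place. The transfer settings and the per-node $(NODE) output
names are my choices for illustration. The script also needs to know
where Open MPI is installed (/opt/openmpi in your case); check the
comments at the top of your copy for how that is set.

```
# Sketch: parallel universe job launched through openmpiscript.
# Assumes openmpiscript and mpihello are both in the submit directory.
universe = parallel
executable = openmpiscript
arguments = mpihello
machine_count = 2
request_cpus = 1
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = mpihello
log = hellompi.log
# One output/error file per node, so the nodes do not share a file.
output = hellompi.out.$(NODE)
error = hellompi.err.$(NODE)
queue
```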

There was a nice improvement made to openmpiscript (and its helper
scripts) in v8.7.4 which may also be used with the v8.6.x condor
versions. Let me know if you're interested in the updated scripts and
I can get them to you off list.

Jason

On Tue, Jan 23, 2018 at 5:14 AM, Mahmood Naderan <nt_mahmood@xxxxxxxxx> wrote:
> Hi,
> With two compute nodes with 2 and 4 cores, I submitted an MPI job with this
> content:
>
> universe = parallel
> executable = /opt/openmpi/bin/mpirun
> arguments = mpihello
> log = hellompi.log
> output = hellompi.out
> error = hellompi.err
> machine_count = 2
> queue
>
>
> After the submission, I see this in the output file
>
> --------------------------------------------------------------------------
> Hello world from processor compute-0-1.local, rank 0 out of 4 processors
> Hello world from processor compute-0-1.local, rank 1 out of 4 processors
> Hello world from processor compute-0-1.local, rank 2 out of 4 processors
> Hello world from processor compute-0-1.local, rank 3 out of 4 processors
> Hello world from processor compute-0-1.local, rank 0 out of 4 processors
> Hello world from processor compute-0-1.local, rank 1 out of 4 processors
> Hello world from processor compute-0-1.local, rank 2 out of 4 processors
> Hello world from processor compute-0-1.local, rank 3 out of 4 processors
>
>
>
> So, it seems that the scheduler submits the job to compute-0-1 and runs it
> twice because of the machine count. Is that right? If so, why?
>
> I also used
>
> machine_count = 2
> request_cpus = 1
>
> to allocate two machines and one CPU on each of them. However, I see
>
> Hello world from processor compute-0-1.local, rank 2 out of 4 processors
> Hello world from processor compute-0-1.local, rank 0 out of 4 processors
> Hello world from processor compute-0-1.local, rank 3 out of 4 processors
> Hello world from processor compute-0-1.local, rank 1 out of 4 processors
>
>
>
> Can someone shed some light on that? Note:
>
> # condor_status -af:h Machine DedicatedScheduler
> Machine           DedicatedScheduler
> compute-0-0.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-0.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
>
>
>
>
> Regards,
> Mahmood