
Re: [HTCondor-users] MPI on Windows



Hello Sam,

 

I've got some experience with MPI jobs using condor, but only with Linux. We haven't had a requirement for Windows so far (thankfully).

 

If you haven't already, you should probably read the documentation for MPI jobs:

https://htcondor.readthedocs.io/en/latest/users-manual/parallel-applications.html

https://htcondor.readthedocs.io/en/latest/admin-manual/setting-up-special-environments.html#htcondor-s-dedicated-scheduling

 

Here's how I personally configure my clusters (most basic settings only).

 

Central manager server (all-in-one):

/etc/condor/config.d/01-cm.config

 

# Common configuration (MASTER)

CONDOR_HOST = $(hostname --short)

ALLOW_DAEMON = $NET_INT_PREFIX.*

# Configure host for central management (COLLECTOR, NEGOTIATOR)

use ROLE: get_htcondor_central_manager

# Configure host for submission of jobs (SCHEDD)

use ROLE: get_htcondor_submit

# Enable partitionable slot preemption

ALLOW_PSLOT_PREEMPTION = True

# Speed up reclaiming of unused slots

UNUSED_CLAIM_TIMEOUT = 20

 

And for the execute nodes (compute servers):

/etc/condor/config.d/02-role-execute.config

 

# Common configuration (MASTER)

CONDOR_HOST = $(hostname --short)

ALLOW_DAEMON = $NET_INT_PREFIX.*

# Configure host for jobs execution (STARTD)

use ROLE: get_htcondor_execute

# Link node to central manager

UID_DOMAIN = $(hostname --short)

TRUST_UID_DOMAIN = TRUE

# Prioritize parallel jobs over serial

DedicatedScheduler = "DedicatedScheduler@$(hostname --short)"

STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

START = True

SUSPEND = False

CONTINUE = True

PREEMPT = False

KILL = False

WANT_SUSPEND = False

WANT_VACATE = False

RANK = Scheduler =?= $(DedicatedScheduler)

# Activate Dynamic slots configuration and slot partitioning

NUM_SLOTS = 1

NUM_SLOTS_TYPE_1 = 1

SLOT_TYPE_1 = auto

SLOT_TYPE_1_PARTITIONABLE = True

 

Replace $(hostname --short) with the network name of your central manager (CM).

In my setup, $NET_INT_PREFIX is the first 2 numbers of the IP range of the dedicated local network between the central manager and the execute nodes.
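
As a sketch of that substitution, assume a hypothetical central manager named cm.example.internal on a private 192.168.x.x network (both values are made up for illustration). The placeholder lines from the two files above would then read roughly as follows:

```text
# Hypothetical values: replace with your own CM name and network prefix
CONDOR_HOST = cm.example.internal
ALLOW_DAEMON = 192.168.*

# Execute nodes only
UID_DOMAIN = cm.example.internal
DedicatedScheduler = "DedicatedScheduler@cm.example.internal"
```

The manual's dedicated-scheduling example uses the full host name of the submit machine in DedicatedScheduler, so the short name may need adjusting depending on how your schedd names itself.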

I use IDTOKEN security. https://htcondor.readthedocs.io/en/latest/admin-manual/security.html#highlights-of-new-features-in-version-9-0-0

 

This way, both MPI and serial jobs can be submitted and executed across all nodes, with MPI jobs being prioritized (as in, they can't be bumped during preemption), and with the CM releasing claimed dynamic partitioned slots if idle for more than 20 seconds.

There might be better ways to configure this, but it gets the job done. :)

 

As for submit files and wrappers, they are specific to each piece of R&D software we use, although I've only used Open MPI so far. My wrappers are modified versions of the openmpiscript example.

I haven't tried the MPICH examples (mp1script, mp2script). I do not think there's an example file for MPI for Windows.
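
For reference, a parallel universe submit file on Linux, in the style of the manual's Open MPI example, looks roughly like the sketch below. It is not a Windows recipe: my_mpi_app and the node count are placeholders, and openmpiscript is the example wrapper shipped with HTCondor, which would need a Windows equivalent.

```text
# Sketch only: multi-node MPI job in the parallel universe
universe                = parallel
# Wrapper that bootstraps MPI, then launches the program named in arguments
executable              = openmpiscript
arguments               = my_mpi_app
# Number of slots (nodes) to claim; only valid in the parallel universe
machine_count           = 4
request_cpus            = 1
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = my_mpi_app
# $(NODE) expands to the node number, giving one output file per node
output                  = out.$(NODE)
error                   = err.$(NODE)
log                     = mpi_job.log
queue
```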

If you don't run jobs across multiple execute nodes, then as Greg mentioned, the vanilla universe might be simpler with MPI for Windows.

(vanilla universe does not accept machine_count in submit files)
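
For the single-node case, a vanilla universe submit file could look roughly like the sketch below. It assumes a hypothetical wrapper run_mpi.bat that calls your MPI launcher on the execute node (for example mpiexec -n 8 my_app.exe); the file names and the core count are placeholders, not something taken from a tested Windows setup. The cores are requested with request_cpus instead of machine_count.

```text
# Sketch only: single-node MPI job in the vanilla universe
universe                = vanilla
# Hypothetical wrapper that runs: mpiexec -n 8 my_app.exe
executable              = run_mpi.bat
# All ranks run on one machine, so ask for the cores in a single slot
request_cpus            = 8
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = my_app.exe
output                  = mpi_job.out
error                   = mpi_job.err
log                     = mpi_job.log
queue
```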

 

Martin

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Greg Thain via HTCondor-users
Sent: September 11, 2023 10:56 AM
To: htcondor-users@xxxxxxxxxxx
Cc: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] MPI on Windows

 

On 9/8/23 17:51, Sam.Dana@xxxxxxxxxxx wrote:

Looking at condor_config.local.dedicated.submit, in the statement, 

"If your dedicated resources are configured to only run jobs, you should probably set this attribute to '0'", 

does "only run jobs" mean "only run dedicated jobs" to correlate with Policy 1 in condor_config.local.dedicated.resource?

 

It does, but that's a small optimization.  To run parallel/dedicated jobs, I'd leave UNUSED_CLAIM_TIMEOUT

at the default value of 600 unless you have a good reason to change it, though.

 

Looking at condor_config.local.dedicated.resource, I wonder: 

      what settings are needed to run MPI apps within HTCondor on Windows?

 

Generally speaking, the most "High Throughput" way to run MPI jobs is to run a lot of

independent MPI jobs that each run on one node in your pool, perhaps on many cores on one node.

This can be done in the vanilla universe.  If you absolutely must run MPI jobs across multiple

nodes, then you will need to run the parallel universe.

 

To run MPI jobs on the parallel universe, you'll need scripts to bootstrap the MPI world.  To

be honest, I don't know of anyone who has done this on Windows in quite some time, and

I don't know how up to date the examples we provide are with any modern version of

MPI for Windows.

 

 

-greg

 

 

Thanks,

Sam

 



