
[HTCondor-users] Re: MPI issue with condor on Windows



Hi,

 

Since there has been no response so far, here is some more information that may help.

 

------------------------------------  condor.config ------------------------------

CONDOR_HOST = sctmc.ctmc.com

COLLECTOR_NAME = CTMC

UID_DOMAIN = $(CONDOR_HOST)

CONDOR_ADMIN =

SMTP_SERVER =

ALLOW_READ = *

ALLOW_WRITE = *

ALLOW_ADMINISTRATOR = $(CONDOR_HOST), $(IP_ADDRESS)

 

CREDD_HOST = sctmc.ctmc.com

STARTER_ALLOW_RUNAS_OWNER = True

CREDD_CACHE_LOCALLY = True

SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD

ALLOW_CONFIG = zhanglinlin1@ctmc

 

START = FALSE

WANT_VACATE = FALSE

WANT_SUSPEND = TRUE

DAEMON_LIST = MASTER, SCHEDD, COLLECTOR, NEGOTIATOR

 

BIND_ALL_INTERFACES = FALSE

--------------------------------------------------------------------------------------

This is the configuration for the central manager. The worker nodes use a similar configuration, except for the daemon-related lines.

 

 

---------------------  condor.config.local (parallel settings only) -----------

#SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe

#SMPD_SERVER_ARGS = -p 6666 -d

#SMPD_SERVER_LOG = $(LOG)\SmpdLog

 

DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxx"

STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxx"

 

MPI_CONDOR_RSH_PATH = $(LIBEXEC)

 

START                             = True

SUSPEND            = False

CONTINUE     = True

PREEMPT            = False

KILL                   = False

WANT_SUSPEND = False

WANT_VACATE        = False

RANK                = Scheduler =?= $(DedicatedScheduler)

 

#DAEMON_LIST = $(DAEMON_LIST), SMPD_SERVER

--------------------------------------------------------------------------------------------

I also tried uncommenting the SMPD-related lines in condor.config.local, but this did not solve the problem.

 

---------------------- submit file ------------------------------------------

universe = parallel

executable = mp2script.bat

arguments = \\sctmc\d\condor\myapp.exe

machine_count = 30

output = parallel_out.$(NODE).log

error  = parallel_error.$(NODE).log

log    = parallel_log.$(NODE).log

should_transfer_files   = yes

when_to_transfer_output = on_exit

 

run_as_owner = True

 

queue

------------------------------------------------------------------------------

 

There is another problem: the log file produced is named parallel_log.#pArAlLeLnOdE#, which is not correct. I did not find any errors about this in the Condor log files.

 

Any suggestions?

 

 

Thanks,

Linlin

 

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of 张琳琳1
Sent: 2015115 16:33
To: HTCondor-Users Mail List
Subject: [HTCondor-users] MPI issue with condor on Windows

 

Hello,

 

I have a problem running MPI (mpich2) jobs under Condor on Windows.

I followed the instructions in the Condor manual to configure parallel jobs. The following link also provides a lot of useful information:

http://www.itk.org/Wiki/Proposals:Condor

 

Most things work, and the MPI job runs successfully under Condor on a single execute machine.

But there is a problem when the job runs across two machines.

 

==== My Condor configurations =====

Condor version: 8.4.1

submit: submit machine, central manager

execute-1: execute machine 1, with 20 cpus

execute-2: execute machine 2, with 20 cpus

MPI: mpich2-1.4.1p1-x86-64

MPI application: app.exe

===============================

 

For instance, when I set machine_count = 30 in the parallel submit file, the 30 cpus are correctly claimed, e.g. 20 on execute-1 and 10 on execute-2.

But the job only runs on execute-1: there are 30 app.exe processes on execute-1 and none on execute-2.

Since execute-1 has only 20 cpus, the job finishes like this: 20 app.exe processes run first, and the remaining 10 start as cpus become free.

 

On execute-2, there are only 10 condor_starter processes and no app.exe process.
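
To make the mismatch concrete, here is a small Python sketch of the arithmetic. It is only an illustration: the dictionaries hold the numbers described above, and the function name placement_report is made up for this example, not part of Condor or MPICH2.

```python
# Slots claimed by Condor per machine (from the description above).
claimed = {"execute-1": 20, "execute-2": 10}
# app.exe processes actually observed per machine.
observed = {"execute-1": 30, "execute-2": 0}
# CPUs available on each execute machine.
cpus = {"execute-1": 20, "execute-2": 20}


def placement_report(claimed, observed, cpus):
    """Compare claimed slots with observed processes for each host."""
    report = {}
    for host in claimed:
        # Positive: more processes than claimed slots; negative: fewer.
        extra = observed[host] - claimed[host]
        # Processes beyond the CPU count have to wait for a free cpu.
        waiting = max(0, observed[host] - cpus[host])
        report[host] = {"extra": extra, "waiting": waiting}
    return report


report = placement_report(claimed, observed, cpus)
for host, row in report.items():
    print(host, row)
```

With these numbers, execute-1 carries 10 processes more than it has claimed slots for (and 10 have to wait for a cpu), while execute-2 is 10 processes short of its claim.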

 

I would very much appreciate any help with this. I have been digging into this problem for a few days without success.

 

If further information is needed, let me know. Thanks.

 

 

 

Best regards,

Linlin