
[HTCondor-users] MPI issue with condor on Windows



Hello,

 

I have a problem running MPI (MPICH2) jobs with Condor on Windows.

I followed the instructions in the Condor manual to configure parallel jobs. The following link also provides a lot of useful information:

http://www.itk.org/Wiki/Proposals:Condor

 

Most things work, and the MPI job runs successfully with Condor on a single execute machine.

However, there is a problem when the job runs across two machines.

 

==== My Condor configurations =====

Condor version: 8.4.1

submit: submit machine, central manager

execute-1: execute machine 1, with 20 cpus

execute-2: execute machine 2, with 20 cpus

MPI: mpich2-1.4.1p1-x86-64

MPI application: app.exe

===============================

 

For instance, when I set machine_count = 30 in the parallel submit file, the 30 CPUs are claimed correctly, e.g. 20 on execute-1 and 10 on execute-2.
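
For reference, a parallel-universe submit file of this kind has roughly the shape sketched below. This is only an illustration, not the exact file from this setup: the wrapper script name mpi_wrapper.bat and the output, error, and log file names are hypothetical placeholders, and the wrapper is assumed to be the script that launches app.exe through MPICH2.

```
# Sketch of a parallel-universe submit description file.
# mpi_wrapper.bat and the output/error/log names are hypothetical placeholders.
universe                = parallel
executable              = mpi_wrapper.bat
arguments               = app.exe
machine_count           = 30
should_transfer_files   = yes
when_to_transfer_output = on_exit
transfer_input_files    = app.exe
output                  = out.$(NODE)
error                   = err.$(NODE)
log                     = mpi_job.log
queue
```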

However, the job runs only on the execute-1 machine: there are 30 app.exe processes on execute-1 and none on execute-2.

Since execute-1 has only 20 CPUs, the job proceeds like this: 20 app.exe processes run first, and as resources become free, the remaining 10 processes start.

 

On the execute-2 machine, there are only 10 condor_starter daemons and no app.exe processes.

 

I would greatly appreciate any help with this. I have been digging into this problem for a few days without success.

 

If further information is needed, please let me know. Thanks.

 

 

 

Best regards,

Linlin