I have a problem running MPI (MPICH2) jobs with Condor on Windows.
I followed the instructions in the Condor manual to configure parallel jobs. The following link also provides a lot of useful information:
Most things work, and the MPI job runs successfully with Condor on a single execute machine.
However, there is a problem when the job runs across two machines.
==== My Condor configuration ====
Condor version: 8.4.1
submit: submit machine, central manager
execute-1: execute machine 1, with 20 cpus
execute-2: execute machine 2, with 20 cpus
MPI application: app.exe
For instance, when I set machine_count = 30 in the parallel submit file, the 30 CPUs are claimed correctly, e.g. 20 on execute-1 and 10 on execute-2.
But the job is executed only on the execute-1 machine. There are 30 app.exe processes on execute-1, and none on execute-2.
Since there are only 20 CPUs on execute-1, the job completes like this: 20 app.exe processes run first, and as CPUs become free, the remaining 10 processes begin to run.
On the execute-2 machine, there are only 10 condor_starter daemons and no app.exe process.
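For reference, a parallel submit file matching the description above would look roughly like the sketch below. Only machine_count = 30 and app.exe come from the setup described here; the wrapper script name mp2script.bat, its argument, and the output, error and log file names are placeholders assumed for illustration, not the exact file in use.

```
# Sketch of a parallel-universe submit file (file names are placeholders)
universe                = parallel
# Wrapper script that launches the MPI program; the name is an assumption
executable              = mp2script.bat
arguments               = app.exe
# Number of slots requested for the job (value taken from the description above)
machine_count           = 30
should_transfer_files   = yes
when_to_transfer_output = on_exit
transfer_input_files    = app.exe
# $(NODE) expands to the node number, giving one output/error file per node
output                  = out.$(NODE)
error                   = err.$(NODE)
log                     = job.log
queue
```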
I would greatly appreciate any help with this. I have been digging into this problem for a few days without success.
If further information is needed, please let me know. Thanks.