
[Condor-users] How to let the Condor work on Windows 7 system?



Dear all,

I have been advised to use Condor for parallel work. I have a cluster of 10 nodes named N01, N02, ..., N10, with IP addresses 10.0.0.1, 10.0.0.2, ..., 10.0.0.10. Every node runs 64-bit Windows 7 with the latest Intel MPI installed. I put the executable in a folder and shared that folder with all nodes.

On every node, I can run the program locally with:
mpiexec -n 24 \\n01\debug\test

and on node n01, I can run the program on other nodes with:
mpiexec -hosts 2 n02 24 n03 24 \\n01\debug\test
or
mpiexec -hosts 9 n02 24 n03 24 n04 24 n05 24 n06 24 n07 24 n08 24 n09 24 n10 24 \\n01\debug\test

However, the program always hangs in the following case: the executable is stored on node n0i (for example, in \\n03\debug\de), mpiexec is run on node n0j (for example, on n04), where i, j = 1, 2, ..., 10, and the hosts list includes n0i or n0j. If neither n0i nor n0j is in the hosts list, the program runs successfully.
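
To make the working case concrete, here is a minimal Python sketch that builds an mpiexec command line whose hosts list skips the node holding the executable and the node launching mpiexec. The function name build_mpiexec_command and its parameters are hypothetical; the "-hosts <count> <host> <procs> ..." layout is simply the one used in the commands above, and 24 processes per host is taken from those commands.

```python
def build_mpiexec_command(program, hosts, procs_per_host=24, exclude=()):
    """Return an mpiexec argument list for `hosts`, skipping excluded nodes.

    Hypothetical helper: it only assembles the "-hosts N host procs ..."
    form shown in the post and does not run anything.
    """
    excluded = {name.lower() for name in exclude}
    selected = [name for name in hosts if name.lower() not in excluded]
    if not selected:
        raise ValueError("no hosts left after exclusion")

    args = ["mpiexec", "-hosts", str(len(selected))]
    for host in selected:
        args += [host, str(procs_per_host)]
    args.append(program)
    return args


if __name__ == "__main__":
    nodes = [f"n{i:02d}" for i in range(1, 11)]  # n01 .. n10
    # Executable stored on n03, mpiexec launched from n04: leave both out.
    cmd = build_mpiexec_command(r"\\n03\debug\de", nodes, exclude=("n03", "n04"))
    print(" ".join(cmd))
```

This only covers the case that already works; it does not explain why including n0i or n0j makes the run hang.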

Further tests show that when the executable is larger, for example when linking in a library increases the file size from 900 KB to 16 MB, the following error appears when the program is run with mpiexec:

op_read error on left context: Error = -1
unable to read the cmd header on the left context, Error = -1.
Error posting readv, An existing connection was forcibly closed by the remote host.(10054)


How should I change the command line or the cluster configuration so that mpiexec works? If this is a known MPI problem, can Condor help me work around it? If so, what are the step-by-step instructions for setting up Condor for my problem?

Thanks,
Zhanghong Tang