
[HTCondor-users] MPICH job run on different machines



Hi,
  I am using HTCondor to submit an MPI job that needs 16 slots. Running the job on two machines directly with mpiexec works fine. But if I use mp1script to run it, Condor gives the following error:
  10.1.1.103 no such file or directory /tmp/var/condor/execute/dir_2109562
 The following is the contact file:
8 10.1.1.103 4444 condor /tmp/var/condor/execute/dir_3338933 1460387177
0 node70 4444 condor /tmp/var/condor/execute/dir_2109562 1460387177
7 node70 4445 condor /tmp/var/condor/execute/dir_2109566 1460387177
4 10.1.1.103 4445 condor /tmp/var/condor/execute/dir_3338930 1460387177
3 node70 4446 condor /tmp/var/condor/execute/dir_2109564 1460387177
14 10.1.1.103 4446 condor /tmp/var/condor/execute/dir_3338936 1460387177
1 node70 4447 condor /tmp/var/condor/execute/dir_2109563 1460387177
12 10.1.1.103 4447 condor /tmp/var/condor/execute/dir_3338935 1460387177
11 node70 4448 condor /tmp/var/condor/execute/dir_2109568 1460387177
10 10.1.1.103 4448 condor /tmp/var/condor/execute/dir_3338934 1460387177
6 10.1.1.103 4449 condor /tmp/var/condor/execute/dir_3338932 1460387177
9 node70 4449 condor /tmp/var/condor/execute/dir_2109567 1460387177
5 node70 4450 condor /tmp/var/condor/execute/dir_2109565 1460387177
2 10.1.1.103 4450 condor /tmp/var/condor/execute/dir_3338929 1460387177
15 10.1.1.103 4451 condor /tmp/var/condor/execute/dir_3338938 1460387177
13 node70 4451 condor /tmp/var/condor/execute/dir_2109569 1460387177
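
To make the mismatch easier to see, here is a minimal Python sketch that parses a few lines of the contact file above and groups the scratch directories by host. The field meanings (rank, host, port, user, scratch directory, run stamp) are my reading of the columns, not something I have confirmed, and the function names parse_contact and dirs_by_host are made up for this example. It only shows that dir_2109562 belongs to node70, not to 10.1.1.103.

```python
from collections import defaultdict

# A few lines copied from the contact file above (not the full file).
CONTACT = """\
8 10.1.1.103 4444 condor /tmp/var/condor/execute/dir_3338933 1460387177
0 node70 4444 condor /tmp/var/condor/execute/dir_2109562 1460387177
7 node70 4445 condor /tmp/var/condor/execute/dir_2109566 1460387177
4 10.1.1.103 4445 condor /tmp/var/condor/execute/dir_3338930 1460387177
"""


def parse_contact(text):
    """Parse contact-file lines into dicts, sorted by rank.

    Assumed column order: rank host port user scratch_dir run_stamp.
    Lines that do not have exactly six fields are skipped.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue
        rank, host, port, user, scratch, stamp = fields
        entries.append({
            "rank": int(rank),
            "host": host,
            "port": int(port),
            "user": user,
            "scratch": scratch,
            "stamp": stamp,
        })
    return sorted(entries, key=lambda e: e["rank"])


def dirs_by_host(entries):
    """Map each host to the list of scratch directories created on it."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["host"]].append(entry["scratch"])
    return dict(grouped)


if __name__ == "__main__":
    for host, dirs in dirs_by_host(parse_contact(CONTACT)).items():
        print(host)
        for d in dirs:
            print("   ", d)
```

Each slot gets its own execute directory, so the directory of rank 0 on node70 does not exist on 10.1.1.103, which matches the error message.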

 Does this mean the MPICH application must be in the same directory on every machine? How can I solve this?
Thanks,
HaozhanW