
[Condor-users] MPI jobs in parallel universe



Hello,

We start MPI jobs using the parallel universe of Condor v7.2.
On the execute nodes the script $CONDOR_PATH/condor-7.2.4/libexec/sshd.sh
is used.
The submit host creates a contact list, which is fetched by the
root node, which then starts the job.
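For context, our submit description looks roughly like this (a minimal
sketch; the wrapper name is a placeholder, not our exact file):

  universe                = parallel
  executable              = mpi_wrapper.sh   # wrapper that reads the contact file and calls mpirun
  machine_count           = 8
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  queue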

We do not have much experience with MPI jobs and Condor yet, but the
creation of the contact list (8 nodes) usually works.

We have now observed, for the first time, a problem which might be a Condor
issue. The first part of the contact list contained an almost complete
set of nodes; only one was missing. About 6 hours later a complete set of
other nodes was appended to the list. The list now contains
two sets of nodes (15 entries), with one node of the first set missing.
The list now looks like this:

3 n1610 4444 user /local/condor.n1610/execute/dir_30754
1 n1609 4444 user /local/condor.n1609/execute/dir_3377
5 n1634 4444 user /local/condor.n1634/execute/dir_16580
0 n1606 4444 user /local/condor.n1606/execute/dir_1836
6 n1618 4444 user /local/condor.n1618/execute/dir_3399
2 n1610 4445 user /local/condor.n1610/execute/dir_30753
4 n1610 4446 user /local/condor.n1610/execute/dir_30760
5 n1623 4444 user /local/condor.n1623/execute/dir_12669
0 n1601 4444 user /local/condor.n1601/execute/dir_3838
1 n1603 4444 user /local/condor.n1603/execute/dir_8032
7 n1632 4444 user /local/condor.n1632/execute/dir_21320
2 n1605 4444 user /local/condor.n1605/execute/dir_30720
6 n1642 4444 user /local/condor.n1642/execute/dir_14027
3 n1609 4445 user /local/condor.n1609/execute/dir_19201
4 n1610 4447 user /local/condor.n1610/execute/dir_14450

The ssh keys of the last 8 nodes were found in the spool directory.
The root node was not able to start the job successfully, most likely
because the keys of the first set of 7 nodes were missing.

I found this in the user's log:

014 (946522.000.006) 09/14 17:57:21 Node 6 executing on host: <10.10.16.18:33045>
...
014 (946522.000.005) 09/14 17:57:21 Node 5 executing on host: <10.10.16.34:39853>
...
014 (946522.000.001) 09/14 23:44:15 Node 1 executing on host: <10.10.16.3:34512>
...
014 (946522.000.003) 09/14 23:44:15 Node 3 executing on host: <10.10.16.9:52771>
...
014 (946522.000.004) 09/14 23:44:15 Node 4 executing on host: <10.10.16.10:59493>
...
014 (946522.000.002) 09/14 23:44:15 Node 2 executing on host: <10.10.16.5:38208>
...
014 (946522.000.000) 09/14 23:44:15 Node 0 executing on host: <10.10.16.1:40642>
...
014 (946522.000.005) 09/14 23:44:16 Node 5 executing on host: <10.10.16.23:39523>
...
014 (946522.000.007) 09/14 23:44:16 Node 7 executing on host: <10.10.16.32:42198>
...
014 (946522.000.006) 09/14 23:44:16 Node 6 executing on host: <10.10.16.42:60415>
...
001 (946522.000.000) 09/14 23:44:16 Job executing on host: MPI_job



I also found this in the SchedLog of the submit host:

9/14 17:57:18 (pid:7728) Starting add_shadow_birthdate(946522.0)
9/14 17:57:18 (pid:7728) Started shadow for job 946522.0 on slot3@xxxxxxxxxxxxxxxxx <10.10.16.6:32809> for DedicatedScheduler,(shadow pid = 2923)

9/14 23:35:03 (pid:740) 946522.0: JobLeaseDuration remaining: EXPIRED!
9/14 23:44:13 (pid:819) Starting add_shadow_birthdate(946522.0)
9/14 23:44:13 (pid:819) Started shadow for job 946522.0 on slot2@xxxxxxxxxxxxxxxxx <10.10.16.1:40642> for DedicatedScheduler, (shadow pid = 5955)
9/15 09:50:12 (pid:819) Attempting to chown '/local/condor.atlas1/spool/cluster946522.proc0.subproc0/.contact.swp' from 5012 to 666.666, but the path was unexpectedly owned by 0


Even if we increase the JobLeaseDuration, it will probably not solve the
problem of the contact list growing too long.
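
For completeness: if we do raise the lease, I assume it would be set in the
submit description file, roughly like this (7200 seconds is just an example
value, not something we have tested):

  job_lease_duration = 7200   # lease duration in seconds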

Any ideas?

Thank you and cheers,
Henning