
[Condor-users] contact list too short for parallel universe jobs



Hi,

We use Condor v7.2, MPICH, and Condor's native sshd.sh script.

We have run into another problem with our MPI jobs under Condor.

The contact list is too short.

We tried to collect 8 nodes to launch an MPI job. machine_count is set to 8 in the
submit file.

On the submit host I found:

9/15 17:17:42 (pid:819) Starting add_shadow_birthdate(964622.0)
9/15 17:17:42 (pid:819) Started shadow for job 964622.0 on slot3@xxxxxxxxxxxxxxxxx <10.10.16.6:32809> for DedicatedScheduler, (shadow pid = 5495)
:
9/15 17:23:15 (pid:819) Starting add_shadow_birthdate(964622.0)
9/15 17:23:15 (pid:819) Started shadow for job 964622.0 on slot3@xxxxxxxxxxxxxxxxx <10.10.16.6:32809> for DedicatedScheduler, (shadow pid = 15929)

So everything seems to be fine.

In the StartLog of the root node I found that the job started and, in fact, it is running.

9/15 17:23:17 slot3: Got activate_claim request from shadow (<10.20.30.1:44875>)
9/15 17:23:17 slot3: Remote job ID is 964622.0
9/15 17:23:17 slot3: Got universe "PARALLEL" (11) from request classad
9/15 17:23:17 slot3: State change: claim-activation protocol successful
9/15 17:23:17 slot3: Changing activity: Idle -> Busy


However, the contact list on the submit host contains only 6 nodes:
0 n1606 4444 user /local/condor.n1606/execute/dir_8332
4 n1643 4444 user /local/condor.n1643/execute/dir_18030
1 n1608 4444 user /local/condor.n1608/execute/dir_28160
7 n1659 4444 user /local/condor.n1659/execute/dir_21231
5 n1670 4445 user /local/condor.n1670/execute/dir_10512
6 n1629 4445 user /local/condor.n1629/execute/dir_15525

Hence, the MPI job tries to use 8 nodes, but only 6 are available, so it hangs.

Also, only 6 ssh keys have been generated.
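
To pin down which nodes never registered, the contact list can be checked against machine_count. The following is a minimal Python sketch, not part of Condor: the helper name missing_ranks is hypothetical, and it assumes that the first column of each contact line is the node number (0 to machine_count - 1), which is how the list above appears to be laid out.

```python
def missing_ranks(contact_text, machine_count):
    """Return the node numbers in 0..machine_count-1 that have no contact line.

    Assumption: the first whitespace-separated field of every non-empty
    line is the node number, as in the contact list quoted above.
    """
    seen = set()
    for line in contact_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        seen.add(int(fields[0]))
    return sorted(set(range(machine_count)) - seen)


# The contact list from the submit host, copied verbatim.
CONTACT = """\
0 n1606 4444 user /local/condor.n1606/execute/dir_8332
4 n1643 4444 user /local/condor.n1643/execute/dir_18030
1 n1608 4444 user /local/condor.n1608/execute/dir_28160
7 n1659 4444 user /local/condor.n1659/execute/dir_21231
5 n1670 4445 user /local/condor.n1670/execute/dir_10512
6 n1629 4445 user /local/condor.n1629/execute/dir_15525
"""

if __name__ == "__main__":
    # machine_count = 8 in the submit file
    print(missing_ranks(CONTACT, 8))  # prints [2, 3]
```

Under that assumption, nodes 2 and 3 are the ones that never wrote a contact line, so their logs would be the first place to look.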

Any hints?


I have another question, which might be completely unrelated to the problem above.
What happens if a node offers two free slots for a parallel universe job?
Are two keys generated for this node? If so, which one is used?


Thank you and cheers,
Henning