[Condor-users] contact list too short for parallel universe jobs
- Date: Wed, 16 Sep 2009 09:12:29 +0200
- From: Henning Fehrmann <henning.fehrmann@xxxxxxxxxx>
- Subject: [Condor-users] contact list too short for parallel universe jobs
We use Condor v7.2, MPICH, and the Condor native sshd.sh script.
We have run into another problem with our MPI jobs under Condor:
the contact list is too short.
We tried to collect 8 nodes to launch an MPI job. The machine_count is set to 8 in the
On the submit host I found:
9/15 17:17:42 (pid:819) Starting add_shadow_birthdate(964622.0)
9/15 17:17:42 (pid:819) Started shadow for job 964622.0 on slot3@xxxxxxxxxxxxxxxxx <10.10.16.6:32809> for DedicatedScheduler, (shadow pid = 5495)
9/15 17:23:15 (pid:819) Starting add_shadow_birthdate(964622.0)
9/15 17:23:15 (pid:819) Started shadow for job 964622.0 on slot3@xxxxxxxxxxxxxxxxx <10.10.16.6:32809> for DedicatedScheduler, (shadow pid = 15929)
So everything seems to be fine.
In the StartLog of the root node I found that the job started and, in fact, it is running.
9/15 17:23:17 slot3: Got activate_claim request from shadow (<10.20.30.1:44875>)
9/15 17:23:17 slot3: Remote job ID is 964622.0
9/15 17:23:17 slot3: Got universe "PARALLEL" (11) from request classad
9/15 17:23:17 slot3: State change: claim-activation protocol successful
9/15 17:23:17 slot3: Changing activity: Idle -> Busy
On the other hand, the contact list on the submit host contains only 6 nodes:
0 n1606 4444 user /local/condor.n1606/execute/dir_8332
4 n1643 4444 user /local/condor.n1643/execute/dir_18030
1 n1608 4444 user /local/condor.n1608/execute/dir_28160
7 n1659 4444 user /local/condor.n1659/execute/dir_21231
5 n1670 4445 user /local/condor.n1670/execute/dir_10512
6 n1629 4445 user /local/condor.n1629/execute/dir_15525
Hence the MPI job tries to use 8 nodes, finds only 6 available, and hangs.
Also, only 6 ssh keys have been generated.
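
To pin down which nodes never registered, the contact list can be compared against machine_count. The following is only a minimal sketch: it assumes the first column of each contact line is the node number (0 to machine_count - 1), followed by hostname, sshd port, user, and scratch directory, which is how the six lines above read. The function names parse_contact_lines and missing_nodes are made up for illustration and are not part of Condor.

```python
# Sketch: find node numbers that are absent from a Condor contact list.
# Assumption: each line is "<node> <host> <port> <user> <dir>", as in the
# six lines quoted in this post. The helper names are hypothetical.

CONTACT = """\
0 n1606 4444 user /local/condor.n1606/execute/dir_8332
4 n1643 4444 user /local/condor.n1643/execute/dir_18030
1 n1608 4444 user /local/condor.n1608/execute/dir_28160
7 n1659 4444 user /local/condor.n1659/execute/dir_21231
5 n1670 4445 user /local/condor.n1670/execute/dir_10512
6 n1629 4445 user /local/condor.n1629/execute/dir_15525
"""


def parse_contact_lines(text):
    """Return one dict per well-formed contact line; skip anything else."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Require the five fields seen above and a numeric node number.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        entries.append({
            "node": int(fields[0]),
            "host": fields[1],
            "port": int(fields[2]),
            "user": fields[3],
            "dir": fields[4],
        })
    return entries


def missing_nodes(entries, machine_count):
    """Return the sorted node numbers in 0..machine_count-1 with no entry."""
    seen = {entry["node"] for entry in entries}
    return sorted(set(range(machine_count)) - seen)


if __name__ == "__main__":
    entries = parse_contact_lines(CONTACT)
    print("registered:", len(entries))
    print("missing node numbers:", missing_nodes(entries, 8))
```

Run against the list above with machine_count 8, this reports 6 registered entries and node numbers 2 and 3 as missing, so those are the two nodes whose StartLog and sshd.sh output would be worth checking.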
I have another question which might be completely unrelated to the former problem.
What happens if a node offers two free slots for a parallel universe job?
Are two keys generated for this node? If so, which one is used?
Thank you and cheers,