
Re: [Condor-users] MPI jobs in parallel universe



Hi Henning,

This statement is at least one source of your problem:

9/15 09:50:12 (pid:819) Attempting to chown '/local/condor.atlas1/spool/cluster946522.proc0.subproc0/.contact.swp' from 5012 to 666.666, but the path was unexpectedly owned by 0

I believe that all of those /local/condor*/ paths should be owned by condor, not root (0).

This usually happens when some part of Condor is launched by the root user rather than by the condor user.

Other, more knowledgeable folks will undoubtedly chime in with more useful info, but that's my take.
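
In case it helps, here is roughly how I would check that on the submit host. The spool path is taken from your SchedLog line; condor_config_val and CONDOR_IDS are standard Condor, but the 'condor' account name below is only my assumption about your setup:

# Who owns the job's spool directory right now?
ls -ld /local/condor.atlas1/spool/cluster946522.proc0.subproc0

# Which uid.gid does Condor drop to when started as root?
# (the 666.666 in your log looks like the value of CONDOR_IDS)
condor_config_val CONDOR_IDS

# If the directory really is owned by root, hand it back to the
# account Condor runs as (assuming that account is called 'condor'):
chown -R condor:condor /local/condor.atlas1/spool/cluster946522.proc0.subproc0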

Thanks

James Burnash

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Henning Fehrmann
Sent: Tuesday, September 15, 2009 5:21 AM
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] MPI jobs in parallel universe

Hello,

we have started running MPI jobs in the parallel universe of Condor v7.2.
On the client nodes the script $CONDOR_PATH/condor-7.2.4/libexec/sshd.sh
is used.
The submit host creates a contact list, which is fetched by the
root node, which then starts the job.
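
For context, the submit description for such a job looks roughly like the following sketch (machine_count and $(Node) are standard parallel-universe submit commands; 'mpi_wrapper.sh' and the file names are placeholders, not our actual files):

cat > mpi_job.sub <<'EOF'
universe        = parallel
executable      = mpi_wrapper.sh
arguments       = my_mpi_program
machine_count   = 8
log             = mpi_job.log
output          = mpi_job.out.$(Node)
error           = mpi_job.err.$(Node)
should_transfer_files   = yes
when_to_transfer_output = on_exit
queue
EOF
condor_submit mpi_job.sub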

We do not have much experience with MPI jobs under Condor yet, but the
creation of the contact list (8 nodes) usually works.

We have now observed, for the first time, a problem which might be a Condor
issue. The first part of the contact list contained an almost complete
set of nodes - only one was missing. About 6 hours later a complete set of
other nodes was appended to this list. The list now contains
two sets of nodes (15 entries), with one node of the first set missing.
The list now looks like this:

3 n1610 4444 user /local/condor.n1610/execute/dir_30754
1 n1609 4444 user /local/condor.n1609/execute/dir_3377
5 n1634 4444 user /local/condor.n1634/execute/dir_16580
0 n1606 4444 user /local/condor.n1606/execute/dir_1836
6 n1618 4444 user /local/condor.n1618/execute/dir_3399
2 n1610 4445 user /local/condor.n1610/execute/dir_30753
4 n1610 4446 user /local/condor.n1610/execute/dir_30760
5 n1623 4444 user /local/condor.n1623/execute/dir_12669
0 n1601 4444 user /local/condor.n1601/execute/dir_3838
1 n1603 4444 user /local/condor.n1603/execute/dir_8032
7 n1632 4444 user /local/condor.n1632/execute/dir_21320
2 n1605 4444 user /local/condor.n1605/execute/dir_30720
6 n1642 4444 user /local/condor.n1642/execute/dir_14027
3 n1609 4445 user /local/condor.n1609/execute/dir_19201
4 n1610 4447 user /local/condor.n1610/execute/dir_14450

The ssh keys of the last 8 nodes were found in the spool directory.
The root node was not able to start the jobs successfully, most likely
because the keys of the first set of 7 nodes were missing.
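
For reference, the keys and the contact file can be inspected directly in the job's spool directory on the submit host (the path is taken from the SchedLog excerpt further down; 'contact' as the file name is an assumption based on the .contact.swp entry there):

ls -la /local/condor.atlas1/spool/cluster946522.proc0.subproc0/
cat /local/condor.atlas1/spool/cluster946522.proc0.subproc0/contact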

I found this in the user's job log:

014 (946522.000.006) 09/14 17:57:21 Node 6 executing on host: <10.10.16.18:33045>
...
014 (946522.000.005) 09/14 17:57:21 Node 5 executing on host: <10.10.16.34:39853>
...
014 (946522.000.001) 09/14 23:44:15 Node 1 executing on host: <10.10.16.3:34512>
...
014 (946522.000.003) 09/14 23:44:15 Node 3 executing on host: <10.10.16.9:52771>
...
014 (946522.000.004) 09/14 23:44:15 Node 4 executing on host: <10.10.16.10:59493>
...
014 (946522.000.002) 09/14 23:44:15 Node 2 executing on host: <10.10.16.5:38208>
...
014 (946522.000.000) 09/14 23:44:15 Node 0 executing on host: <10.10.16.1:40642>
...
014 (946522.000.005) 09/14 23:44:16 Node 5 executing on host: <10.10.16.23:39523>
...
014 (946522.000.007) 09/14 23:44:16 Node 7 executing on host: <10.10.16.32:42198>
...
014 (946522.000.006) 09/14 23:44:16 Node 6 executing on host: <10.10.16.42:60415>
...
001 (946522.000.000) 09/14 23:44:16 Job executing on host: MPI_job



I also found this in the SchedLog of the submit host:

9/14 17:57:18 (pid:7728) Starting add_shadow_birthdate(946522.0)
9/14 17:57:18 (pid:7728) Started shadow for job 946522.0 on slot3@xxxxxxxxxxxxxxxxx <10.10.16.6:32809> for DedicatedScheduler,(shadow pid = 2923)

9/14 23:35:03 (pid:740) 946522.0: JobLeaseDuration remaining: EXPIRED!
9/14 23:44:13 (pid:819) Starting add_shadow_birthdate(946522.0)
9/14 23:44:13 (pid:819) Started shadow for job 946522.0 on slot2@xxxxxxxxxxxxxxxxx <10.10.16.1:40642> for DedicatedScheduler, (shadow pid = 5955)
9/15 09:50:12 (pid:819) Attempting to chown '/local/condor.atlas1/spool/cluster946522.proc0.subproc0/.contact.swp' from 5012 to 666.666, but the path was unexpectedly owned by 0


Even if we increase the JobLeaseDuration, it will probably not solve the
problem of the contact list growing too long.
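
For completeness, raising the lease would just mean adding one line to a submit description such as the sketch above, e.g. (value in seconds; 6 h here only mirrors the gap seen in the log):

echo 'job_lease_duration = 21600' >> mpi_job.sub

but that would not explain why a second set of nodes was appended at all.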

Any ideas?

Thank you and cheers,
Henning
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/

