
Re: [Condor-users] MPI job problem

young董 wrote:

>I'm having a problem using the MPI universe.
>My Condor runs just fine in the vanilla universe,
>but problems appear when using the MPI universe.
>The job finishes,
>but the output is not right.
>Here are the contents of the two output files:
>[condor@hiroyuki 9-simplempijap]$ cat out.0
>p0_7107:  p4_error: Timeout in making connection to
>remote process on hiroyuki.hiroyuki4: 0
>p0_7107: (302.014591) net_send: could not write to
>fd=4, errno = 32
>[condor@hiroyuki 9-simplempijap]$ cat out.1
>rm_23723:  p4_error: Could not gethostbyname for host
>hiroyuki.hiroyuki2; may be invalid name
>: 61
Not sure this is THE main reason, but:
1. The error line above indicates a name resolution problem.
Name resolution is one of the network services Condor relies on.
You should set up an environment that allows both forward and reverse lookups
(either via /etc/hosts or a DNS server).
2. Also note that your host names indicate different DNS domains:
hiroyuki.hiroyuki2 and hiroyuki.hiroyuki4 are in different domains.
if it were:
it would be much better.

3. And if you have multiple interfaces on the machines, you must
specify the interface you want Condor to use with NETWORK_INTERFACE.
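
To check point 1 quickly, something like the following could be run on each machine. It is only a rough sketch using Python's standard socket module, not anything Condor ships; the function name check_host and its resolver parameter are made up for illustration, and the host names are the ones from the error output above.

```python
import socket


def check_host(name, resolver=socket):
    """Return (ok, detail) for a forward and reverse lookup of `name`.

    `resolver` is a hypothetical hook: anything offering gethostbyname()
    and gethostbyaddr(); it defaults to the standard socket module.
    """
    try:
        # Forward lookup: name -> IPv4 address
        addr = resolver.gethostbyname(name)
    except OSError as exc:
        return False, "forward lookup of %s failed: %s" % (name, exc)
    try:
        # Reverse lookup: address -> primary name and aliases
        primary, aliases, _ = resolver.gethostbyaddr(addr)
    except OSError as exc:
        return False, "reverse lookup of %s failed: %s" % (addr, exc)
    known = {n.lower() for n in [primary] + list(aliases)}
    if name.lower() not in known:
        return False, "%s -> %s -> %s (names do not match)" % (name, addr, primary)
    return True, "%s <-> %s" % (name, addr)


if __name__ == "__main__":
    # Host names taken from the p4_error messages quoted above
    for host in ("hiroyuki.hiroyuki2", "hiroyuki.hiroyuki4"):
        ok, detail = check_host(host)
        print("OK  " if ok else "FAIL", detail)
```

If any host reports FAIL on any of the machines, fix /etc/hosts or DNS first and then retry the MPI job.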


>At first I thought the problem was the version
>of mpich, so I downloaded:
>The problem stays the same.
>I have three machines.
>I really need some help, thank you.
>P.S. I am having the same trouble
>this guy is having.