Re: [Condor-users] MPI job problem



young董 wrote:

>Hi,
>I'm having a problem using the MPI universe.
>
>Condor runs just fine in the vanilla universe,
>but problems appear when I use the MPI universe.
>
>The job finishes,
>but the output is not right.
>Here are the contents of the two output files:
>
>[condor@hiroyuki 9-simplempijap]$ cat out.0
>p0_7107:  p4_error: Timeout in making connection to
>remote process on hiroyuki.hiroyuki4: 0
>p0_7107: (302.014591) net_send: could not write to
>fd=4, errno = 32
>
>[condor@hiroyuki 9-simplempijap]$ cat out.1
>rm_23723:  p4_error: Could not gethostbyname for host
>hiroyuki.hiroyuki2; may be invalid name
>: 61
>  
>
Not sure this is THE main reason, but:
1. The gethostbyname error above indicates a name resolution problem.
Name resolution is one of the network services Condor relies on.
You should set up an environment that allows both forward and reverse lookups
(either through /etc/hosts or a DNS server).
2. Also note that your host names indicate different DNS domains:
hiroyuki.hiroyuki2 and hiroyuki.hiroyuki4 are in different domains.
If they were:
hiroyuki2.hiroyuki
hiroyuki4.hiroyuki
it would be much better.

3. And if the machines have multiple network interfaces, you must
specify the interface you want Condor to use with the NETWORK_INTERFACE
configuration setting.
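
To check point 1 on each machine, something like the following could help. It is only a rough sketch, not part of Condor or MPICH: the function name check_host and its forward/reverse parameters are made up for illustration, and it assumes Python 3 is available on the nodes. It does a forward lookup of each name and then a reverse lookup of the resulting address, and reports whether the two agree.

```python
import socket


def check_host(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Check that `name` resolves to an address and that the address maps back.

    `forward` and `reverse` default to the standard resolver calls; they are
    parameters only so the check can be exercised without a real network.
    Returns an (ok, message) tuple.
    """
    try:
        addr = forward(name)
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        return False, f"{name}: forward lookup failed ({exc})"

    try:
        canonical, aliases, _addrs = reverse(addr)
    except OSError as exc:
        return False, f"{name} -> {addr}: reverse lookup failed ({exc})"

    # The reverse lookup should give back the name we started from,
    # either as the canonical name or as one of its aliases.
    known = {canonical.lower(), *(alias.lower() for alias in aliases)}
    if name.lower() not in known:
        return False, f"{name} -> {addr} -> {canonical}: names do not match"
    return True, f"{name} -> {addr} -> {canonical}: ok"


if __name__ == "__main__":
    import sys

    # Pass the host names to check on the command line.
    for host in sys.argv[1:]:
        ok, message = check_host(host)
        print(("PASS " if ok else "FAIL ") + message)
```

Run it on every machine in the pool with the names of all the machines as arguments. Any FAIL line points at an entry that is missing or inconsistent in /etc/hosts or DNS on that machine.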

Max.


>At first I thought the problem was the version
>of mpich, so I downloaded:
>mpich-1.2.2.1.tar.gz
>mpich-1.2.4.tar.gz
>
>The problem stays the same.
>I have three machines.
>
>I really need some help, thank you.
>
>P.S. I am having the same trouble as the person in this thread:
>https://lists.cs.wisc.edu/archive/condor-users/2005-February/msg00263.shtml
>