
Re: [Condor-users] How to troubleshoot MPI job?



Hi Erik,

Thanks for your reply.

I tried what you suggested and installed MPICH version 1.2.4 for the ch_p4 device, but it still isn't working. When I submit with machine_count = 3, the job does not run and just sits "stuck" in the job queue.

When I use machine_count = 2, the job runs and returns the result, but it ran entirely on the same machine.
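
For context, an MPI universe submit description for a job like this in Condor 6.6 would look roughly like the following. This is a minimal sketch, not the actual file: the executable and log names are hypothetical placeholders, and the output naming is only inferred from the outfile.0 and outfile.1 files quoted further down.

```
# Sketch of an MPI universe submit description (Condor 6.6 style).
# my_mpi_program and mpi_job.log are hypothetical names.
universe      = MPI
executable    = my_mpi_program
# Number of machines Condor must claim before the job can start.
machine_count = 2
# $(NODE) expands to the node number, giving outfile.0, outfile.1, ...
output        = outfile.$(NODE)
error         = errfile.$(NODE)
log           = mpi_job.log
queue
```

A job submitted this way with condor_submit should stay idle in the queue until Condor can claim machine_count machines at the same time, which may be relevant to the machine_count = 3 case.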

Another thread on this list, "Condor with MPI-over-ssh", describes the same issue as mine.

My questions are:
1. How do I check which MPICH installation Condor is using to run the job?
2. What does Condor need from MPICH to run the MPI job?
3. Is there a way to monitor/observe how Condor runs the MPI job?
4. How does Condor know which machines to use?

Or is there anything else that I need to check in order for Condor to be able to run an MPI job?

Thanks,
Nigel

Erik Paulson wrote:

On Tue, Feb 15, 2005 at 05:04:45PM +0800, Nigel Teow wrote:


Hi,

I installed Condor (version 6.6.8) on a cluster.

I am able to use condor_submit to run the MPI job on a single node, but when I try to run it on 2 nodes, it fails. The output files follow:

outfile.0
-----------
p0_28434: p4_error: Child process exited while making connection to remote process on compute-0-1.local: 0
p0_28434: (2.007812) net_send: could not write to fd=4, errno = 32


outfile.1
-----------
rm_28438: (-) net_recv failed for fd = 3
rm_28438:  p4_error: net_recv read, errno = : 104




That looks like an error with MPICH 1.2.5 or later. Use 1.2.4.

-Erik
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users