[Condor-users] How to troubleshoot MPI job
- Date: Tue, 15 Feb 2005 17:04:45 +0800
- From: Nigel Teow <nigelt@xxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] How to troubleshoot MPI job
Hi,
I have installed Condor (version 6.6.8) on a cluster.
I am able to use condor_submit to run the MPI job on a single node, but
when I try to run it on 2 nodes, it fails. The output files are as follows:
outfile.0
-----------
p0_28434: p4_error: Child process exited while making connection to
remote process on compute-0-1.local: 0
p0_28434: (2.007812) net_send: could not write to fd=4, errno = 32
outfile.1
-----------
rm_28438: (-) net_recv failed for fd = 3
rm_28438: p4_error: net_recv read, errno = : 104
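For reference, the errno values in these messages can be looked up with a short script. This is only a minimal sketch: the helper name describe_errno is hypothetical, and the numeric codes 32 and 104 are assumed to follow Linux numbering (where they correspond to EPIPE and ECONNRESET), since the names are platform-dependent.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an errno value on the current platform.

    describe_errno is a hypothetical helper for illustration only.
    """
    # errno.errorcode maps numeric codes to symbolic names for this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's human-readable message for the code.
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # 32 and 104 are the values reported in outfile.0 and outfile.1;
    # their meaning here assumes Linux errno numbering.
    for code in (32, 104):
        print(code, describe_errno(code))
```

Both codes indicate that the other end of a socket went away, which is consistent with the "Child process exited" message but does not by itself say why the remote process exited.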
The following is the submit description file I used:
##################################
# Condor submit description file
##################################
universe = MPI
executable = hello-world
log = logfile
input = /dev/null
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 2
queue
I am able to run the hello-world program directly using mpirun.
I would appreciate it if anyone could advise on how I could troubleshoot
this. Thanks in advance.
Nigel
--
Nigel Teow
Bioinformatics Institute