
[Condor-users] MPI Parallel submission in v7.6.2



Hello,

So vanilla jobs are running on my upgraded (to 7.6.2) cluster, but
parallel universe MPI jobs are another matter. I've brought across all
the settings from the previous local configs, including the
specification of a dedicated scheduler (on the head node) and a policy
for dedicated jobs, all of which worked before.
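
For reference, the dedicated-job policy on the two parallel nodes is
roughly the stock dedicated-resource example config (the scheduler name
below is my head node; treat the rest as a sketch rather than a copy of
my exact file):

    DedicatedScheduler = "DedicatedScheduler@queen.bioinformatics"
    STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
    START        = Scheduler =?= $(DedicatedScheduler)
    SUSPEND      = False
    CONTINUE     = True
    PREEMPT      = False
    KILL         = False
    WANT_SUSPEND = False
    WANT_VACATE  = False
    RANK         = Scheduler =?= $(DedicatedScheduler)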

Using a previously successful submit file, the job produces the errors
shown in the logs below.
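
For reference, the submit file is along these lines (the wrapper script
name and machine count here are illustrative; the arguments and the
output/error naming match what shows up in the logs below):

    universe       = parallel
    executable     = mp1script
    arguments      = /share/apps/clustalw-mpi -infile=200_contigs.fas
    machine_count  = 4
    output         = mpi_clustal_200.$(Node).output.out
    error          = mpi_clustal_200.$(Node).error.out
    log            = mpi_clustal_200.log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue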

*JOB LOG
007 (015.000.000) 08/12 17:39:31 Shadow exception!
	Error from slot2@xxxxxxxxxxxxxxxxx: StreamHandler: stdout:
couldn't lseek on mpi_clustal_200.2.output.out to 0: No such file or
directory

*HEAD NODE : SHADOWLOG 
08/12/11 17:39:31 (15.0) (24944): ERROR "Error from
slot2@xxxxxxxxxxxxxxxxx: StreamHandler: stdout: couldn't lseek on
mpi_clustal_200.2.output.out to 0: No such file or directory
" at line 676 in file
/home/condor/execute/dir_24541/userdir/src/condor_shadow.V6.1/pseudo_ops
.cpp

*COMPUTE-1-0 : STARTERLOG.SLOT2
08/12/11 17:39:30 Error file:
/state/partition1/condor/execute/dir_15095/mpi_clustal_200.2.error.out
08/12/11 17:39:30 About to exec
/state/partition1/condor/execute/dir_15095/condor_exec.exe
/share/apps/clustalw-mpi -infile=200_contigs.fas
08/12/11 17:39:30 Create_Process succeeded, pid=15123
08/12/11 17:39:31 IOProxy: accepting connection from 10.1.255.254
08/12/11 17:39:31 IOProxyHandler: closing connection to 10.1.255.254
08/12/11 17:39:31 ERROR "StreamHandler: stdout: couldn't lseek on
mpi_clustal_200.2.output.out to 0: No such file or directory
" at line 128 in file
/home/condor/execute/dir_24541/userdir/src/condor_starter.V6.1/stream_ha
ndler.cpp
08/12/11 17:39:31 ShutdownFast all jobs.

*ON COMPUTE-1-0
ls ../execute/dir_15095
ls: ../execute/dir_15095: No such file or directory

So the directory the starter is trying to read from doesn't exist! All
the other execute/dir_1509* directories exist on both nodes that are
set up for parallel processing, and the problem can occur on either
node (compute-1-0 or compute-1-1). I'll admit the inner workings of a
parallel job aren't clear to me: is this directory moved as part of the
normal process, was it deleted too early, or was it never created in
the first place?
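
To try to catch what happens to that execute directory, my next step is
to turn up the logging on the two parallel nodes and re-run, with
something like this in their local config (followed by a
condor_reconfig):

    STARTER_DEBUG = D_FULLDEBUG
    STARTD_DEBUG  = D_FULLDEBUG

That should at least show whether dir_15095 gets created and then
removed too early, or is never created at all.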



I'm also getting issues with user priority during parallel MPI jobs
that didn't come up before. Is the <none> user supposed to be there? It
doesn't ring a bell from any previous incarnations of Condor. Possibly
related to the above issue...

condor_q 15 -better-analyze
-- Submitter: queen.bioinformatics : <xxx.xxx.xxx.xx:35650> :
queen.bioinformatics
---
015.000:  Run analysis summary.  Of 17 machines,
      0 are rejected by your job's requirements 
      1 reject your job because of their own requirements 
     10 match but are serving users with a better priority in the pool 
      0 match but reject the job for unknown reasons 
      0 match but will not currently preempt their existing job 
      0 match but are currently offline 
      6 are available to run your job
        Last successful match: Fri Aug 12 17:46:31 2011

queen1[MPI_clustal]> condor_userprio -all
Last Priority Update:  8/12 17:46
                                    Effective   Real     Priority   Res   Total Usage       Usage            Last
User Name                           Priority  Priority    Factor    Used (wghted-hrs)    Start Time       Usage Time
------------------------------      --------- -------- ------------ ---- ------------ ---------------- ----------------
DedicatedScheduler@xxxxxxxxxxx           0.50     0.50         1.00    0       174.22  4/21/2009 17:14  8/12/2011 17:46
steve@bioinformatics                     0.69     0.69         1.00   10      1627.34  2/05/2009 16:45  8/12/2011 17:46
<none>                                   0.71     0.71         1.00   10        10.27  8/11/2011 11:07  8/12/2011 17:46
------------------------------      --------- -------- ------------ ---- ------------ ---------------- ----------------
Number of users: 2                                                    10      1801.56  2/05/2009 16:45  8/11/2011 17:47

I appreciate the help that you've already given. 

Steve