
[Condor-users] condor & mpi



I have been trying to run MPI jobs through Condor. The following error appears
in the job's log file:

000 (4406.000.000) 07/12 15:29:26 Job submitted from host: <10.10.10.1:60284>
...
014 (4406.000.000) 07/12 15:29:33 Node 0 executing on host:
<10.10.10.50:36703?CCBID=144.122.72.10:9618#1232>
...
001 (4406.000.000) 07/12 15:29:33 Job executing on host: MPI_job
...
007 (4406.000.000) 07/12 15:29:33 Shadow exception!
        Error from slot7@atmaca50: File
/var/condor/spool/cluster4406.proc0.subproc0/0.key maps to url
local:/var/condor/spool/cluster4406.proc0.subproc0/0.key, which I don't know how
to open.

        0  -  Run Bytes Sent By Job
        2703  -  Run Bytes Received By Job


The Condor configuration works for serial jobs (vanilla and standard universes).
It also works properly in the parallel universe with simple jobs (see the "sleep"
or "cat" examples in the user manual). However, the problem persists with MPI
jobs. I compiled a simple "hello world" program with LAM's mpif77 compiler and
configured the lamscript wrapper accordingly. At first I thought it was a
permission issue with the /var/condor/spool directory, but setting it to mode
777 did not help.
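
For context, a parallel-universe submit file for this kind of job, along the
lines of the lamscript example in the user manual, would look roughly like the
sketch below. The executable name hello_mpi, the machine count, and the
output/error/log file names are placeholders for illustration, not values taken
from the actual job:

        # Sketch only: hello_mpi and all file names are assumed placeholders
        universe                = parallel
        executable              = lamscript
        arguments               = hello_mpi
        machine_count           = 2
        output                  = out.$(NODE)
        error                   = err.$(NODE)
        log                     = hello_mpi.log
        should_transfer_files   = yes
        when_to_transfer_output = on_exit
        transfer_input_files    = hello_mpi
        queue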

I am going crazy because of this problem. Help please!
Thanks in advance