
[Condor-users] Shadow exception mpi jobs



Hi all,

I have a problem with Condor and MPI. When I submit an MPI job, it runs, but when it finishes it goes back into the queue, and its state is then "Idle". Our cluster is distributed, and MPI is installed locally on each node. The log output is:

007 (043.000.000) 07/26 13:50:19 Shadow exception!
       UserPolicy Error: No signal/exit codes in job ad!
       125  -  Run Bytes Sent By Job
       32120  -  Run Bytes Received By Job

But the results are correct!!
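
To see whether the same exception is logged on every run, the event 007 blocks can be pulled out of the user log (s.log in the submit file below). This is only a rough sketch: the function name shadow_exceptions is made up for illustration, and it assumes the usual user log layout in which each event starts with a three-digit event number and ends with a line containing only "...".

```shell
# Print every "Shadow exception" (event 007) block from a Condor user log.
# Assumption: each event starts with a three-digit number at column 0
# and is terminated by a line that contains only "...".
shadow_exceptions() {
    awk '
        /^[0-9][0-9][0-9] \(/ { show = ($1 == "007") }  # event header: keep only 007
        /^\.\.\.$/            { show = 0; next }        # event terminator
        show                  { print }                 # lines of a wanted event
    ' "$1"
}
```

It would be called as: shadow_exceptions s.log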

I read this in the manual:

Event Number: 007
Event Name: Shadow exception
Event Description: The condor_shadow, a program on the submit computer that watches over the job and performs some services for the job, failed for some catastrophic reason. The job will leave the machine and go back into the queue.
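
Since the error message says there are no signal/exit codes in the job ad, one possible check is to look at the job ad once the job is back in the queue. The filter below is a minimal sketch: exit_attrs is a made-up helper name, and ExitBySignal, ExitCode and ExitSignal are the standard job ad attribute names, which may or may not be present for this job.

```shell
# Keep only the exit-related attributes from a job ClassAd read on stdin.
# grep returns a non-zero status when none of these attributes are present.
exit_attrs() {
    grep -E '^(ExitBySignal|ExitCode|ExitSignal)[[:space:]]*='
}
```

For the job in the log above (cluster 43, proc 0) this could be run as: condor_q -long 43.0 | exit_attrs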

My classad is very simple:

CLASSAD

universe = parallel
executable = script
arguments = nodes a.out

Scheduler="DedicatedScheduler@XXXXXX"
machine_count=5

log = s.log
output = s.out
error = s.err
transfer_input_files = hello.c,a.out,nodes

should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT

queue

#############################

SCRIPT
#!/bin/bash
export LD_LIBRARY_PATH=/opt/openmpi/lib

/opt/openmpi/bin/mpirun --mca ssh /usr/local/condor/libexec/condor_ssh --hostfile $1 -np 5 $2

#############################
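
To confirm that the script itself exits cleanly, the mpirun call could be wrapped so that its exit status is written to stderr (which ends up in s.err) and then returned unchanged. This is a diagnostic sketch only, not a known fix for the shadow exception; run_and_report is a hypothetical helper name.

```shell
# Run the given command, report its exit status on stderr,
# and return that same status to the caller.
run_and_report() {
    "$@"
    rr_status=$?
    echo "exit status: $rr_status" >&2
    return "$rr_status"
}
```

In the script above, the mpirun line would then start with run_and_report, followed by the same mpirun command and arguments, and the script would end with: exit $?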


I don't know how to resolve the shadow exception. Any ideas?

Thanks

Regards!
--

Ana Silva		
Sistemas y Supercomputación
Centro Informático Científico de Andalucía (CICA)
Avda. Reina Mercedes s/n - 41012 - Sevilla (Spain) Tfno.: +34 955 056 600 / +34 955 056 632 / FAX: +34 955 056 650
Consejería de Innovación, Ciencia y Empresa
Junta de Andalucía
---------------------------------------------------
Portal de E-Ciencia de Andalucía
http://eciencia.cica.es
http://supercomputacion.cica.es
---------------------------------------------------