
[Condor-users] Shadow exception



Hi,

I have a problem with one specific job, and I need some help.

This job runs fine when started from a shell, on any machine (submit or execute).
This job fails when submitted from another submit machine, regardless of the user.
The submit machine can successfully condor_submit other jobs.

I'm using Condor 6.7.18 on Linux (Debian sarge) machines; almost everything is on NFS.

Any help would be appreciated.
Thanks in advance
Nicolas

__________________________________________
Here is the submit file:

#more longint.cmd
Universe = vanilla
Executable = /nfs/longint.sh
environment = LD_LIBRAIRY_PATH=/opt/intel_fc_80/lib
output = test.out
error  = test.err
log = test.log
notify_user = user@xxxxxxxxx
notification = error
Rank = Mips
queue
________________________________________________________
Here is the test.log:

001 (004.000.000) 12/15 17:30:59 Job executing on host: <XX.XX.XX.XX:32770>
...
007 (004.000.000) 12/15 17:30:59 Shadow exception!
        Error from starter on vm1@xxxxxxxxxxxxxxxxxxx: Failed to execute '/ibpc/pogo/sacquin/Auxprog/Int erProp/runcontrol/longint.sh condor_exec.exe': Exec format error
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
001 (004.000.000) 12/15 17:31:01 Job executing on host: <XX.XX.XX.XX:32770>
...
007 (004.000.000) 12/15 17:31:01 Shadow exception!
        Error from starter on vm1@xxxxxxxxxxxxxxxxxxx: Failed to execute '/ibpc/pogo/sacquin/Auxprog/Int erProp/runcontrol/longint.sh condor_exec.exe': Exec format error
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
001 (004.000.000) 12/15 17:31:03 Job executing on host: <XX.XX.XX.XX:32770>
...
007 (004.000.000) 12/15 17:31:03 Shadow exception!
        Error from starter on vm1@xxxxxxxxxxxxxxxxxxx: Failed to execute '/ibpc/pogo/sacquin/Auxprog/Int erProp/runcontrol/longint.sh condor_exec.exe': Exec format error
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
001 (004.000.000) 12/15 17:31:06 Job executing on host: <XX.XX.XX.XX:32770>
...
007 (004.000.000) 12/15 17:31:06 Shadow exception!
        Error from starter on vm1@xxxxxxxxxxxxxxxxxxx: Failed to execute '/ibpc/pogo/sacquin/Auxprog/Int erProp/runcontrol/longint.sh condor_exec.exe': Exec format error
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
009 (004.000.000) 12/15 17:34:56 Job was aborted by the user.
        via condor_rm (by user sacquin)
...
____________________________________________________________________
Here is the ShadowLog of the submit machine:

12/15 17:15:28 ******************************************************
12/15 17:15:28 ** condor_shadow (CONDOR_SHADOW) STARTING UP
12/15 17:15:28 ** /nfs/condor-versions/condor-6.7.18/sbin/condor_shadow
12/15 17:15:28 ** $CondorVersion: 6.7.18 Mar 22 2006 $
12/15 17:15:28 ** $CondorPlatform: I386-LINUX_RH9 $
12/15 17:15:28 ** PID = 11663
12/15 17:15:28 ******************************************************
12/15 17:15:28 Using config file: /nfs/condor/etc/condor_config
12/15 17:15:28 Using local config files: /scratch/condor/condor_config.local
12/15 17:15:28 DaemonCore: Command Socket at <XX.XX.XX.125:42705>
12/15 17:15:28 Initializing a VANILLA shadow for job 628.0
12/15 17:15:29 (628.0) (11663): Request to run on <XX.XX.XX.19:32770> was ACCEPTED
12/15 17:15:29 (628.0) (11663): FileLock::obtain(1) failed - errno 9 (Bad file descriptor)
12/15 17:15:29 (628.0) (11663): FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
12/15 17:15:29 (628.0) (11663): ERROR "Error from starter on vm1@xxxxxxxxxxxxxxxxxxxxx: Failed to execute '/nfs/longint.sh condor_exec.exe': Exec format error" at line 597 in file pseudo_ops.C
12/15 17:15:29 (628.0) (11663): FileLock::obtain(1) failed - errno 9 (Bad file descriptor)
12/15 17:15:29 (628.0) (11663): FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
12/15 17:15:30 (631.0) (11651): FileLock::obtain(1) failed - errno 9 (Bad file descriptor)
12/15 17:15:30 (631.0) (11651): FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
12/15 17:15:31 ******************************************************
12/15 17:15:31 ** condor_shadow (CONDOR_SHADOW) STARTING UP
12/15 17:15:31 ** /nfs/condor-versions/condor-6.7.18/sbin/condor_shadow
12/15 17:15:31 ** $CondorVersion: 6.7.18 Mar 22 2006 $
12/15 17:15:31 ** $CondorPlatform: I386-LINUX_RH9 $
12/15 17:15:31 ** PID = 11668
12/15 17:15:31 ******************************************************
12/15 17:15:31 Using config file: /nfs/condor/etc/condor_config
12/15 17:15:31 Using local config files: /scratch/condor/condor_config.local
12/15 17:15:31 DaemonCore: Command Socket at <XX.XX.XX.125:42708>
12/15 17:15:31 Initializing a VANILLA shadow for job 628.0
12/15 17:15:31 (628.0) (11668): Request to run on <XX.XX.XX.19:32770> was ACCEPTED
12/15 17:15:31 (628.0) (11668): FileLock::obtain(1) failed - errno 9 (Bad file descriptor)
12/15 17:15:31 (628.0) (11668): FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
12/15 17:15:31 (628.0) (11668): ERROR "Error from starter on vm1@xxxxxxxxxxxxxxxxxxxxx: Failed to execute '/nfs/longint.sh condor_exec.exe': Exec format error" at line 597 in file pseudo_ops.C
12/15 17:15:31 (628.0) (11668): FileLock::obtain(1) failed - errno 9 (Bad file descriptor)
12/15 17:15:31 (628.0) (11668): FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
root@merle:~#                     
__________________________________________________

Here is the StarterLog on the execute machine:

12/15 17:47:43 ******************************************************
12/15 17:47:43 ** condor_starter (CONDOR_STARTER) STARTING UP
12/15 17:47:43 ** /nfs/condor-versions/condor-6.7.18/sbin/condor_starter
12/15 17:47:43 ** $CondorVersion: 6.7.18 Mar 22 2006 $
12/15 17:47:43 ** $CondorPlatform: I386-LINUX_RH9 $
12/15 17:47:43 ** PID = 21528
12/15 17:47:43 ******************************************************
12/15 17:47:43 Using config file: /nfs/condor/etc/condor_config
12/15 17:47:43 Using local config files: /scratch/condor/condor_config.local
12/15 17:47:43 DaemonCore: Command Socket at <XX.XX.XX.19:42374>
12/15 17:47:43 Done setting resource limits
12/15 17:47:43 Communicating with shadow <XX.XX.XX.14:52436>
12/15 17:47:43 Submitting machine is "chagall.my.domain"
12/15 17:47:43 Starting a VANILLA universe job with ID: 5.0
12/15 17:47:43 IWD: /nfs
12/15 17:47:43 Output file: /nfs/test.out
12/15 17:47:43 Error file: /nfs/test.err
12/15 17:47:43 Renice expr "1" evaluated to 1
12/15 17:47:43 About to exec /nfs/longint.sh condor_exec.exe
12/15 17:47:43 Create_Process: child failed with errno 13 (Permission denied) before exec()
12/15 17:47:43 ERROR "Create_Process(/nfs/longint.sh,condor_exec.exe, ...) failed" at line 378 in file os_proc.C
12/15 17:47:43 ShutdownFast all jobs.
12/15 17:47:45 ******************************************************
12/15 17:47:45 ** condor_starter (CONDOR_STARTER) STARTING UP
12/15 17:47:45 ** /nfs/condor-versions/condor-6.7.18/sbin/condor_starter
12/15 17:47:45 ** $CondorVersion: 6.7.18 Mar 22 2006 $
12/15 17:47:45 ** $CondorPlatform: I386-LINUX_RH9 $
12/15 17:47:45 ** PID = 21531
12/15 17:47:45 ******************************************************
12/15 17:47:45 Using config file: /nfs/condor/etc/condor_config
12/15 17:47:45 Using local config files: /scratch/condor/condor_config.local
12/15 17:47:45 DaemonCore: Command Socket at <XX.XX.XX.19:42375>
12/15 17:47:45 Done setting resource limits
12/15 17:47:45 Communicating with shadow <XX.XX.XX.14:52438>
12/15 17:47:45 Submitting machine is "chagall.my.domain"
12/15 17:47:45 Starting a VANILLA universe job with ID: 5.0
12/15 17:47:45 IWD: /nfs
12/15 17:47:45 Output file: /nfs/test.out
12/15 17:47:45 Error file: /nfs/test.err
12/15 17:47:45 Renice expr "1" evaluated to 1
12/15 17:47:45 About to exec /nfs/longint.sh condor_exec.exe
12/15 17:47:45 Create_Process: child failed with errno 13 (Permission denied) before exec()
12/15 17:47:45 ERROR "Create_Process(/nfs/longint.sh,condor_exec.exe, ...) failed" at line 378 in file os_proc.C
12/15 17:47:45 ShutdownFast all jobs.
couronnes:~#                                   
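
The two errors in these logs ("Exec format error" and errno 13, "Permission denied") are ones exec() can return when a script has no "#!" interpreter line or no execute bit. A shell usually falls back to interpreting such a file itself, which might explain why the job runs from a shell. Below is a minimal sketch of a sanity check, assuming Python is available on the execute machine. The function name check_script is hypothetical, and the check is not exhaustive: it does not cover NFS mount options such as noexec, directory permissions, or the account the starter runs the job as.

```python
import os
import stat


def check_script(path):
    """Return a list of likely reasons why a direct exec() of `path` could fail.

    Hypothetical helper for illustration; it only inspects the file itself.
    """
    problems = []

    # errno 13 (Permission denied): exec() needs an execute bit on the file.
    mode = os.stat(path).st_mode
    if not mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
        problems.append("no execute bit set")

    # Exec format error: the kernel expects an ELF header or a "#!" line.
    with open(path, "rb") as handle:
        head = handle.read(4)
    if not (head.startswith(b"#!") or head == b"\x7fELF"):
        problems.append("no '#!' interpreter line or ELF header")

    return problems
```

For example, calling check_script("/nfs/longint.sh") on the execute machine would return an empty list if neither problem applies; the file would then pass these two checks, and the cause would have to be looked for elsewhere.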


Thanks again for being courageous enough to read this far :)

----------------------------------------------------
CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
Institut de Biologie Physico-Chimique
13 rue Pierre et Marie Curie
75005 PARIS - FRANCE

Tel : +33 158 41 51 70
Fax : +33 158 41 50 26
----------------------------------------------------