
Re: [Condor-users] Shadow exception



Just a thought, but the following error:

condor_exec.exe': Exec format error" 

can occur when a binary has been compiled for a different architecture. Are all of your Condor nodes running a Linux-based OS, and were the binaries compiled for Linux?
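
As a quick way to look into the "Exec format error" (errno ENOEXEC), here is a minimal sketch that inspects the first bytes of the executable. It rests on an assumption not confirmed anywhere in this thread: that the file may simply lack an interpreter line. An interactive shell will usually fall back to interpreting such a script itself, whereas a direct exec() fails with ENOEXEC, which would fit "runs fine from a shell, fails under Condor". The function names are hypothetical, and the check does not tell you which architecture an ELF binary targets.

```python
ELF_MAGIC = b"\x7fELF"  # first four bytes of any ELF binary


def classify_header(head: bytes) -> str:
    """Classify the leading bytes of a file the way exec() roughly would."""
    if head.startswith(b"#!"):
        return "script with interpreter line"
    if head.startswith(ELF_MAGIC):
        return "ELF binary"
    # Anything else is a candidate for "Exec format error" on a direct exec().
    return "unrecognized format"


def describe_executable(path: str) -> str:
    """Read the first four bytes of `path` and classify them."""
    with open(path, "rb") as f:
        return classify_header(f.read(4))
```

Running `describe_executable("/nfs/longint.sh")` on the execute machine (path taken from the submit file) would show whether the script starts with `#!`. If it reports "unrecognized format", adding a first line such as `#!/bin/sh` is the usual remedy.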

Also, the StarterLog on the execute machine shows the job being launched from NFS and failing with a "Permission denied" error. Could there be a configuration mismatch between some of your machines and the NFS server? For example, do the clients that run the job successfully have access to the NFS export while the problematic ones don't?
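
To narrow down the "Permission denied" (errno 13), a small sketch like the following could be run on the execute machine, as whichever account the job actually runs under (that account is an assumption here; it may not be the submitting user). It walks the path from the root and reports the first component the current user cannot search or execute. The function name is hypothetical.

```python
import os


def first_inaccessible(path: str):
    """Return the first component of `path`, starting from the root, that the
    current user cannot search (directory) or execute (file), else None.
    A component that does not exist is also reported."""
    current = os.path.abspath(path)
    components = []
    while True:
        components.append(current)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    # Check from the root downwards so the outermost problem is found first.
    for component in reversed(components):
        if not os.access(component, os.X_OK):
            return component
    return None
```

For instance, `first_inaccessible("/nfs/longint.sh")` returning `/nfs` would point at the mount or export, while a result equal to the script path itself would point at the file's own mode bits.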

Just a few thoughts...

Shaun






-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx on behalf of Nicolas GUIOT
Sent: Fri 12/15/2006 5:01 PM
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] Shadow exception
 
Hi,

I have a problem with one specific job, and I need some help.

This job runs fine when started from a shell, on any machine (submit/execute).
This job fails when submitted from another submit machine, whoever the user is.
The submit machine can condor_submit some other jobs.

I'm using 6.7.18 on Linux (Debian sarge) machines; almost everything is on NFS.

Any help would be appreciated.
Thanks in advance
Nicolas

__________________________________________
Here is the submit file : 

#more longint.cmd
Universe = vanilla
Executable = /nfs/longint.sh
environment = LD_LIBRAIRY_PATH=/opt/intel_fc_80/lib
output = test.out
error  = test.err
log = test.log
notify_user = user@xxxxxxxxx
notification = error
Rank = Mips
queue
________________________________________________________
Then here is the test.log : 

001 (004.000.000) 12/15 17:30:59 Job executing on host: <XX.XX.XX.XX:32770>
...
007 (004.000.000) 12/15 17:30:59 Shadow exception!
        Error from starter on vm1@xxxxxxxxxxxxxxxxxxx: Failed to execute '/ibpc/pogo/sacquin/Auxprog/Int erProp/runcontrol/longint.sh condor_exec.exe': Exec format error
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
001 (004.000.000) 12/15 17:31:01 Job executing on host: <XX.XX.XX.XX:32770>
...
007 (004.000.000) 12/15 17:31:01 Shadow exception!
        Error from starter on vm1@xxxxxxxxxxxxxxxxxxx: Failed to execute '/ibpc/pogo/sacquin/Auxprog/Int erProp/runcontrol/longint.sh condor_exec.exe': Exec format error
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
001 (004.000.000) 12/15 17:31:03 Job executing on host: <XX.XX.XX.XX:32770>
...
007 (004.000.000) 12/15 17:31:03 Shadow exception!
        Error from starter on vm1@xxxxxxxxxxxxxxxxxxx: Failed to execute '/ibpc/pogo/sacquin/Auxprog/Int erProp/runcontrol/longint.sh condor_exec.exe': Exec format error
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
001 (004.000.000) 12/15 17:31:06 Job executing on host: <XX.XX.XX.XX:32770>
...
007 (004.000.000) 12/15 17:31:06 Shadow exception!
        Error from starter on vm1@xxxxxxxxxxxxxxxxxxx: Failed to execute '/ibpc/pogo/sacquin/Auxprog/Int erProp/runcontrol/longint.sh condor_exec.exe': Exec format error
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
009 (004.000.000) 12/15 17:34:56 Job was aborted by the user.
        via condor_rm (by user sacquin)
...
____________________________________________________________________
Here is the ShadowLog of the submit machine : 

12/15 17:15:28 ******************************************************
12/15 17:15:28 ** condor_shadow (CONDOR_SHADOW) STARTING UP
12/15 17:15:28 ** /nfs/condor-versions/condor-6.7.18/sbin/condor_shadow
12/15 17:15:28 ** $CondorVersion: 6.7.18 Mar 22 2006 $
12/15 17:15:28 ** $CondorPlatform: I386-LINUX_RH9 $
12/15 17:15:28 ** PID = 11663
12/15 17:15:28 ******************************************************
12/15 17:15:28 Using config file: /nfs/condor/etc/condor_config
12/15 17:15:28 Using local config files: /scratch/condor/condor_config.local
12/15 17:15:28 DaemonCore: Command Socket at <XX.XX.XX.125:42705>
12/15 17:15:28 Initializing a VANILLA shadow for job 628.0
12/15 17:15:29 (628.0) (11663): Request to run on <XX.XX.XX.19:32770> was ACCEPTED
12/15 17:15:29 (628.0) (11663): FileLock::obtain(1) failed - errno 9 (Bad file descriptor)
12/15 17:15:29 (628.0) (11663): FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
12/15 17:15:29 (628.0) (11663): ERROR "Error from starter on vm1@xxxxxxxxxxxxxxxxxxxxx: Failed to execute'/nfs/longint.sh condor_exec.exe': Exec format error" at line 597 in file pseudo_ops.C
12/15 17:15:29 (628.0) (11663): FileLock::obtain(1) failed - errno 9 (Bad file descriptor)
12/15 17:15:29 (628.0) (11663): FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
12/15 17:15:30 (631.0) (11651): FileLock::obtain(1) failed - errno 9 (Bad file descriptor)
12/15 17:15:30 (631.0) (11651): FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
12/15 17:15:31 ******************************************************
12/15 17:15:31 ** condor_shadow (CONDOR_SHADOW) STARTING UP
12/15 17:15:31 ** /nfs/condor-versions/condor-6.7.18/sbin/condor_shadow
12/15 17:15:31 ** $CondorVersion: 6.7.18 Mar 22 2006 $
12/15 17:15:31 ** $CondorPlatform: I386-LINUX_RH9 $
12/15 17:15:31 ** PID = 11668
12/15 17:15:31 ******************************************************
12/15 17:15:31 Using config file: /nfs/condor/etc/condor_config
12/15 17:15:31 Using local config files: /scratch/condor/condor_config.local
12/15 17:15:31 DaemonCore: Command Socket at <XX.XX.XX.125:42708>
12/15 17:15:31 Initializing a VANILLA shadow for job 628.0
12/15 17:15:31 (628.0) (11668): Request to run on <XX.XX.XX.19:32770> was ACCEPTED
12/15 17:15:31 (628.0) (11668): FileLock::obtain(1) failed - errno 9 (Bad file descriptor)
12/15 17:15:31 (628.0) (11668): FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
12/15 17:15:31 (628.0) (11668): ERROR "Error from starter on vm1@xxxxxxxxxxxxxxxxxxxxx: Failed to execute'/nfs/longint.sh condor_exec.exe': Exec format error" at line 597 in file pseudo_ops.C
12/15 17:15:31 (628.0) (11668): FileLock::obtain(1) failed - errno 9 (Bad file descriptor)
12/15 17:15:31 (628.0) (11668): FileLock::obtain(2) failed - errno 9 (Bad file descriptor)
root@merle:~#
__________________________________________________

Here is the StarterLog on the execute machine : 

12/15 17:47:43 ******************************************************
12/15 17:47:43 ** condor_starter (CONDOR_STARTER) STARTING UP
12/15 17:47:43 ** /nfs/condor-versions/condor-6.7.18/sbin/condor_starter
12/15 17:47:43 ** $CondorVersion: 6.7.18 Mar 22 2006 $
12/15 17:47:43 ** $CondorPlatform: I386-LINUX_RH9 $
12/15 17:47:43 ** PID = 21528
12/15 17:47:43 ******************************************************
12/15 17:47:43 Using config file: /nfs/condor/etc/condor_config
12/15 17:47:43 Using local config files: /scratch/condor/condor_config.local
12/15 17:47:43 DaemonCore: Command Socket at <XX.XX.XX.19:42374>
12/15 17:47:43 Done setting resource limits
12/15 17:47:43 Communicating with shadow <XX.XX.XX.14:52436>
12/15 17:47:43 Submitting machine is "chagall.my.domain"
12/15 17:47:43 Starting a VANILLA universe job with ID: 5.0
12/15 17:47:43 IWD: /nfs
12/15 17:47:43 Output file: /nfs/test.out
12/15 17:47:43 Error file: /nfs/test.err
12/15 17:47:43 Renice expr "1" evaluated to 1
12/15 17:47:43 About to exec /nfs/longint.sh condor_exec.exe
12/15 17:47:43 Create_Process: child failed with errno 13 (Permission denied) before exec()
12/15 17:47:43 ERROR "Create_Process(/nfs/longint.sh,condor_exec.exe, ...) failed" at line 378 in file os_proc.C
12/15 17:47:43 ShutdownFast all jobs.
12/15 17:47:45 ******************************************************
12/15 17:47:45 ** condor_starter (CONDOR_STARTER) STARTING UP
12/15 17:47:45 ** /nfs/condor-versions/condor-6.7.18/sbin/condor_starter
12/15 17:47:45 ** $CondorVersion: 6.7.18 Mar 22 2006 $
12/15 17:47:45 ** $CondorPlatform: I386-LINUX_RH9 $
12/15 17:47:45 ** PID = 21531
12/15 17:47:45 ******************************************************
12/15 17:47:45 Using config file: /nfs/condor/etc/condor_config
12/15 17:47:45 Using local config files: /scratch/condor/condor_config.local
12/15 17:47:45 DaemonCore: Command Socket at <XX.XX.XX.19:42375>
12/15 17:47:45 Done setting resource limits
12/15 17:47:45 Communicating with shadow <XX.XX.XX.14:52438>
12/15 17:47:45 Submitting machine is "chagall.my.domain"
12/15 17:47:45 Starting a VANILLA universe job with ID: 5.0
12/15 17:47:45 IWD: /nfs
12/15 17:47:45 Output file: /nfs/test.out
12/15 17:47:45 Error file: /nfs/test.err
12/15 17:47:45 Renice expr "1" evaluated to 1
12/15 17:47:45 About to exec /nfs/longint.sh condor_exec.exe
12/15 17:47:45 Create_Process: child failed with errno 13 (Permission denied) before exec()
12/15 17:47:45 ERROR "Create_Process(/nfs/longint.sh,condor_exec.exe, ...) failed" at line 378 in file os_proc.C
12/15 17:47:45 ShutdownFast all jobs.
couronnes:~#


Thanks again for being courageous enough to read this far :)

----------------------------------------------------
CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
Institut de Biologie Physico-Chimique
13 rue Pierre et Marie Curie
75005 PARIS - FRANCE

Tel : +33 158 41 51 70
Fax : +33 158 41 50 26
----------------------------------------------------
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR
