
Re: [Condor-users] simple condor error



Steve,

The error message that says failed to execute 'blah condor_exec.exe ...' is actually showing you the command followed by its arguments, _including_ arg 0, which is the name "condor_exec.exe" that Condor uses for all jobs. This is confusing. The message should not be read as meaning that Condor is trying to access a file literally named 'blah condor_exec.exe'. I'm fixing the error message to make this clear.
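
To make that reading concrete, here is a minimal sketch that splits the quoted part of such a message into the executable path and the argument list. The function name explain_failed_exec is made up for illustration and is not part of Condor, and the sketch assumes the executable path contains no spaces.

```shell
# Split the text between the quotes of a "Failed to execute '...'" message
# into the executable path and the argument list (arg 0 comes first).
# Assumption: the executable path itself contains no spaces.
# explain_failed_exec is a hypothetical helper, not a Condor tool.
explain_failed_exec() {
    quoted=$1
    path=${quoted%% *}      # everything before the first space
    args=${quoted#* }       # everything after the first space
    if [ "$path" = "$quoted" ]; then
        args=""             # no space at all: no argument list was printed
    fi
    echo "executable: $path"
    echo "arguments:  $args"
}
```

Called with '/prj/ilc/lebrun/CHEF/LinacBench1/runGrid.csh condor_exec.exe', it reports runGrid.csh as the executable and condor_exec.exe as arg 0, which is how the message is meant to be read.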

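Does the #! line at the top of the script point to the correct path for csh? I believe an error there could cause the problem you are seeing: if the interpreter named on that line does not exist on the execute machine, the exec fails with "No such file or directory" even though the script itself is present.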
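
One way to check this on the execute machine is sketched below. The function names shebang_interpreter and check_shebang are made up for illustration, and the sketch assumes the #! line separates the interpreter from any arguments with spaces rather than tabs.

```shell
# Print the interpreter path named on the first line of a script, or nothing
# if that line does not start with "#!".
# Assumption: spaces (not tabs) separate the interpreter from its arguments.
# A trailing carriage return (DOS line endings) is kept as part of the path,
# so it also shows up below as a missing interpreter.
shebang_interpreter() {
    sed -n '1s/^#! *\([^ ]*\).*/\1/p' "$1"
}

# Report whether the interpreter named by the script exists and is executable.
# Returns 0 if it does, 1 if it does not, 2 if there is no "#!" line.
check_shebang() {
    interp=$(shebang_interpreter "$1")
    if [ -z "$interp" ]; then
        echo "no #! line in $1"
        return 2
    fi
    if [ -x "$interp" ]; then
        echo "ok: $interp"
        return 0
    fi
    echo "missing interpreter: $interp"
    return 1
}
```

For the job in this thread the call would be check_shebang /prj/ilc/lebrun/CHEF/LinacBench1/runGrid.csh, run on the machine where the job executes, since csh may be installed at a different path there than on the submit machine.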

--Dan

On Jun 20, 2006, at 3:52 PM, Steven Timm wrote:


I have a user trying to submit a simple vanilla universe job.
This is the job submit file:

[root@ilcsim LinacBench1]# more myJob1.run
universe = vanilla
executable = /prj/ilc/lebrun/CHEF/LinacBench1/runGrid.csh
transfer_output = true
transfer_error = true
transfer_executable = false
log = myjob.log.$(Cluster).$(Process)
notification = NEVER
remote_initialdir = /prj/ilc/lebrun/CHEF/LinacBench1
output = myjob.out.$(Cluster).$(Process)
error = myjob.err.$(Cluster).$(Process)
queue

Here's what's in ShadowLog

6/20 13:57:14 ******************************************************
6/20 13:57:14 ** condor_shadow (CONDOR_SHADOW) STARTING UP
6/20 13:57:14 ** /opt/condor-6.7.19/sbin/condor_shadow
6/20 13:57:14 ** $CondorVersion: 6.7.19 May 10 2006 $
6/20 13:57:14 ** $CondorPlatform: I386-LINUX_RH9 $
6/20 13:57:14 ** PID = 24934
6/20 13:57:14 ** Log last touched 6/13 15:32:28
6/20 13:57:14 ******************************************************
6/20 13:57:14 Using config file: /etc/condor/condor_config
6/20 13:57:14 Using local config files: /opt/condor-6.7.19/local.ilcsim/condor_config.local
6/20 13:57:14 DaemonCore: Command Socket at <131.225.110.52:33278>
6/20 13:57:14 Initializing a VANILLA shadow for job 9.0
6/20 13:57:14 (9.0) (24934): Request to run on <131.225.110.52:32775> was ACCEPTED
6/20 13:57:14 (9.0) (24934): ERROR "Error from starter on vm1@xxxxxxxxxxxxxxx: Failed to execute '/prj/ilc/lebrun/CHEF/LinacBench1/runGrid.csh condor_exec.exe': No such file or directory" at line 597 in file pseudo_ops.C
6/20 13:57:16 **********************************************


And the same in the UserLog

[root@ilcsim log]# more /prj/ilc/lebrun/CHEF/LinacBench1/myjob.log.13.0
000 (013.000.000) 06/20 15:27:23 Job submitted from host: <131.225.110.52:32774>
...
001 (013.000.000) 06/20 15:27:27 Job executing on host: <131.225.110.52:32775>
...
007 (013.000.000) 06/20 15:27:27 Shadow exception!
         Error from starter on vm1@xxxxxxxxxxxxxxx: Failed to execute '/prj/ilc/lebrun/CHEF/LinacBench1/runGrid.csh condor_exec.exe': No such file or directory
         0  -  Run Bytes Sent By Job
         0  -  Run Bytes Received By Job
...
001 (013.000.000) 06/20 15:27:30 Job executing on host: <131.225.110.52:32775>
...
007 (013.000.000) 06/20 15:27:30 Shadow exception!
         Error from starter on vm1@xxxxxxxxxxxxxxx: Failed to execute '/prj/ilc/lebrun/CHEF/LinacBench1/runGrid.csh condor_exec.exe': No such file or directory
----------------------------------------------------
[root@ilcsim LinacBench1]# ls -l /prj/ilc/lebrun/CHEF/LinacBench1/runGrid.csh
-rwxr-xr-x    1 lebrun   bphys         262 Jun 20 15:26 /prj/ilc/lebrun/CHEF/LinacBench1/runGrid.csh

------------------------------------------------------------------

So what's wrong? Why is it trying to execute
'/prj/ilc/lebrun/CHEF/LinacBench1/runGrid.csh condor_exec.exe',
which doesn't exist, instead of
'/prj/ilc/lebrun/CHEF/LinacBench1/runGrid.csh',
which does?

Something basic is wrong.  Jobs being run by other users using
a very similar basic job file don't have this problem.
Any idea what?

Steve Timm




------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team