
[Condor-users] Error files not getting returned to submitter



I have a user on my condor cluster who has the following submit file

Universe   = vanilla
Executable = pgspythia
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = fort.18, fort.88, fort.67, slha.par, cdf.par
transfer_output_files = fort.99, fort.66
transfer_error = true
transfer_output = true
Log        = simple.$(Cluster).$(Process).log
Output     = simple.$(Cluster).$(Process).out
Error      = simple.$(Cluster).$(Process).error

Requirements = ( OpSys == "LINUX" )
Queue

When the above job is submitted, it errors out in a few seconds.
The entry in the local job log is:

000 (1827.000.000) 06/03 11:52:58 Job submitted from host: <131.225.167.42:35847>
...
001 (1827.000.000) 06/03 11:53:05 Job executing on host: <131.225.167.201:32794>
...
007 (1827.000.000) 06/03 11:53:05 Shadow exception!
        Can no longer talk to condor_starter <131.225.167.201:32794>
        0  -  Run Bytes Sent By Job
        6433561  -  Run Bytes Received By Job
...
and this error repeats as the job keeps trying to restart.


If I look in the condor/execute subdirectory where the process ran,
I see that the executable dumped core, 
and left the following error message:

[root@fnpc201 dir_16110]# cat simple.1827.0.error
open: No such file or directory
apparent state: unit 30 named mass_width_02.mc
lately writing sequential formatted external IO

-------------------------------------------------
But this error file is never transferred back to the user's
directory; the copy there stays blank.
Obviously the user can fix his problem by making the missing
input file exist and transferring it over.  But it would be nice
to figure out why the error message does not get back to where it
is supposed to go.
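For what it's worth, the obvious fix on the user's side is just to
add the file the Fortran runtime complains about to the input
transfer list (assuming mass_width_02.mc sits in the submit
directory; I have not tested this myself):

transfer_input_files = fort.18, fort.88, fort.67, slha.par, cdf.par, mass_width_02.mc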

From the StartLog:

6/3 11:53:04 DaemonCore: Command received via TCP from host <131.225.167.42:41619>
6/3 11:53:04 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
6/3 11:53:04 vm1: Got activate_claim request from shadow (<131.225.167.42:41619>)
6/3 11:53:04 vm1: Remote job ID is 1827.0
6/3 11:53:04 vm1: Got universe "VANILLA" (5) from request classad
6/3 11:53:04 vm1: State change: claim-activation protocol successful
6/3 11:53:04 vm1: Changing activity: Idle -> Busy
6/3 11:53:06 Starter pid 16069 exited with status 4
6/3 11:53:06 vm1: State change: starter exited

From the StarterLog:

6/3 11:53:04 ******************************************************
6/3 11:53:04 ** condor_starter (CONDOR_STARTER) STARTING UP
6/3 11:53:04 ** /export/osg/grid/condor/sbin/condor_starter
6/3 11:53:04 ** $CondorVersion: 6.7.6 Mar 15 2005 $
6/3 11:53:04 ** $CondorPlatform: I386-LINUX_RH9 $
6/3 11:53:04 ** PID = 16069
6/3 11:53:04 ******************************************************
6/3 11:53:04 Using config file: /export/osg/grid/condor/etc/condor_config
6/3 11:53:04 Using local config files: 
/export/osg/grid/condor/etc/group_params.config 
/export/osg/grid/condor/local.fnpc201/condor_config.local
6/3 11:53:04 DaemonCore: Command Socket at <131.225.167.201:49653>
6/3 11:53:04 Done setting resource limits
6/3 11:53:04 Submitting machine is "fngp-osg.fnal.gov"
6/3 11:53:05 File transfer completed successfully.
6/3 11:53:05 Starting a VANILLA universe job with ID: 1827.0
6/3 11:53:05 IWD: /local/stage1/condor/execute/dir_16069
6/3 11:53:05 Output file: 
/local/stage1/condor/execute/dir_16069/simple.1827.0.out
6/3 11:53:05 Error file: 
/local/stage1/condor/execute/dir_16069/simple.1827.0.error
6/3 11:53:05 About to exec 
/local/stage1/condor/execute/dir_16069/condor_exec.exe
6/3 11:53:05 Create_Process succeeded, pid=16071
6/3 11:53:05 Process exited, pid=16071, signal=6
6/3 11:53:05 ReliSock: put_file: Failed to open file 
/local/stage1/condor/execute/dir_16069/fort.99, errno = 2.
6/3 11:53:05 ERROR "DoUpload: Failed to send file 
/local/stage1/condor/execute/dir_16069/fort.99, exiting at 1577
" at line 1576 in file file_transfer.C
6/3 11:53:05 ShutdownFast all jobs.

------------------------------------------
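If I am reading the StarterLog right, the transfer back to the submit
host aborts as soon as put_file cannot find fort.99 (the first entry
in transfer_output_files, which the job never created because it died
on the missing input file), and the starter shuts down before it ever
gets to simple.1827.0.error.  As I understand it, one workaround
(untested) would be to stop listing the output files explicitly and
let Condor send back whatever new files the job leaves in its sandbox,
along the lines of:

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
# transfer_output_files deliberately left unset so a missing fort.99
# does not abort the whole transfer

Whether aborting the entire transfer on one missing output file is the
intended behavior in 6.7.6 or a bug, I don't know.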


Any idea what might be going on here?
Thanks

Steve Timm



------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team