Re: [Condor-users] Error files not getting returned to submitter



On Fri, Jun 03, 2005 at 12:47:42PM -0500, Steven Timm wrote:
> 
> I have a user on my condor cluster who has the following submit file
> 
> Universe   = vanilla
> Executable = pgspythia
> should_transfer_files = YES
> when_to_transfer_output = ON_EXIT
> transfer_input_files = fort.18, fort.88, fort.67, slha.par, cdf.par
> transfer_output_files = fort.99, fort.66
> transfer_error = true
> transfer_output = true
> Log        = simple.$(Cluster).$(Process).log
> Output     = simple.$(Cluster).$(Process).out
> Error      = simple.$(Cluster).$(Process).error
> 
> Requirements = ( OpSys == "LINUX" )
> Queue
> 
> When the above job is submitted, it errors out in a few seconds.
> The entry in the local job log is:
> 
> 000 (1827.000.000) 06/03 11:52:58 Job submitted from host: 
> <131.225.167.42:35847
> >
> ...
> 001 (1827.000.000) 06/03 11:53:05 Job executing on host: 
> <131.225.167.201:32794>
> ...
> 007 (1827.000.000) 06/03 11:53:05 Shadow exception!
>         Can no longer talk to condor_starter <131.225.167.201:32794>
>         0  -  Run Bytes Sent By Job
>         6433561  -  Run Bytes Received By Job
> ...
> and this error continues repeatedly as the job continuously tries
> to restart.
> 
> 
> If I look in the condor/execute subdirectory where the process ran,
> I see that the executable dumped core, 
> and left the following error message:
> 
> [root@fnpc201 dir_16110]# cat simple.1827.0.error
> open: No such file or directory
> apparent state: unit 30 named mass_width_02.mc
> lately writing sequential formatted external IO
> 
> -------------------------------------------------
> But this error file is never transferred back to the user's
> directory.  That error file stays blank.
> Obviously the user can fix his problem by just making the other 
> input file exist and get transferred over.  But it would be nice 
> to figure out why does the error message not get back to where it 
> is supposed to go.
> 

The problem is that Condor is choosing a different error to report :)

Condor sees that your job is exiting. Whether it exited with an error is
really not Condor's decision to make - all we know is that the process
exited (probably on a signal, with WCOREDUMP probably true) and that it
didn't exit because Condor told it to, so as far as Condor is concerned
the job has "completed".
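The distinction above can be sketched in Python - this is just an illustration of what a POSIX wait status reports to a supervising process (such as Condor's starter), not Condor's actual code. A parent can see *how* a child ended, but not *why*:

```python
import os
import signal

pid = os.fork()
if pid == 0:
    # Child: terminate itself with SIGSEGV, standing in for a job that crashes.
    os.kill(os.getpid(), signal.SIGSEGV)
    os._exit(0)  # not reached

# Parent: collect the raw wait status and decode it.
_, status = os.waitpid(pid, 0)
print(os.WIFSIGNALED(status))                    # True: killed by a signal
print(signal.Signals(os.WTERMSIG(status)).name)  # SIGSEGV
# Whether a core file was actually produced depends on the core-size
# ulimit on the execute machine, so this may be True or False:
print(os.WCOREDUMP(status))
```

From the supervisor's side, "exited on SIGSEGV" is all the information there is; the job still counts as having run to completion rather than having been evicted.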

What's going wrong is that you told Condor to transfer back fort.99 and
fort.66 - but those files don't exist, and the daemon isn't sure why,
so it aborts itself and the job goes back in the queue to be retried.
Is that the ideal thing to do in this specific case? No, clearly not,
but we don't yet have the smarts in Condor to do the right thing here -
and trying again never does the wrong thing (it just never does the
right thing either :)

transfer_output_files can be a tricky option to use - most people are
better off not saying what to transfer back and letting Condor figure it
out automatically.
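For example, dropping transfer_output_files from the submit file above gives something like this (a sketch, not a tested submit file - when the option is omitted, Condor decides on its own which output files to bring back, and the .error file would come back even when the job crashes):

```text
Universe   = vanilla
Executable = pgspythia
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = fort.18, fort.88, fort.67, slha.par, cdf.par
Log        = simple.$(Cluster).$(Process).log
Output     = simple.$(Cluster).$(Process).out
Error      = simple.$(Cluster).$(Process).error

Requirements = ( OpSys == "LINUX" )
Queue
```

The trade-off is that you may get back scratch files you don't care about, but a missing output file can no longer make the shadow abort.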

-Erik