
Re: [Condor-users] Error files not getting returned to submitter



Thanks for the suggestion.  I removed the transfer_output_files
specification from the submit file and then got the error message
(and the output files) back.
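
For reference, a sketch of what the submit file looks like after that
change. It is the file quoted below with only the transfer_output_files
line dropped; the comment lines are added here for illustration and are
not part of the original file.

```
Universe   = vanilla
Executable = pgspythia
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = fort.18, fort.88, fort.67, slha.par, cdf.par
# transfer_output_files is intentionally left out, so Condor decides
# by itself which files from the job's scratch directory to send back.
transfer_error = true
transfer_output = true
Log        = simple.$(Cluster).$(Process).log
Output     = simple.$(Cluster).$(Process).out
Error      = simple.$(Cluster).$(Process).error

Requirements = ( OpSys == "LINUX" )
Queue
```

With this version a job that crashes before writing fort.99 and fort.66
should no longer trip over the missing files at transfer time, which
matches what I saw: the error file came back.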

Steve Timm


------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team

On Fri, 3 Jun 2005, Erik Paulson wrote:

> On Fri, Jun 03, 2005 at 12:47:42PM -0500, Steven Timm wrote:
> > 
> > I have a user on my condor cluster who has the following submit file
> > 
> > Universe   = vanilla
> > Executable = pgspythia
> > should_transfer_files = YES
> > when_to_transfer_output = ON_EXIT
> > transfer_input_files = fort.18, fort.88, fort.67, slha.par, cdf.par
> > transfer_output_files = fort.99, fort.66
> > transfer_error = true
> > transfer_output = true
> > Log        = simple.$(Cluster).$(Process).log
> > Output     = simple.$(Cluster).$(Process).out
> > Error      = simple.$(Cluster).$(Process).error
> > 
> > Requirements = ( OpSys == "LINUX" )
> > Queue
> > 
> > When the above job is submitted, it errors out in a few seconds.
> > The entry in the local job log is:
> > 
> > 000 (1827.000.000) 06/03 11:52:58 Job submitted from host: <131.225.167.42:35847>
> > ...
> > 001 (1827.000.000) 06/03 11:53:05 Job executing on host: <131.225.167.201:32794>
> > ...
> > 007 (1827.000.000) 06/03 11:53:05 Shadow exception!
> >         Can no longer talk to condor_starter <131.225.167.201:32794>
> >         0  -  Run Bytes Sent By Job
> >         6433561  -  Run Bytes Received By Job
> > ...
> > and this error continues repeatedly as the job continuously tries
> > to restart.
> > 
> > 
> > If I look in the condor/execute subdirectory where the process ran,
> > I see that the executable dumped core, 
> > and left the following error message:
> > 
> > [root@fnpc201 dir_16110]# cat simple.1827.0.error
> > open: No such file or directory
> > apparent state: unit 30 named mass_width_02.mc
> > lately writing sequential formatted external IO
> > 
> > -------------------------------------------------
> > But this error file is never transferred back to the user's
> > directory.  That error file stays blank.
> > Obviously the user can fix his problem by just making the other 
> > input file exist and get transferred over.  But it would be nice 
> > to figure out why the error message does not get back to where it 
> > is supposed to go.
> > 
> 
> The problem is that Condor is choosing a different error to report :)
> 
> Condor sees that your job is exiting - that it's with an error is really
> not Condor's decision to make - all we know is that the process exited
> (probably with a signal, and that WCOREDUMP is probably true) and it didn't
> exit because Condor told it to, so the job has "completed"
> 
> What's going wrong is that you told Condor to transfer back fort.99 and
> fort.66  - but those files don't exist, and the daemon isn't sure why,
> so it aborts itself. Is that the ideal thing to do in this specific case?
> No, clearly not, but we don't yet have the smarts in Condor to do the
> right thing in this case - and trying again never does the wrong thing
> (it just never does the right thing either :)
> 
> transfer_output_files can be a tricky option to use - most people are
> better off not saying what to transfer back and letting Condor figure it
> out automatically.
> 
> -Erik