Re: [Condor-users] Error files not getting returned to submitter
- Date: Fri, 3 Jun 2005 13:12:24 -0500 (CDT)
- From: Steven Timm <timm@xxxxxxxx>
- Subject: Re: [Condor-users] Error files not getting returned to submitter
Thanks for the suggestion. I removed the transfer_output_files
specification from the submit file and then got the error message
(and the output files) back.
Steven C. Timm, Ph.D (630) 840-8525 timm@xxxxxxxx http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team
On Fri, 3 Jun 2005, Erik Paulson wrote:
> On Fri, Jun 03, 2005 at 12:47:42PM -0500, Steven Timm wrote:
> > I have a user on my condor cluster who has the following submit file
> > Universe = vanilla
> > Executable = pgspythia
> > should_transfer_files = YES
> > when_to_transfer_output = ON_EXIT
> > transfer_input_files = fort.18, fort.88, fort.67, slha.par, cdf.par
> > transfer_output_files = fort.99, fort.66
> > transfer_error = true
> > transfer_output = true
> > Log = simple.$(Cluster).$(Process).log
> > Output = simple.$(Cluster).$(Process).out
> > Error = simple.$(Cluster).$(Process).error
> > Requirements = ( OpSys == "LINUX" )
> > Queue
> > When the above job is submitted, it errors out in a few seconds.
> > The entry in the local job log is:
> > 000 (1827.000.000) 06/03 11:52:58 Job submitted from host:
> > <220.127.116.11:35847>
> > ...
> > 001 (1827.000.000) 06/03 11:53:05 Job executing on host:
> > <18.104.22.168:32794>
> > ...
> > 007 (1827.000.000) 06/03 11:53:05 Shadow exception!
> > Can no longer talk to condor_starter <22.214.171.124:32794>
> > 0 - Run Bytes Sent By Job
> > 6433561 - Run Bytes Received By Job
> > ...
> > and this error continues repeatedly as the job continuously tries
> > to restart.
> > If I look in the condor/execute subdirectory where the process ran,
> > I see that the executable dumped core,
> > and left the following error message:
> > [root@fnpc201 dir_16110]# cat simple.1827.0.error
> > open: No such file or directory
> > apparent state: unit 30 named mass_width_02.mc
> > lately writing sequential formatted external IO
> > -------------------------------------------------
> > But this error file is never transferred back to the user's
> > directory. That error file stays blank.
> > Obviously the user can fix his problem by just making the other
> > input file exist and get transferred over. But it would be nice
> > to figure out why the error message does not get back to where it
> > is supposed to go.
> The problem is that Condor is choosing a different error to report :)
> Condor sees that your job is exiting - whether it exited with an error
> is really not Condor's decision to make - all we know is that the process
> exited (probably with a signal, and WCOREDUMP is probably true) and it
> didn't exit because Condor told it to, so the job has "completed".
> What's going wrong is that you told Condor to transfer back fort.99 and
> fort.66 - but those files don't exist, and the daemon isn't sure why,
> so it aborts itself. Is that the ideal thing to do in this specific case?
> No, clearly not, but we don't yet have the smarts in Condor to do the
> right thing in this case - and trying again never does the wrong thing
> (it just never does the right thing either :)
> transfer_output_files can be a tricky option to use - most people are
> better off not saying what to transfer back and letting Condor figure it
> out automatically.
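
[Editor's note: following the advice above, a sketch of the revised submit
file - the original file with the transfer_output_files line dropped, same
filenames as in the original report. With no explicit output list, Condor
transfers back the files the job creates in its scratch directory, so
simple.*.error reaches the submitter even when the job fails.]

```
Universe = vanilla
Executable = pgspythia
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = fort.18, fort.88, fort.67, slha.par, cdf.par
Log = simple.$(Cluster).$(Process).log
Output = simple.$(Cluster).$(Process).out
Error = simple.$(Cluster).$(Process).error
Requirements = ( OpSys == "LINUX" )
Queue
```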