
Re: [Condor-users] job running on two hosts?



On Wed, Nov 17, 2004 at 09:49:10AM -0500, Dan Christensen wrote:
> Dan Christensen <jdc@xxxxxx> writes:
> 
> >
> > And when it ran the second time, it seemed to start at the beginning,
> > because it tried to open its output file, and it noticed that it
> > already existed and quit right away.
> 

Files are opened on the submit machine as the job runs, and their
contents are not stored in the checkpoint. (File pointers are, so we
know where we left off.) If there is no checkpoint, the job will start
from the beginning. If your job looks at what it might have previously
written, it is in violation of one of the restrictions on standard
universe jobs:
http://www.cs.wisc.edu/condor/manual/v6.6.7/2_4Road_map_Running.html#SECTION00341100000000000000

> Here's another clue I just found:  I got an e-mail from Condor saying
> that condor_schedd died on 129.100.75.77 due to a SEGV.  I guess that
> would explain the missing information in the user log file.
> 
> > Date: Tue, 16 Nov 2004 02:11:34 -0500
> > 
> > "/usr/sbin/condor_schedd" on "jdc.math.uwo.ca" died due to signal 11.
> > Condor will automatically restart this process in 10 seconds.
> 
> But now the question is, why did it die?
> 

We'd need to see the complete logfile of the schedd. Please send it to
condor-admin@xxxxxxxxxxx, and we'll try to debug it offline.

-Erik