
Re: [Condor-users] job running on two hosts?

Erik Paulson <epaulson@xxxxxxxxxxx> writes:

> On Wed, Nov 17, 2004 at 09:49:10AM -0500, Dan Christensen wrote:
>> Dan Christensen <jdc@xxxxxx> writes:
>> >
>> > And when it ran the second time, it seemed to start at the beginning,
>> > because it tried to open its output file, and it noticed that it
>> > already existed and quit right away.
> Files are opened on the submit machine as the job runs, and are
> not stored in the checkpoint. (File pointers are, so we know where we
> left off). If there is no checkpoint, we will start from the beginning. 
> If your job looks at what it might have previously written, it is in
> violation of one of the restrictions on standard universe jobs:

Right, that makes sense.  Normally the job should checkpoint, so this
won't be a problem, but it does cause a problem when the job
terminates abnormally.
So I could remove checking for the existence of the output file, but
I would like to avoid overwriting an existing file if I accidentally
ask my job to send output to the same place twice.  I guess Condor
can't easily accommodate this mode of working; maybe I'll add some
random noise to the file name...
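As a sketch of that workaround (this is my own illustration, not
anything Condor provides): open the output file with O_EXCL so an
existing file is never clobbered, and fall back to a randomized name
only when there is a collision. The `open_output` helper and the
suffix scheme are hypothetical names/choices, not part of any Condor
API.

```python
import os
import uuid

def open_output(path):
    """Open path for writing without clobbering an existing file.

    If the file already exists, fall back to a randomized name
    instead of overwriting.  Returns (actual_path, file_object).
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL  # O_EXCL: fail if it exists
    try:
        fd = os.open(path, flags, 0o644)
    except FileExistsError:
        # Collision: append a short random suffix before the extension.
        base, ext = os.path.splitext(path)
        path = "%s.%s%s" % (base, uuid.uuid4().hex[:8], ext)
        fd = os.open(path, flags, 0o644)
    return path, os.fdopen(fd, "w")
```

One caveat for this thread's situation: if a job can restart from the
beginning (no checkpoint), a purely random suffix will differ on the
rerun, so the suffix should instead be derived from something stable
like the cluster/process id if the rerun is meant to reuse its name.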

> http://www.cs.wisc.edu/condor/manual/v6.6.7/2_4Road_map_Running.html#SECTION00341100000000000000
>> Here's another clue I just found:  I got an e-mail from Condor saying
>> that condor_schedd died due to a SEGV.  I guess that
>> would explain the missing information in the user log file.
>> > Date: Tue, 16 Nov 2004 02:11:34 -0500
>> > 
>> > "/usr/sbin/condor_schedd" on "jdc.math.uwo.ca" died due to signal 11.
>> > Condor will automatically restart this process in 10 seconds.
>> But now the question is, why did it die?
> We'd need to see the complete logfile of the schedd. Please send it to
> condor-admin@xxxxxxxxxxx, and we'll try and debug it off-line.

Ok, will do.