Re: [Condor-users] job running on two hosts?
- Date: Thu, 18 Nov 2004 11:27:46 -0500
- From: Dan Christensen <jdc@xxxxxx>
- Subject: Re: [Condor-users] job running on two hosts?
Erik Paulson <epaulson@xxxxxxxxxxx> writes:
> On Wed, Nov 17, 2004 at 09:49:10AM -0500, Dan Christensen wrote:
>> Dan Christensen <jdc@xxxxxx> writes:
>> > And when it ran the second time, it seemed to start at the beginning,
>> > because it tried to open its output file, and it noticed that it
>> > already existed and quit right away.
> Files are opened on the submit machine as the job runs, and are
> not stored in the checkpoint. (File pointers are, so we know where we
> left off.) If there is no checkpoint, we will start from the beginning.
> If your job looks at what it might have previously written, it is in
> violation of one of the restrictions on standard universe jobs:
Right, that makes sense. Normally the job should checkpoint, so this
won't be a problem, but when it terminates abnormally it does cause a
problem.
So I could remove checking for the existence of the output file, but
I would like to avoid overwriting an existing file if I accidentally
ask my job to send output to the same place twice. I guess Condor
can't easily accommodate this mode of working; maybe I'll add some
random noise to the file name...
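For what it's worth, here's a minimal sketch of that random-suffix idea in
plain Python (outside Condor entirely; the function name and suffix length
are just illustrative). Creating the file with O_EXCL makes the
create-if-absent check atomic, so a duplicate submission can't silently
clobber an earlier result:

```python
import os
import secrets

def unique_output_path(base):
    """Append a random hex suffix to `base` and create the file
    atomically; retry with a new suffix on the (unlikely) collision."""
    while True:
        candidate = "%s.%s" % (base, secrets.token_hex(4))
        try:
            # O_EXCL makes creation fail if the file already exists,
            # so existence check and creation happen in one step.
            fd = os.open(candidate,
                         os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
            return candidate, fd
        except FileExistsError:
            continue  # suffix collision; pick another one

# Example use: write results to a name that is guaranteed fresh.
path, fd = unique_output_path("job.out")
os.write(fd, b"results\n")
os.close(fd)
```

Note this only helps against accidental duplicate submissions; it wouldn't
change the restart-from-beginning behavior discussed above, since each
fresh run would simply pick a new name.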
>> Here's another clue I just found: I got an e-mail from Condor saying
>> that condor_schedd died on 184.108.40.206 due to a SEGV. I guess that
>> would explain the missing information in the user log file.
>> > Date: Tue, 16 Nov 2004 02:11:34 -0500
>> > "/usr/sbin/condor_schedd" on "jdc.math.uwo.ca" died due to signal 11.
>> > Condor will automatically restart this process in 10 seconds.
>> But now the question is, why did it die?
> We'd need to see the complete logfile of the schedd. Please send it to
> condor-admin@xxxxxxxxxxx, and we'll try and debug it off-line.
Ok, will do.