
Re: [HTCondor-users] Job standard output gets overwritten after preemption

On Mar 28, 2019, at 6:29 AM, David Rebatto <david.rebatto@xxxxxxxxxx> wrote:

Hi Todd!

On 27/03/19 19:51, Todd Tannenbaum wrote:
On 3/27/2019 12:21 PM, David Rebatto wrote:

I noticed that vanilla jobs restarting after preemption overwrite the
output of the previous execution.
Is there a way to instruct them to *append* to the output file instead?

By "output file" above, I assume you mean the job's stdout as defined by
the "output=filename" line in your submit file?

Yes, see below.

Are you using HTCondor's file transfer or a shared file system?

File transfer, the jobs are flocking away from the submission pool.

If you are using HTCondor's file transfer, the issue is likely that the
output from previous executions is not being transferred back to the
submit machine from the execute machine upon preemption. I think you
can achieve what you want by adding the following line to your submit file:
    when_to_transfer_output = ON_EXIT_OR_EVICT

I have it in the submit file, and it is working. Here's the (stripped) submit file:

universe        = vanilla
requirements    = TARGET.ClusterName == "gymno-pool"
executable      = test_script.sh
args            = 600
output          = jobtest_condor.out.$(Cluster).$(Process)
error           = jobtest_condor.err.$(Cluster).$(Process)
log             = jobtest_condor.log.$(Cluster).$(Process)

when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_output_remaps = "squares.txt=squares.txt.$(Cluster).$(Process)"

The job prints some debug information on stdout, writes its real output to 'squares.txt', and saves a checkpoint in 'checkpoint.txt' whenever it gets a SIGTERM.
When it restarts after preemption all the files are there, and it resumes from what it saved in checkpoint.txt, appending new output lines to squares.txt.
Still, jobtest_condor.out.* contains debug messages from the last execution only.
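
For readers without the attachment, the behavior described above could be sketched roughly as follows. This is a hypothetical stand-in, not the attached test_script.sh: the function name run_job, the meaning of the argument (number of squares to compute), the one-second delay, the "debug:" message wording, and the exit code are all assumptions made for illustration. Only the file names squares.txt and checkpoint.txt and the SIGTERM-triggered checkpoint come from the description.

```shell
#!/bin/bash
# Hypothetical sketch of a self-checkpointing job; NOT the attached script.
# Assumption: $1 is how many squares to compute, $2 an optional delay in seconds.

CHECKPOINT=checkpoint.txt
OUTPUT=squares.txt
terminate=0

run_job() {
    local count=$1 delay=${2:-1} i=1

    # Resume from the index saved by an earlier, evicted run, if present.
    if [ -f "$CHECKPOINT" ]; then
        i=$(cat "$CHECKPOINT")
        echo "debug: resuming at i=$i"
    else
        echo "debug: fresh start"
    fi

    # Only set a flag in the handler; the checkpoint is written at the top
    # of the loop so that it always matches what is already in $OUTPUT.
    trap 'terminate=1' TERM

    while [ "$i" -le "$count" ]; do
        if [ "$terminate" -eq 1 ]; then
            echo "$i" > "$CHECKPOINT"
            echo "debug: SIGTERM received, checkpointed at i=$i"
            exit 143
        fi
        echo "$i $((i * i))" >> "$OUTPUT"   # real output, appended
        i=$((i + 1))
        sleep "$delay"
    done
    echo "debug: finished after $count squares"
}

# Run only when called with arguments, e.g. ./test_script.sh 600
if [ "$#" -ge 1 ]; then
    run_job "$@"
fi
```

The "debug:" lines go to stdout, so in this sketch they are what would end up in jobtest_condor.out.*, while squares.txt and checkpoint.txt are ordinary files in the job's working directory.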

I attach the full submit file and the script, maybe I'm doing something wrong in there.

In order to have the new run's output appended to the old run's output (instead of overwriting), you also need to add one of the following lines to your submit file:

+WantCheckpointSignal = true

+WantFTOnCheckpoint = true

 - Jaime