[Condor-users] file transfer problems with vanilla job



I plan to run some very long simulations that can go on for months.  For
performance reasons I use the Intel Fortran Compiler and Intel Math
Kernel Library, so the jobs must be submitted in the vanilla universe.
The executable has its own checkpointing mechanism.  I want the
checkpoint file and other output files transferred back to the submit
node whenever the job is preempted or vacated from the execute node, or
if the job is removed from the job queue.  Condor also needs to be able
to send the checkpoint file and a log file back as input when the job
restarts.

My problem is that the output files are not coming back when the job is
evicted from a node (by Condor or by me using condor_vacate or
condor_hold) or when it is removed from the queue (by me using
condor_rm), and if I do eventually get them to come back, I'm not sure
how to tell Condor which ones to send back to use in restarting the job.

The submit node is in the same pool as the execute nodes (same CM, no
flocking involved), and it does not share a common FILESYSTEM_DOMAIN
with the execute nodes.  I am using Condor 6.6.6.

The program that I run basically uses four types of files:

  1. "init" file: Contains all the data required to start the job from
     scratch.  If the "ckpt" file is present, that file is read and the
     job continues from the last checkpoint;  if "ckpt" does not exist,
     then the "init" file is read and the job starts from the beginning.
  2. "ckpt" file: Contains the minimal data set needed to continue an
     interrupted job, and is periodically overwritten with newer data
     sets.  The first thing that the program does is search for this
     file (using Fortran inquire(file="ckpt",exist=ex)).
  3. "rlog" file: The running log of the job; it contains some data and
     job status information not needed to restart the job.  When the job
     starts from scratch using an "init" file, a new "rlog" file is
     created;  when the job restarts from a "ckpt" file, new records are
     appended to the existing "rlog" file.
  4. "data" file: Written periodically during the job run and numbered
     sequentially (not appended or overwritten).  Contains data to be
     analyzed by other postprocessing tools.
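
As a minimal sketch of the startup decision described in the list above
(written in Python for brevity; the real program is Fortran, and the
function name and return values here are hypothetical, chosen only for
illustration):

```python
import os

def choose_start_mode(workdir="."):
    """Decide how the job starts, based on whether "ckpt" exists.

    Returns a (state_file, rlog_mode) pair:
      ("ckpt", "a")  continue from the checkpoint, append to "rlog"
      ("init", "w")  start from scratch, create a new "rlog"
    """
    # Same test the Fortran code performs with inquire(file="ckpt", exist=ex)
    if os.path.exists(os.path.join(workdir, "ckpt")):
        return ("ckpt", "a")
    return ("init", "w")
```

The point of the sketch is that the program itself only ever looks for
"ckpt" in its working directory, so a restart works only if that file
(and "rlog") has been placed there before the executable runs.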

Here is what I have in the submit file, which works fine if the job can
run from start to finish without being evicted:

    Universe        = vanilla
    Initialdir      = .
    Executable      = ./main
    Error           = ./condor.err
    Log             = ./condor.log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT_OR_EVICT
    transfer_input_files = init
    Queue

When the job is evicted or removed, I expect to find the latest "ckpt"
file (if one has already been written), an "rlog" file, and any number
of "data" files.  Unfortunately nothing comes back and the job always
restarts from scratch, and I cannot figure out why.

If the job is evicted, then when it restarts, it will need as input:
the "ckpt" if it has already been created, the "init" file in case there
is no "ckpt" file, and the "rlog" if it has been created so that new
records can be appended to it.  How do I tell Condor that it needs to
send back these files, especially the "ckpt" and "rlog" files, which
might not yet exist if the job was interrupted early?  By the way, any
numbered "data" files that do come back need not be returned to the
execute node since they are never needed as input.
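
To state the requirement precisely, here is a rough sketch of the
selection I would like to happen on restart.  It is plain Python, not
anything Condor provides; the function name and the idea of scanning the
submit directory are assumptions made only to illustrate the rule:

```python
import os

def restart_inputs(submit_dir="."):
    """List the files that should go to the execute node on a (re)start.

    "init" is always sent; "ckpt" and "rlog" are sent only if they have
    already come back from an earlier run.  Numbered "data" files are
    never sent, because the program does not read them.
    """
    inputs = ["init"]
    for name in ("ckpt", "rlog"):
        if os.path.exists(os.path.join(submit_dir, name)):
            inputs.append(name)
    return inputs
```

For a job interrupted before its first checkpoint this yields only
"init" (plus "rlog" if one was written); after a checkpoint it yields
"init", "ckpt", and "rlog".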

I've tried adding "transfer_output_files" to the submit file, but ran
into three problems:

  1. The "init" file becomes corrupted when it is sent to the execute
     node (maybe a bug?).
  2. Condor panics if it cannot find an output file explicitly listed in
     "transfer_output_files" (e.g., when the "ckpt" file has not yet
     been written).
  3. An unknown number of "data" files are created, and those not
     explicitly listed in "transfer_output_files" are lost.

I would really appreciate any help with this.  Thanks!

Dewey

-- 
Mr. De-Wei Yin, MASc, PEng
Dept of Chemical & Biological Engineering tel: +1 608 262-3370
University of Wisconsin-Madison           fax: +1 608 262-5434
1415 Engineering Drive                    dyin at cae dot wisc dot edu
Madison WI 53706-1691 USA                 www.engr.wisc.edu/groups/mtsm/