Re: [HTCondor-users] transfer_in/output_files only if they exist



On 2/7/2019 8:00 AM, Duncan Brown wrote:
> Hi Todd,
> 
> Is there a way to tell condor that it's OK if a specific file listed in transfer_input_files does not exist (and the same question with an output file)? The use case is using condor file i/o to manage a checkpoint file.

Hi Duncan,

TJ already answered the question above, but I am not certain you need to 
do the above to handle your checkpoint file use case. :)

When your submit file has

    when_to_transfer_output = ON_EXIT_OR_EVICT

any output files are transferred back to the job's SPOOL directory on 
the submit machine whenever the job is evicted.  When your job is 
rescheduled to run again, HTCondor first sends all the files listed in 
transfer_input_files to the execute node, **and then subsequently also 
sends all the files stored in SPOOL**.  The point is that your 
checkpoint file need not be listed explicitly in transfer_input_files 
at all; it will get transferred on restart, provided it was 
transferred back as output from a previous run.

So imagine you have a job that has input data ('my_input_data'), output 
data ('my_output_data'), and it periodically writes a checkpoint file 
('ckpt_file').  Your submit file could look like:

    executable = foo.exe
    when_to_transfer_output = ON_EXIT_OR_EVICT
    transfer_input_files = my_input_data
    transfer_output_files = my_output_data ckpt_file

With the above, the only issue may be that your job goes on hold if it 
is evicted before it ever writes its initial ckpt_file, because the 
file will not exist and yet is explicitly declared in 
transfer_output_files.  To prevent this, you could create a zero-length 
ckpt_file at submission and add it to transfer_input_files.  This way 
the job will never go on hold, because all files listed in 
transfer_output_files will always exist.  Because HTCondor first sends 
the input files and then sends the spool files, on restart after a 
checkpoint HTCondor will first send the zero-length ckpt_file from 
transfer_input_files, but then immediately overwrite it when the 
ckpt_file from the SPOOL directory (i.e. the ckpt_file contents from 
the last run) is sent.
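
As a rough sketch of that workaround, the submit file might then look 
something like the following.  It assumes an empty ckpt_file has 
already been created in the submit directory (for example with 
"touch ckpt_file").  The should_transfer_files and queue lines are 
additions for completeness rather than part of the example above, so 
adjust them to match your setup:

    executable = foo.exe
    # Assumption: file transfer is explicitly enabled for this job.
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT_OR_EVICT
    # The zero-length ckpt_file is sent on the first run; on later runs
    # the copy kept in SPOOL is sent afterwards and overwrites it.
    transfer_input_files = my_input_data ckpt_file
    transfer_output_files = my_output_data ckpt_file
    queue

Note that with this approach the job itself has to treat an empty 
ckpt_file as "no checkpoint yet" and start from the beginning.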

Hope the above helps,
Todd

>The use case is using condor file i/o to manage a checkpoint file. The first time the job is run, the checkpoint file does not exist so the job gets stuck in hold state. I want to be able to tell condor that it's OK that this file is not there.
> 
> Cheers,
> Duncan.
> 



-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685