Re: [HTCondor-users] transfer_in/output_files only if they exist



Hi Todd,

Ah, very nice, that's what I need!

Cheers,
Duncan.

> On Feb 8, 2019, at 12:36 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
> 
> On 2/7/2019 8:00 AM, Duncan Brown wrote:
>> Hi Todd,
>> 
>> Is there a way to tell condor that it's OK if a specific file listed in transfer_input_files does not exist (and the same question with an output file)? The use case is using condor file i/o to manage a checkpoint file.
> 
> Hi Duncan,
> 
> TJ already answered the question above, but I am not certain you need to 
> do the above to handle your checkpoint file use case. :)
> 
> When your submit file has
> 
>    when_to_transfer_output = ON_EXIT_OR_EVICT
> 
> what happens is when your job is evicted, any output files are 
> transferred back to the SPOOL directory for that job on the submit 
> machine.  When your job is rescheduled to run again, HTCondor first 
> sends all the files listed in transfer_input_files to the execute node, **and 
> then subsequently also sends all the files stored in SPOOL**.   The 
> point being your checkpoint file need not be listed explicitly in 
> transfer_input_files at all... it will get transferred on restart 
> assuming it was considered output from a previous run.
> 
> So imagine you have a job that has input data ('my_input_data'), output 
> data ('my_output_data'), and it periodically writes a checkpoint file 
> ('ckpt_file').  Your submit file could look like:
> 
>    executable = foo.exe
>    when_to_transfer_output = ON_EXIT_OR_EVICT
>    transfer_input_files = my_input_data
>    transfer_output_files = my_output_data ckpt_file
> 
> With the above, the only issue may be your job going on hold if your job 
> is evicted before it ever writes out its initial ckpt_file, because it 
> will not exist and yet is explicitly declared in transfer_output_files. 
> To prevent this case, you could make a zero-length ckpt_file on 
> submission, and add it to transfer_input_files.  This way the job will 
> never go on hold because all files listed in "transfer_output_files" 
> will always exist.  Because HTCondor first sends the input files and 
> then sends the spool files, on restart after a checkpoint HTCondor will 
> first send the zero-length ckpt_file from transfer_input_files, and then 
> immediately overwrite it with the ckpt_file from the SPOOL directory 
> (i.e. the ckpt_file contents from the last run).
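> 
> As a minimal sketch of that workaround, the submit file might look 
> something like the following.  It assumes an empty ckpt_file has been 
> created next to the submit file beforehand (for example with 
> `touch ckpt_file`), and that the job treats an empty ckpt_file as "no 
> checkpoint yet"; the closing queue statement is also an assumption, 
> since the fragment above leaves it out.
> 
> ```
> executable = foo.exe
> when_to_transfer_output = ON_EXIT_OR_EVICT
> # ckpt_file starts out zero-length (assumed to be created before submission)
> transfer_input_files = my_input_data, ckpt_file
> # ckpt_file always exists on the execute node, so this list never fails
> transfer_output_files = my_output_data, ckpt_file
> # assumed: queue a single job
> queue
> ```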
> 
> Hope the above helps,
> Todd
> 
>> The use case is using condor file i/o to manage a checkpoint file. The first time the job is run, the checkpoint file does not exist so the job gets stuck in hold state. I want to be able to tell condor that it's OK that this file is not there.
>> 
>> Cheers,
>> Duncan.
>> 
> 
> 
> 
> -- 
> Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
> Center for High Throughput Computing   Department of Computer Sciences
> HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
> Phone: (608) 263-7132                  Madison, WI 53706-1685

-- 

Duncan Brown                              Room 263-1, Physics Department
Charles Brightman Professor of Physics     Syracuse University, NY 13244
http://dabrown.expressions.syr.edu                   Phone: 315 443 5993