
Re: [HTCondor-users] transfer_in/output_files only if they exist



On 2/11/2019 11:50 AM, Duncan Brown wrote:
> Hi Todd,
> 
> Ah, very nice, that's what I need!
> 
> Cheers,
> Duncan.
> 

Glad to help!

best regards from cold and snowy Madison,
Todd




>> On Feb 8, 2019, at 12:36 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>>
>> On 2/7/2019 8:00 AM, Duncan Brown wrote:
>>> Hi Todd,
>>>
>>> Is there a way to tell condor that it's OK if a specific file listed in transfer_input_files does not exist (and the same question with an output file)? The use case is using condor file i/o to manage a checkpoint file.
>>
>> Hi Duncan,
>>
>> TJ already answered the question above, but I am not certain you need to
>> do the above to handle your checkpoint file use case. :)
>>
>> When your submit file has
>>
>>     when_to_transfer_output = ON_EXIT_OR_EVICT
>>
>> what happens is that when your job is evicted, any output files are
>> transferred back to the SPOOL directory for that job on the submit
>> machine.  When your job is rescheduled to run again, HTCondor first
>> sends all the files listed in transfer_input_files to the execute
>> node, **and then also sends all the files stored in SPOOL**.  The
>> point is that your checkpoint file need not be listed explicitly in
>> transfer_input_files at all; it will be transferred on restart,
>> provided it was treated as output from a previous run.
>>
>> So imagine you have a job that has input data ('my_input_data'), output
>> data ('my_output_data'), and it periodically writes a checkpoint file
>> ('ckpt_file').  Your submit file could look like:
>>
>>     executable = foo.exe
>>     when_to_transfer_output = ON_EXIT_OR_EVICT
>>     transfer_input_files = my_input_data
>>     transfer_output_files = my_output_data ckpt_file
>>
>> With the above, the only issue may be your job going on hold if it is
>> evicted before it ever writes its initial ckpt_file, because the file
>> will not exist and yet is explicitly declared in transfer_output_files.
>> To prevent this, you could create a zero-length ckpt_file at
>> submission and add it to transfer_input_files.  This way the job will
>> never go on hold, because all files listed in transfer_output_files
>> will always exist.  Because HTCondor first sends the input files and
>> then sends the spool files, on restart after a checkpoint HTCondor will
>> first send the zero-length ckpt_file from transfer_input_files, but
>> then immediately overwrite it when the ckpt_file contents from the
>> SPOOL directory (i.e. the ckpt_file contents from the last run) are
>> sent.
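>>
>> As a sketch of that workaround (not a tested recipe), the submit file
>> might become the following.  It assumes an empty ckpt_file has already
>> been created next to the submit file (for example with "touch
>> ckpt_file"), that foo.exe treats an empty ckpt_file as "no checkpoint
>> yet", and that file transfer is switched on with
>> should_transfer_files; the queue line is added only to make the
>> example complete:
>>
>>     executable = foo.exe
>>     should_transfer_files = YES
>>     when_to_transfer_output = ON_EXIT_OR_EVICT
>>     # ckpt_file starts as a zero-length placeholder (assumed to be
>>     # created before submission)
>>     transfer_input_files = my_input_data, ckpt_file
>>     transfer_output_files = my_output_data, ckpt_file
>>     queue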
>>
>> Hope the above helps,
>> Todd
>>
>>> The use case is using condor file i/o to manage a checkpoint file. The first time the job is run, the checkpoint file does not exist so the job gets stuck in hold state. I want to be able to tell condor that it's OK that this file is not there.
>>>
>>> Cheers,
>>> Duncan.
>>>
>>
>>
>>