
Re: [HTCondor-users] transfer_in/output_files only if they exist



Hi Todd,

Follow-up question: is there a way to set something like

periodic_transfer_spool = 3600

so that the contents of the job's spool directory can be transferred back to the shadow's spool periodically? In combination with ON_EXIT_OR_EVICT that would give me periodic checkpointing if the job dies unexpectedly, in addition to when it is cleanly evicted.

I could fake this with some combination of periodic_hold and periodic_release, but my recollection is that hold sends a hard kill and leaves no time for the usual cycle of SIGTERM, a grace period for the job to checkpoint and exit, and then SIGKILL. If my job's checkpoint timer loses sync with condor's periodic_hold, that would be a recipe for badput.
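
For reference, the soft-kill cycle mentioned above looks roughly like this from the job's side. This is only a sketch under assumptions: the file name ckpt_file is the one used in the quoted thread below, while the JSON state, the `TermFlag` class, and the `save_checkpoint`, `run`, and `main` helpers are hypothetical names made up for illustration. The job also checkpoints on its own schedule (every `every` steps here), so it does not depend on the signal arriving.

```python
import json
import os
import signal

CKPT = "ckpt_file"  # name from the thread; the JSON format is an assumption


def save_checkpoint(state, path=CKPT):
    """Write to a temp file, then rename, so a kill mid-write cannot
    leave a truncated checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic when tmp and path are on the same filesystem


class TermFlag:
    """Signal handler that only records that SIGTERM arrived.
    The main loop checks the flag between units of work."""

    def __init__(self):
        self.received = False

    def __call__(self, signum, frame):
        self.received = True


def run(state, total_steps, flag, path=CKPT, every=100):
    """Return True if the work finished, False if it stopped on SIGTERM."""
    while state["step"] < total_steps:
        state["step"] += 1  # stand-in for one unit of real work
        if flag.received:
            save_checkpoint(state, path)  # checkpoint, then leave before SIGKILL
            return False
        if state["step"] % every == 0:
            save_checkpoint(state, path)  # the job's own periodic checkpoint
    save_checkpoint(state, path)
    return True


def main():
    flag = TermFlag()
    signal.signal(signal.SIGTERM, flag)  # install the handler once at startup
    run({"step": 0}, 1000, flag)
```

A real job would call `main()` at startup and would load its starting state from an earlier checkpoint instead of always beginning at step 0. None of this helps if the job is hard-killed, which is the concern with periodic_hold.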

Cheers (from less cold but more snowy Syracuse),
Duncan.

> On Feb 11, 2019, at 3:03 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
> 
> On 2/11/2019 11:50 AM, Duncan Brown wrote:
>> Hi Todd,
>> 
>> Ah, very nice, that's what I need!
>> 
>> Cheers,
>> Duncan.
>> 
> 
> Glad to help!
> 
> best regards from cold and snowy Madison,
> Todd
> 
> 
> 
> 
>>> On Feb 8, 2019, at 12:36 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>>> 
>>> On 2/7/2019 8:00 AM, Duncan Brown wrote:
>>>> Hi Todd,
>>>> 
>>>> Is there a way to tell condor that it's OK if a specific file listed in transfer_input_files does not exist (and the same question with an output file)? The use case is using condor file i/o to manage a checkpoint file.
>>> 
>>> Hi Duncan,
>>> 
>>> TJ already answered the question above, but I am not certain you need to
>>> do the above to handle your checkpoint file use case. :)
>>> 
>>> When your submit file has
>>> 
>>>    when_to_transfer_output = ON_EXIT_OR_EVICT
>>> 
>>> what happens is when your job is evicted, any output files are
>>> transferred back to the SPOOL directory for that job on the submit
>>> machine.  When your job is rescheduled to run again, HTCondor first
>>> sends all the specified transfer_input_files to the execute node, **and
>>> then subsequently also sends all the files stored in SPOOL**.   The
>>> point being your checkpoint file need not be listed explicitly in
>>> transfer_input_files at all... it will get transferred on restart
>>> assuming it was considered output from a previous run.
>>> 
>>> So imagine you have a job that has input data ('my_input_data'), output
>>> data ('my_output_data'), and it periodically writes a checkpoint file
>>> ('ckpt_file').  Your submit file could look like:
>>> 
>>>    executable = foo.exe
>>>    when_to_transfer_output = ON_EXIT_OR_EVICT
>>>    transfer_input_files = my_input_data
>>>    transfer_output_files = my_output_data, ckpt_file
>>> 
>>> With the above, the only issue may be your job going on hold if your job
>>> is evicted before it ever writes out its initial ckpt_file, because it
>>> will not exist and yet is explicitly declared in transfer_output_files.
>>> To prevent this case, you could make a zero-length ckpt_file on
>>> submission, and add it to transfer_input_files.  This way the job will
>>> never go on hold because all files listed in "transfer_output_files"
>>> will always exist.  Because HTCondor first sends the input files and
>>> then sends the spool files, on restart after a ckpt HTCondor will first
>>> send the zero-length ckpt file from transfer_input_files, but then
>>> immediately overwrite it when the ckpt_file contents from the SPOOL
>>> directory (i.e. the ckpt_file contents from the last run) are sent.
>>> 
>>> Hope the above helps,
>>> Todd
>>> 
>>>> The first time the job is run, the checkpoint file does not exist, so the job gets stuck in hold state. I want to be able to tell condor that it's OK that this file is not there.
>>>> 
>>>> Cheers,
>>>> Duncan.
>>>> 
>>> 
>>> 
>>> 
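
One consequence of the zero-length ckpt_file trick in Todd's message quoted above: on its first run the job finds an empty checkpoint file rather than a missing one, so its restore logic has to treat both cases as "start from scratch". A minimal sketch, assuming a JSON checkpoint; `load_checkpoint` and the `{"step": 0}` starting state are hypothetical names for illustration:

```python
import json
import os


def load_checkpoint(path="ckpt_file"):
    """Return the saved state, or a fresh state when the checkpoint file
    is missing or zero-length (the placeholder created at submission)."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return {"step": 0}
    with open(path) as f:
        return json.load(f)
```

A job that tried to parse the empty placeholder directly would fail on its very first run, so the size check matters as much as the existence check.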

-- 

Duncan Brown                              Room 263-1, Physics Department
Charles Brightman Professor of Physics     Syracuse University, NY 13244
http://dabrown.expressions.syr.edu                   Phone: 315 443 5993