[HTCondor-devel] [RE]Re: Delayed Transfer with Signals(self-checkpoint)and checkpoint_exit_code


Date: Tue, 19 Jul 2022 11:02:20 +0900
From: Geonmo Ryu <geonmo@xxxxxxxxxxx>
Subject: [HTCondor-devel] [RE]Re: Delayed Transfer with Signals(self-checkpoint)and checkpoint_exit_code

Hello, Todd.


I think I explained it in a confusing way.


When I tested it, I ran into an inconvenience: as the manual describes, ON_EXIT_OR_EVICT transfers the files listed in transfer_output_files, not those in transfer_checkpoint_files.


We are supposed to list the final result file in transfer_output_files. However, when we ran our test code, the checkpoint was triggered before the result file had been created, and the job was put on hold.
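A minimal sketch of a job with this behaviour may make the timing clearer. It is not the actual test code: the file names, the step count, and the choice of SIGUSR1 as the checkpoint signal are all assumptions made for illustration. The point is only that a checkpoint triggered by the signal leaves a checkpoint file behind, while the result file does not exist until the last step has finished.

```python
import json
import os
import signal
import threading
import time

# Hypothetical file names, chosen only for this sketch.
CHECKPOINT_FILE = "state.ckpt"
RESULT_FILE = "result.txt"
TOTAL_STEPS = 5

# Set by the signal handler, polled by the main loop.
stop_event = threading.Event()


def request_stop(signum, frame):
    """Signal handler: ask the main loop to checkpoint and stop."""
    stop_event.set()


def load_step():
    """Return the step stored in the checkpoint, or 0 on a fresh start."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["step"]
    return 0


def save_step(step):
    """Write the checkpoint via a temporary file and an atomic rename."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, CHECKPOINT_FILE)


def run(total_steps=TOTAL_STEPS, step_seconds=0.0):
    """Return True if the job finished, False if it stopped at a checkpoint."""
    step = load_step()
    while step < total_steps:
        if stop_event.is_set():
            save_step(step)   # only the checkpoint exists at this point
            return False      # RESULT_FILE has not been created yet
        time.sleep(step_seconds)  # stand-in for real work
        step += 1
    with open(RESULT_FILE, "w") as f:
        f.write(f"finished {step} steps\n")
    return True


if __name__ == "__main__":
    # Assumption: SIGUSR1 is the signal configured as the soft-kill signal.
    signal.signal(signal.SIGUSR1, request_stop)
    run(step_seconds=1.0)
```

If the signal arrives before the last step, only state.ckpt is present in the sandbox, which is the case where a transfer list that also names result.txt cannot be satisfied.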


To solve this problem, I modified vanilla_proc.cpp in condor_starter.V6.1.

(https://github.com/geonmo/htcondor/blob/checkpoint_modify/src/condor_starter.V6.1/vanilla_proc.cpp#L953-L958)


I added some code to send the checkpoint files at the point where isSoftKilling is checked, and we confirmed that it works as we intended.


I was wondering whether such a simple modification could cause any problems.


Because we do not understand the whole HTCondor codebase, we are concerned that this change could cause problems elsewhere.



If there is no particular issue, I would like to open a pull request with this code. Since the code is very simple, it will probably need refinement, but I think it is a better way to use the "delayed transfer with signals" method.


Regards,


-- Geonmo



----- Original Message -----
From : Todd L Miller <tlmiller@xxxxxxxxxxx>
To : "Geonmo Ryu" <geonmo@xxxxxxxxxxx>
Cc : <htcondor-devel@xxxxxxxxxxx>
Sent : 2022-07-19 01:44:20
Subject : Re: [HTCondor-devel] Delayed Transfer with Signals(self-checkpoint)and checkpoint_exit_code


I don't know what you're asking here. The "delayed transfer with
signals" method works -- when it does -- by setting the soft-kill
signal to the one which causes the job to produce a checkpoint. If the
job then produces a checkpoint before the soft-kill timeout, and
when_to_transfer_output is set to ON_EXIT_OR_EVICT, and
transfer_output_files includes the checkpoint, then it will be transferred
as a result of the eviction. The last condition wasn't explicitly stated
in that section of the manual, so my apologies if that caused you
confusion.
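As a rough illustration of the three conditions above, the submit commands could be collected as follows. This is a sketch under assumptions, not a recommended configuration: the executable name, the file names, and SIGUSR1 are hypothetical, and render_submit and missing_from_output are illustrative helpers, not part of HTCondor. Only the submit command names (kill_sig, should_transfer_files, when_to_transfer_output, transfer_output_files) are real.

```python
# Submit commands for the "delayed transfer with signals" method.
# All values are hypothetical and chosen only for this sketch.
submit_commands = {
    "executable": "job.py",
    # The soft-kill signal is the one that makes the job checkpoint.
    "kill_sig": "SIGUSR1",
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT_OR_EVICT",
    # The checkpoint must be listed here to be transferred on eviction.
    "transfer_output_files": "state.ckpt, result.txt",
}


def render_submit(commands):
    """Render the commands as submit description text (illustrative helper)."""
    lines = [f"{key} = {value}" for key, value in commands.items()]
    lines.append("queue")
    return "\n".join(lines) + "\n"


def missing_from_output(commands, checkpoint_files):
    """Return checkpoint files that transfer_output_files does not list."""
    listed = {
        name.strip()
        for name in commands["transfer_output_files"].split(",")
    }
    return [name for name in checkpoint_files if name not in listed]


if __name__ == "__main__":
    print(render_submit(submit_commands), end="")
    print("not listed:", missing_from_output(submit_commands, ["state.ckpt"]))
```

Because result.txt is listed alongside the checkpoint, an eviction that happens before the job has written result.txt is the situation reported earlier in this thread as putting the job on hold.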

There is currently no way to use transfer_checkpoint_files instead
of transfer_output_files on an eviction.

- ToddM