
Re: [HTCondor-users] checkpointing the vanilla universe under windows



We are Linux based here in Manchester, but we took advice from Liverpool
who are Windows based, so it should all work.

Here is our web page on the topic:
<http://condor.eps.manchester.ac.uk/examples/user-level-checkpointing-an-example-in-c/>
The has_checkpointing is just a local thing: you don't need it if all your
clients are set up identically and support user checkpointing. To that
end, I believe they need
WANT_VACATE = True
in all their local configs. Then all should work.

Here is the Liverpool page
<http://condor.liv.ac.uk/checkpoint/>

I'm not sure why you have periodic_release=True, we don't use that. But if
you could explain it to me, maybe we should be!

Hope that helps.
-Ian


On 20/05/2013 12:41, "Antonis Sergis" <sergis_antonis@xxxxxxxxxxx> wrote:

>Dear all,
> 
>I am coming back to the hot "checkpointing the vanilla universe" issue
>under Windows. I have a Fortran 90 code which can run for a while. For
>longer runs, Condor's performance drops significantly as jobs get
>interrupted by users. With no native checkpointing function and no way
>to use the "standard" universe, the code has to restart from the
>beginning on a different machine. As a result, few jobs ever manage to
>finish. I changed my source code to accommodate a checkpointing feature.
>The code reads a "flag" file (which is also one of the initial input
>files) and creates a checkpoint file with all the data required to
>resume a job from where it left off. The flag file initially contains a
>"0". The first checkpoint takes place once a given elapsed time has
>passed (after 1 hour, and then every hour from there onwards). The flag
>file is supposed to be updated with a value of "1", and a "history" file
>is created to save the required checkpoint data. The idea is that after
>the code gets evicted and restarts, it will read the flag as "1" and
>then use the "history" file to read the last checkpoint data and resume
>from where it left off. This doesn't seem to be working. I am not sure
>whether the flag file gets updated and re-read when the job restarts.
>I am also not sure whether Condor will be able to read the "history"
>file, which was created as an output file and is not in the initial
>input files list.
>
>
> 
>Any ideas?
> 
>This is the current submit file I am using to accommodate the checkpoint
>function:
> 
>************************
>************************
>Requirements = (Memory >=900) && (Arch=="X86_64") && (OpSys=="WINDOWS")
>Executable = \\htcondor\htcondorjobs\\****\T2\mds.exe
>initialdir = \\htcondor\htcondorjobs\\****\T2
>transfer_input_files = mds.exe, input, flag
>Universe = vanilla
>Getenv = False
>output = Test_cores.out
>error = Test_cores.err
>log = Test_cores.log
>should_transfer_files = ALWAYS
>when_to_transfer_output = ON_EXIT_OR_EVICT
>periodic_release = TRUE
>Queue 250
>************************
>************************
> 
>Regards
>Antonis
>
>
>
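
For illustration only, here is a minimal sketch of the flag/history scheme
described in the quoted message above. It is written in Python rather than
Fortran 90, and it is not the code from either web page. The file names
"flag" and "history" come from the message; everything else (the function
names, the JSON state, checkpointing every N steps instead of every hour,
and the stop_after argument used to simulate an eviction) is an assumption
made for the sketch. The one design point worth copying is the ordering:
the history file is written in full before the flag is set to 1, so a job
killed mid-checkpoint never sees a flag that points at incomplete data.

```python
import json
import os
from pathlib import Path
from typing import Optional


def read_flag(flag_path: Path) -> int:
    """Return the flag value; a missing or empty flag file counts as 0."""
    try:
        return int(flag_path.read_text().strip() or "0")
    except FileNotFoundError:
        return 0


def write_atomically(path: Path, text: str) -> None:
    """Write to a temporary file and rename it into place, so a kill
    mid-write never leaves a half-written file under the real name."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(text)
    os.replace(tmp, path)


def run(workdir: Path, total_steps: int, checkpoint_every: int,
        stop_after: Optional[int] = None) -> dict:
    """Run (or resume) a toy computation that sums 0..total_steps-1.

    stop_after is hypothetical: it simulates an eviction after that many
    steps have been executed in this invocation.
    """
    flag = workdir / "flag"
    history = workdir / "history"

    # Resume from the last checkpoint only if the flag says one exists.
    if read_flag(flag) == 1:
        state = json.loads(history.read_text())
    else:
        state = {"step": 0, "total": 0}

    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            # Simulated eviction: work since the last checkpoint is lost.
            return state
        state["total"] += state["step"]  # stand-in for the real work
        state["step"] += 1
        executed += 1
        if state["step"] % checkpoint_every == 0:
            # History first, flag second (see the note above).
            write_atomically(history, json.dumps(state))
            write_atomically(flag, "1")
    return state
```

Calling run() a second time in the same directory after an interrupted
first call picks up from the last saved step rather than from zero. Whether
the restarted job actually sees the updated flag and history files depends
on how the submit file transfers them, which this sketch does not address.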


-- 
Ian Cottam x61851
IT Services Research Lead
IT Services -- supporting research
The University of Manchester
[ATD - Action This Day - Churchill]