[HTCondor-users] checkpointing the vanilla universe under windows

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Dear all,

I am coming back to the hot “checkpointing the vanilla universe” issue under Windows. I have a Fortran 90 code which can run for a long time. For longer runs, Condor’s throughput drops significantly: jobs get interrupted by users, and with no native checkpointing and no way to use the “standard” universe, an interrupted job has to restart from the beginning on a different machine. As a result, jobs seldom manage to finish.

I changed my source code to add a checkpointing feature. The code reads a “flag” file (which is also one of the initial input files) and creates a checkpoint file with all the data required to resume a job from where it left off. The flag file initially contains “0”. The first checkpoint takes place after one hour of elapsed time, and further checkpoints follow every hour after that. At each checkpoint, the flag file is supposed to be updated to “1” and a “history” file is written with the required checkpoint data. The idea is that when the job is evicted and restarted, the code will read the flag as “1” and then use the “history” file to read the last checkpoint data and resume from where it left off.

This doesn’t seem to be working. I am not sure whether the updated flag file is kept and re-read when the job restarts. I am also not sure whether Condor will make the “history” file available to the restarted job, since it is created as an output file and is not in the initial input files list.
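To make the scheme concrete, here is a minimal sketch of the flag/history logic. It is written in Python for brevity and is not the actual Fortran 90 source. The file name `history`, the contents of the state dictionary, and all function names are assumptions made for illustration. The `interval` parameter stands in for the one-hour checkpoint period.

```python
import json
import os
import time

FLAG_FILE = "flag"            # name taken from transfer_input_files
HISTORY_FILE = "history"      # assumed name of the checkpoint data file
CHECKPOINT_INTERVAL = 3600.0  # seconds; one hour, as described above


def read_flag(path=FLAG_FILE):
    """Return the integer stored in the flag file (a missing or empty file counts as 0)."""
    try:
        with open(path) as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0


def save_checkpoint(state, flag_path=FLAG_FILE, history_path=HISTORY_FILE):
    """Write the history file first, then set the flag to 1."""
    tmp = history_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, history_path)  # rename, so a half-written history file is never read
    with open(flag_path, "w") as f:
        f.write("1\n")


def load_state(flag_path=FLAG_FILE, history_path=HISTORY_FILE):
    """Resume from the history file if the flag is 1 and the file exists, else start fresh."""
    if read_flag(flag_path) == 1 and os.path.exists(history_path):
        with open(history_path) as f:
            return json.load(f)
    return {"step": 0, "total": 0.0}  # hypothetical initial state


def run(n_steps, interval=CHECKPOINT_INTERVAL,
        flag_path=FLAG_FILE, history_path=HISTORY_FILE):
    """Advance the (stand-in) computation to n_steps, checkpointing every `interval` seconds."""
    state = load_state(flag_path, history_path)
    last = time.monotonic()
    while state["step"] < n_steps:
        state["total"] += state["step"]  # placeholder for one unit of real work
        state["step"] += 1
        if time.monotonic() - last >= interval:
            save_checkpoint(state, flag_path, history_path)
            last = time.monotonic()
    return state
```

Writing the history file before setting the flag means a job evicted in the middle of a checkpoint never sees a flag of “1” without a matching history file. Whether both files reach the next machine after an eviction is a separate question that depends on the submit file, not on this logic.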

Any ideas?

This is the current submit file I am using to accommodate the checkpoint function:

************************

Requirements = (Memory >=900) && (Arch=="X86_64") && (OpSys=="WINDOWS")

Executable = \\htcondor\htcondorjobs\\****\T2\mds.exe

initialdir = \\htcondor\htcondorjobs\\****\T2

transfer_input_files = mds.exe, input, flag

Universe = vanilla

Getenv = False

output = Test_cores.out

error = Test_cores.err

log = Test_cores.log

should_transfer_files = ALWAYS

when_to_transfer_output = ON_EXIT_OR_EVICT

periodic_release = TRUE

Queue 250

************************

Regards

Antonis
