
[HTCondor-users] checkpointing the vanilla universe under windows

Dear all,
I am coming back to the hot topic of checkpointing in the vanilla universe under Windows. I have a Fortran 90 code whose runs can take a long time. For longer runs, Condor’s throughput drops significantly: jobs get interrupted by users, and because there is no native checkpointing and we cannot use the “standard” universe, an interrupted job has to restart from the beginning on a different machine. As a result, hardly any jobs manage to finish.

I changed my source code to add a checkpointing feature. The code reads a “flag” file (which is also one of the initial input files) and writes a checkpoint file containing all the data needed to resume a job from where it left off. The flag file initially contains “0”. The first checkpoint takes place after one hour of elapsed time, and further checkpoints follow every hour after that. At a checkpoint, the flag file is supposed to be updated to “1” and a “history” file is written with the required checkpoint data. The idea is that when the job is evicted and restarted, the code will read “1” from the flag file, then read the last checkpoint data from the “history” file and resume from where it left off.

This doesn’t seem to be working. I am not sure whether the flag file actually gets updated and is re-read when the job restarts. I am also not sure whether Condor will make the “history” file available to the restarted job, since it was created as an output file and is not in the initial input files list.
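To make the intended logic concrete, here is a minimal sketch of the flag/history scheme described above. It is written in Python for brevity, not in the Fortran 90 of the real code. The file names flag and history follow the description; the function names, the JSON format, the step counter used in place of the one-hour timer, and the evict_at parameter that simulates an eviction are all assumptions for illustration only.

```python
import json
import os

FLAG_FILE = "flag"        # file name taken from the description above
HISTORY_FILE = "history"  # file name taken from the description above


def read_flag(flag_path=FLAG_FILE):
    """Return True if the flag file contains "1", i.e. a checkpoint exists."""
    try:
        with open(flag_path) as f:
            return f.read().strip() == "1"
    except FileNotFoundError:
        return False


def save_checkpoint(state, flag_path=FLAG_FILE, history_path=HISTORY_FILE):
    """Write the history file first, then set the flag to "1"."""
    tmp_path = history_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    # Rename only after the write completed, so an interruption during
    # the write cannot leave a half-written history file behind.
    os.replace(tmp_path, history_path)
    with open(flag_path, "w") as f:
        f.write("1\n")


def load_state(flag_path=FLAG_FILE, history_path=HISTORY_FILE):
    """Resume from the history file if the flag says so, else start fresh."""
    if read_flag(flag_path) and os.path.exists(history_path):
        with open(history_path) as f:
            return json.load(f)
    return {"step": 0}


def run(total_steps, checkpoint_every,
        flag_path=FLAG_FILE, history_path=HISTORY_FILE, evict_at=None):
    """Hypothetical main loop; a step counter stands in for elapsed time."""
    state = load_state(flag_path, history_path)
    while state["step"] < total_steps:
        if evict_at is not None and state["step"] == evict_at:
            # Simulated eviction: work since the last checkpoint is lost.
            return state
        state["step"] += 1  # stand-in for one unit of real computation
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, flag_path, history_path)
    return state
```

With this sketch, calling run(10, 3, evict_at=7) and then run(10, 3) in the same directory resumes the second call from step 6, the last checkpoint. That only works if the restarted job sees the updated flag file and the history file, which is exactly what I am unsure about under Condor.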
Any ideas?
This is the current submit file I am using to accommodate the checkpoint function:
Requirements = (Memory >=900) && (Arch=="X86_64") && (OpSys=="WINDOWS")
transfer_input_files = mds.exe, input, flag
Universe = vanilla
Getenv = False
output = Test_cores.out
error = Test_cores.err
log = Test_cores.log
should_transfer_files = ALWAYS
when_to_transfer_output = ON_EXIT_OR_EVICT
periodic_release = TRUE
Queue 250