
Re: [HTCondor-users] checkpointing the vanilla universe under windows



Dear Antonis,

 

I don’t have any ideas, but I would be interested to see how you get on with this. I would like to use the checkpointing (dump files) in LS-DYNA to restart an analysis. Has anyone been running LS-DYNA under Windows? It might be a similar case.

 

Andrew

 

 

 

From: htcondor-users-bounces@xxxxxxxxxxx [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Antonis Sergis
Sent: 20 May 2013 19:42
To: Condor Blog
Subject: [HTCondor-users] checkpointing the vanilla universe under windows

 

Dear all,

 

I am coming back to the hot “checkpointing the vanilla universe” issue under Windows. I have a Fortran 90 code which can run for a long time. On longer runs Condor’s performance drops significantly: jobs get interrupted by users, and with no native checkpointing function and no way of using the “standard” universe, the code has to restart from the beginning on a different machine. As a result, jobs seldom manage to finish.

I changed my source code to add a checkpointing feature. The code reads a “flag” file (which is also one of the initial input files) and creates a checkpoint file with all the data required to resume a job from where it left off. The flag file initially contains “0”. The first checkpoint takes place after one hour of elapsed time, and further checkpoints follow every hour after that. At each checkpoint the flag file is supposed to be updated to “1” and a “history” file is written with the required checkpoint data. The idea is that when the job is evicted and restarted, the code will read the flag file as “1” and then use the “history” file to load the last checkpoint data and resume from there.

This doesn’t seem to be working. I am not sure whether the updated flag file is kept and re-read when the job restarts. I am also not sure whether Condor will make the “history” file available to the restarted job, since it is created as an output file and is not in the initial list of input files.
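
The flag/history logic described above can be sketched as follows. This is only an illustration in Python, not the actual Fortran 90 code, which is not shown in this message. The file name "history", the function names, and the use of a step counter in place of the one-hour elapsed-time trigger are all assumptions made for the sketch.

```python
import os

FLAG_FILE = "flag"        # listed in transfer_input_files in the submit file
HISTORY_FILE = "history"  # assumed name for the checkpoint data file


def read_flag(workdir):
    """Return the integer in the flag file; a missing or empty file counts as 0."""
    try:
        with open(os.path.join(workdir, FLAG_FILE)) as fh:
            return int(fh.read().strip() or 0)
    except FileNotFoundError:
        return 0


def write_checkpoint(workdir, step):
    """Write the history file first, then set the flag to 1."""
    hist = os.path.join(workdir, HISTORY_FILE)
    tmp = hist + ".tmp"
    with open(tmp, "w") as fh:
        fh.write(str(step))
    # Swap the finished file into place so a half-written history is not picked up.
    os.replace(tmp, hist)
    with open(os.path.join(workdir, FLAG_FILE), "w") as fh:
        fh.write("1")


def starting_step(workdir):
    """Resume from the history file only if the flag is 1 and the file exists."""
    hist = os.path.join(workdir, HISTORY_FILE)
    if read_flag(workdir) == 1 and os.path.exists(hist):
        with open(hist) as fh:
            return int(fh.read().strip())
    return 0


def run(workdir, total_steps, checkpoint_every, stop_after=None):
    """Advance from the last checkpoint; stop_after simulates an eviction."""
    step = starting_step(workdir)
    done = 0
    while step < total_steps:
        step += 1
        done += 1
        if step % checkpoint_every == 0:
            write_checkpoint(workdir, step)
        if stop_after is not None and done >= stop_after:
            break
    return step
```

The sketch only resumes correctly if both the updated flag file and the history file are present in the working directory when the job starts again, which is exactly the point in question here.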

 

Any ideas?

 

This is the current submit file I am using to accommodate the checkpoint function:

 

************************

************************

Requirements = (Memory >=900) && (Arch=="X86_64") && (OpSys=="WINDOWS")

transfer_input_files = mds.exe, input, flag

Universe = vanilla

Getenv = False

output = Test_cores.out

error = Test_cores.err

log = Test_cores.log

should_transfer_files = ALWAYS

when_to_transfer_output = ON_EXIT_OR_EVICT

periodic_release = TRUE

Queue 250

************************

************************

 

Regards

Antonis
