
Re: [HTCondor-users] checkpointing the vanilla universe under windows



Hello Ian,

Thank you for the useful notes. I have now changed my source code so that I can extract more information about what is going on under condor (I added some counters to count how many times the code has restarted, etc.).

I intentionally vacated the job to see what happens and, to my surprise, the following error is logged by condor each time a job is vacated (asterisks added to protect privacy):

Error from slot2@***.uk: STARTER at 155.*** failed to send file(s) to <155.***>; SHADOW at 155.*** failed to write to file c:\condor/spool\48\0\cluster48.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory
   Code 12 Subcode 2

On the positive side, I can read and write the same file (the code reads the input flag and is then able to change it).

Checking the spool directory, the folder c:\condor/spool\48\0\cluster48.proc0.subproc0.tmp is present but always empty. It appears that at every job vacation the files are not being transferred into the spool directory, so the code cannot be restarted on a different node. The counters in my source code indicate the same (the code basically cannot be restarted, and completes as long as it is left running uninterrupted on a single node).

Any ideas?

P.S. periodic_release=True is to enable more frequent automatic releases of jobs getting held (I am running around 60k jobs).

Antonis

-----Original Message----- From: Ian Cottam
Sent: Monday, May 20, 2013 1:51 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] checkpointing the vanilla universe under windows

We are Linux based here in Manchester, but we took advice from Liverpool
who are Windows based, so it should all work.

Here is our web page on the topic:
<http://condor.eps.manchester.ac.uk/examples/user-level-checkpointing-an-example-in-c/>
The has_checkpointing is just a local thing: you don't need it if all your
clients are set up identically and to support user checkpointing. To that
end, I believe they need
WANT_VACATE = True
in all their local configs. Then all should work.

Here is the Liverpool page
<http://condor.liv.ac.uk/checkpoint/>

I'm not sure why you have periodic_release=True, we don't use that. But if
you could explain it to me, maybe we should be!

Hope that helps.
-Ian


On 20/05/2013 12:41, "Antonis Sergis" <sergis_antonis@xxxxxxxxxxx> wrote:

Dear all,

I am coming back to the hot "checkpointing the vanilla universe issue"
under Windows. I have a Fortran 90 code which can run for a while. For
longer runs, condor's performance drops significantly as jobs get
interrupted by users and, with the lack of a native checkpointing
function and the inability to use the "standard" universe, the code has
to restart from the beginning on a different machine. As a result, jobs
seldom manage to finish. I changed my source code to accommodate a
checkpointing feature.
The code reads a "flag" file (which is also one of the initial input
files) and creates a checkpoint file with all the required data to be
able to resume a job from where it left off. The flag file initially
contains a "0". As soon as a given elapsed time passes (1 hour, and then
every hour from there onwards) the first checkpoint takes place. The
flag file is supposed to be updated with a value of "1" and a "history"
file is created, saving the required checkpoint data. The idea is that
when the code gets evicted and restarts, it will read the flag file as
"1" and then use the "history" file to read the last checkpoint data and
resume from where it left off. This doesn't seem to be working. I am not
sure whether the flag file gets updated and re-read upon restarting the
job.
I am also not sure whether condor will be able to read the "history"
file, which was created as an output file and is not in the initial
input files list.
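
The flag/history scheme described above can be sketched roughly as follows. This is only a minimal illustration in Python, not the Fortran 90 code in question. The file names "flag" and "history" follow the description; the function names load_state, save_state and run, the stop_at argument used to simulate an eviction, and the integer step counter standing in for the real checkpoint data are all assumptions made for the example. It also assumes the job restarts in a directory that already contains the updated flag and history files; whether condor delivers them there depends on the file transfer settings.

```python
import os

FLAG_FILE = "flag"        # "0" = fresh start, "1" = a checkpoint exists
HISTORY_FILE = "history"  # checkpoint data (here just a step counter)


def load_state(workdir="."):
    """Return the step to resume from: 0 on a fresh start, else the saved step."""
    flag_path = os.path.join(workdir, FLAG_FILE)
    history_path = os.path.join(workdir, HISTORY_FILE)
    flag = "0"
    if os.path.exists(flag_path):
        with open(flag_path) as f:
            flag = f.read().strip()
    # Only trust the flag if the history file is really present after restart.
    if flag == "1" and os.path.exists(history_path):
        with open(history_path) as f:
            return int(f.read().strip())
    return 0


def save_state(step, workdir="."):
    """Write the checkpoint first, then set the flag to 1."""
    history_path = os.path.join(workdir, HISTORY_FILE)
    tmp_path = history_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(step))
    # Rename over the old file so a partial write never replaces a good checkpoint.
    os.replace(tmp_path, history_path)
    with open(os.path.join(workdir, FLAG_FILE), "w") as f:
        f.write("1")


def run(total_steps, checkpoint_every, workdir=".", stop_at=None):
    """Run from the last checkpoint; stop_at simulates an eviction (hypothetical)."""
    step = load_state(workdir)
    while step < total_steps:
        if stop_at is not None and step >= stop_at:
            return step  # simulated eviction: work since the last checkpoint is lost
        step += 1
        if step % checkpoint_every == 0:
            save_state(step, workdir)
    return step
```

Writing the history file before setting the flag means a flag of "1" always points at a complete checkpoint. A restarted job that finds the flag still at "0", or no history file, simply starts from the beginning, which is the symptom to look for when the files are not coming back with the job.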


Any ideas?

This is the current submit file I am using to accommodate the checkpoint
function:

************************
************************
Requirements = (Memory >=900) && (Arch=="X86_64") && (OpSys=="WINDOWS")
Executable = \\htcondor\htcondorjobs\\****\T2\mds.exe
initialdir = \\htcondor\htcondorjobs\\****\T2
transfer_input_files = mds.exe, input, flag
Universe = vanilla
Getenv = False
output = Test_cores.out
error = Test_cores.err
log = Test_cores.log
should_transfer_files = ALWAYS
when_to_transfer_output = ON_EXIT_OR_EVICT
periodic_release = TRUE
Queue 250
************************
************************

Regards
Antonis





--
Ian Cottam x61851
IT Services Research Lead
IT Services -- supporting research
The University of Manchester
[ATD - Action This Day - Churchill]



