
Re: [HTCondor-users] checkpointing the vanilla universe under windows



Did you check that every node has
WANT_VACATE = True
in all local configs? And did you restart them?
-Ian



On 20 May 2013, at 17:04, "Antonis Sergis" <sergis_antonis@xxxxxxxxxxx> wrote:

> Hello Ian,
> 
> thank you for the useful notes. I have now been changing my source code to extract more info from Condor about what is going on (I added some counters to count how many times the code has restarted, etc.).
> 
> To my surprise, when I intentionally vacated the job to see what happens, the following error is logged by Condor each time the job is vacated (asterisks added to protect privacy):
> 
> Error from slot2@***.uk: STARTER at 155.*** failed to send file(s) to <155.***>; SHADOW at 155.*** failed to write to file c:\condor/spool\48\0\cluster48.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory
>   Code 12 Subcode 2
> 
> The positive thing, though, is that I can read and write inside the same file (read the input flag and then change it).
> 
> Checking the spool directory, the folder c:\condor/spool\48\0\cluster48.proc0.subproc0.tmp is present but always empty. It appears that on every job vacation the files are not being transferred into the spool directory, so the code cannot be restarted on a different node. The counters from my source code indicate the same: the code cannot be resumed, and only completes if it is left running uninterrupted on a single node.
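One thing worth checking here (an untested guess, not a confirmed fix): with should_transfer_files = ALWAYS and when_to_transfer_output = ON_EXIT_OR_EVICT, the files to be staged back into the spool on eviction can also be named explicitly. A submit-file fragment using the file names from this thread would look like:

```
# Submit-file fragment (sketch): explicitly list the files that must
# survive an eviction so they are staged back to the spool.
# "flag" and "history" are the file names used in this thread.
transfer_output_files = flag, history
```

Whether this has any bearing on the errno-2 failure writing _condor_stdout is untested.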
> 
> Any ideas?
> 
> P.S. periodic_release=True is to enable more frequent automatic releases of jobs getting held (I am running around 60k jobs).
> 
> Antonis
> 
> -----Original Message----- From: Ian Cottam
> Sent: Monday, May 20, 2013 1:51 PM
> To: HTCondor-Users Mail List
> Subject: Re: [HTCondor-users] checkpointing the vanilla universe under windows
> 
> We are Linux based here in Manchester, but we took advice from Liverpool
> who are Windows based, so it should all work.
> 
> Here is our web page on the topic:
> <http://condor.eps.manchester.ac.uk/examples/user-level-checkpointing-an-example-in-c/>
> The has_checkpointing attribute is just a local thing: you don't need it
> if all your clients are set up identically to support user checkpointing.
> To that end, I believe they need
> WANT_VACATE = True
> in all their local configs. Then all should work.
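For concreteness, the local-config fragment in question would be something like the following (the exact file name and location vary by install; this is a sketch):

```
# condor_config.local on each execute node
WANT_VACATE = True
```

After editing, a condor_reconfig (or a restart of the Condor services) on each node picks up the change.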
> 
> Here is the Liverpool page
> <http://condor.liv.ac.uk/checkpoint/>
> 
> I'm not sure why you have periodic_release=True; we don't use that. But if
> you could explain it to me, maybe we should be using it!
> 
> Hope that helps.
> -Ian
> 
> 
> On 20/05/2013 12:41, "Antonis Sergis" <sergis_antonis@xxxxxxxxxxx> wrote:
> 
>> Dear all,
>> 
>> I am coming back to the hot "checkpointing the vanilla universe" issue
>> under Windows. I have a Fortran 90 code which can run for a while. For
>> longer runs, Condor's performance drops significantly as jobs get
>> interrupted by users and, with the lack of a native checkpointing
>> function and the inability to use the "standard" universe, the code
>> has to restart from the beginning on a different machine. As a result,
>> seldom do any jobs manage to finish. I changed my source code to
>> accommodate a checkpointing feature.
>> The code reads a "flag" file (which is also one of the initial input
>> files) and creates a checkpoint file with all the data required to
>> resume a job from where it left off. The flag file initially contains
>> a "0". As soon as a given elapsed time passes (1 hr, and then every
>> hour from there onwards), the first checkpoint takes place. The flag
>> file is supposed to be updated with a value of "1", and a "history"
>> file is created saving the required checkpoint data. The idea is that
>> when the code gets evicted, it will read the flag file as "1" and then
>> use the "history" file to read the last checkpoint data and resume
>> from where it left off. This doesn't seem to be working. I am not sure
>> whether the flag file gets updated and re-read when the job restarts.
>> I am also not sure if Condor will be able to read the "history" file,
>> which was created as an output file and is not in the initial input
>> files list.
>> 
>> 
>> Any ideas?
>> 
>> This is the current submit file I am using to accommodate the checkpoint
>> function:
>> 
>> ************************
>> ************************
>> Requirements = (Memory >=900) && (Arch=="X86_64") && (OpSys=="WINDOWS")
>> Executable = \\htcondor\htcondorjobs\\****\T2\mds.exe
>> initialdir = \\htcondor\htcondorjobs\\****\T2
>> transfer_input_files = mds.exe, input, flag
>> Universe = vanilla
>> Getenv = False
>> output = Test_cores.out
>> error = Test_cores.err
>> log = Test_cores.log
>> should_transfer_files = ALWAYS
>> when_to_transfer_output = ON_EXIT_OR_EVICT
>> periodic_release = TRUE
>> Queue 250
>> ************************
>> ************************
>> 
>> Regards
>> Antonis
> 
> 
> -- 
> Ian Cottam x61851
> IT Services Research Lead
> IT Services -- supporting research
> The University of Manchester
> [ATD - Action This Day - Churchill]
> 
> 
> 
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/ 