Re: [HTCondor-users] using idle computers in computer labs for CFD jobs
- Date: Tue, 01 Mar 2016 09:46:21 +0000
- From: Peter Ellevseth <Peter.Ellevseth@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] using idle computers in computer labs for CFD jobs
I have not done much research on checkpointing yet, so forgive my ignorance. I just have a question about the concept. Is the point simply to give the job some sort of initial value, or does checkpointing involve some sort of memory dump from which the simulation can continue?
For example, when running a set of CFD jobs with Ansys, it is possible to tell the solver to 'start from this file'. Would that be considered checkpointing?
From what I read here: http://condor.eps.manchester.ac.uk/examples/user-level-checkpointing-an-example-in-c/ this seems to be the case.
What, then, is the process of checkpointing? When the job gets a vacate signal, will it run some checkpointing routine? Or will it always check for checkpoint information when the job starts?
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Cottam
Sent: 19 October 2015 17:21
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] using idle computers in computer labs for CFD jobs
On 19/10/2015 03:43, "HTCondor-users on behalf of David Herd"
<htcondor-users-bounces@xxxxxxxxxxx on behalf of d.herd@xxxxxxxxxxx> wrote:
>The jobs we run are mostly CFD and use Ansys. As such we can't link
>them HT Condor modules and it looks like we won't be able to take
>checkpoints of our jobs.
We just encourage our users to write their own checkpointing code under the vanilla universe. We also have templates for e.g. C and MATLAB.
Basically, you have to check on startup for the existence of a checkpoint file and if present start the computation from the point its contents define; and then also periodically update it (or update it on evict).
Condor handles all the rest.
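As a rough illustration of that pattern (not the Manchester template itself), here is a minimal Python sketch. It assumes the job is sent SIGTERM when it is evicted, and the file name checkpoint.json, the helper names and the toy summation loop are all made up for the example. A real Ansys job would instead point the solver at its own restart file.

```python
import json
import os
import signal

CHECKPOINT = "checkpoint.json"  # hypothetical file name, chosen for this sketch

evicted = False  # set by the signal handler when the job is asked to vacate


def on_sigterm(signum, frame):
    """Only record the request; the main loop does the actual file write."""
    global evicted
    evicted = True


def load_checkpoint(path=CHECKPOINT):
    """On startup: resume from the checkpoint if one exists, else start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(state, path=CHECKPOINT):
    """Write to a temporary file and rename, so a kill mid-write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(n_steps, every=100, path=CHECKPOINT):
    """Toy computation: sum the integers 0 .. n_steps-1, checkpointing as it goes."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]
        state["step"] += 1
        # Periodic update, plus an immediate update if we were told to vacate.
        if evicted or state["step"] % every == 0:
            save_checkpoint(state, path)
        if evicted:
            return None  # stop early; the next run picks up from the file
    save_checkpoint(state, path)
    return state["total"]


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, on_sigterm)  # assumption: eviction arrives as SIGTERM
    print(run(1000))
```

The checkpoint file also has to travel with the job between machines (for example via file transfer on eviction); that part is submit-file configuration and is not shown here.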
The very latest Condor (which we don't run) has a little more help for vanilla checkpointing, but it doesn't save the user a lot of code (basically you could just do the file write on evict bit, I think). If nearly all your users run Ansys, you could likely figure out a template for checkpointing that they could all copy.
Ian Cottam | IT Relationship Manager | IT Services | C38 Sackville Street Building | The University of Manchester | M13 9PL |