
Re: [HTCondor-users] using idle computers in computer labs for CFD jobs



Hi

I have not done much research on checkpointing yet, so forgive my ignorance. I just have a question about the concept. Is the point just to give the job some sort of initial value, or does checkpointing involve some sort of memory dump from which the simulation can continue?

For example, when running a set of CFD jobs with Ansys, it is possible to tell the solver to 'start from this file'. Would that be considered checkpointing?
From what I read here: http://condor.eps.manchester.ac.uk/examples/user-level-checkpointing-an-example-in-c/ this seems to be the case.

What, then, is the process of checkpointing? When the job gets a vacate signal, will it then run some checkpointing routine? Or will it always check for checkpoint information when the job starts?

Peter

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Cottam
Sent: 19. oktober 2015 17:21
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] using idle computers in computer labs for CFD jobs

On 19/10/2015 03:43, "HTCondor-users on behalf of David Herd"
<htcondor-users-bounces@xxxxxxxxxxx on behalf of d.herd@xxxxxxxxxxx> wrote:

>The jobs we run are mostly CFD and use Ansys.  As such we can't link
>them with HT Condor modules and it looks like we won't be able to take
>checkpoints of our jobs.

We just encourage our users to write their own checkpointing code under the vanilla universe. We also have templates for e.g. C and MATLAB.
Basically, you have to check on startup for the existence of a checkpoint file and, if one is present, start the computation from the point its contents define; you then also update it periodically (or update it on eviction).
Condor handles all the rest.
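
The check-on-startup and periodic-update pattern described above can be sketched roughly as follows. This is only a minimal illustration in Python, not one of the Manchester templates: the file name checkpoint.json, the helper names load_state, save_state and run, and the toy "solver step" are all assumptions made up for the example.

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical file name, not an HTCondor convention


def load_state(path=CHECKPOINT):
    """Return the saved state if a checkpoint file exists, else a fresh state."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0.0}


def save_state(state, path=CHECKPOINT):
    """Write to a temporary file and rename it, so that being evicted
    in the middle of a write cannot leave a truncated checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run(n_steps, every=100, path=CHECKPOINT):
    """Resume from the checkpoint if present and save every `every` steps."""
    state = load_state(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]  # stand-in for one solver iteration
        state["step"] += 1
        if state["step"] % every == 0:
            save_state(state, path)
    save_state(state, path)  # final state, written on normal completion
    return state
```

For the checkpoint to survive an eviction, the file also has to be brought back from the execute machine; with HTCondor file transfer that usually means setting when_to_transfer_output = ON_EXIT_OR_EVICT in the submit description file, but check this against the manual for the version you run.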

The very latest Condor (which we don't run) has a little more help for vanilla checkpointing, but it doesn't save the user a lot of code (basically, you could get away with doing just the file write on eviction, I think). If nearly all your users run Ansys, you could likely figure out a template for checkpointing that they could all copy.
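
The "file write on evict" variant could look something like the sketch below. It assumes the job is sent SIGTERM when it is asked to vacate (believed to be the default soft-kill signal for vanilla jobs on Unix, but verify this for your pool), and the names EvictionFlag and run_until_evicted, as well as the save callback, are hypothetical.

```python
import signal


class EvictionFlag:
    """Records that a vacate signal arrived; the main loop polls it."""

    def __init__(self):
        self.requested = False

    def install(self, signum=signal.SIGTERM):
        # Assumption: the job is told to vacate via SIGTERM.
        signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        # Do as little as possible in the handler: just set the flag.
        self.requested = True


def run_until_evicted(n_steps, save, flag, state=None):
    """Iterate until done or until eviction is requested.

    `save` is a caller-supplied function that writes the checkpoint.
    Returns (state, finished).
    """
    if state is None:
        state = {"step": 0}
    while state["step"] < n_steps:
        if flag.requested:
            save(state)  # write the checkpoint once, on eviction
            return state, False
        state["step"] += 1  # stand-in for one solver iteration
    save(state)
    return state, True
```

A job would call flag.install() once at startup, before entering the loop. Because the job only gets a limited time between the signal and a hard kill, this approach suits cases where writing the checkpoint is quick; otherwise periodic saves are the safer choice.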

regards
-Ian

--
Ian Cottam  | IT Relationship Manager | IT Services  | C38 Sackville Street Building  |  The University of Manchester  |  M13 9PL  |

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/