
Re: [Condor-users] How to create a Checkpoint Server for window spools



Matt,
Thank you very much for taking the time to write such a detailed explanation. We do have source-level access to our applications, but I don't know whether our programmer will want to modify the code to implement this self-checkpointing feature in the roughly 45 programs we run on Condor. I liked your second suggestion, since all our jobs run on a remote share, but to be honest I lack the knowledge to set up the kind of configuration you created to checkpoint your jobs. From what you describe, it seems pretty complex.
Thank you again for your answer anyway.

Respectfully,
Alex Alas
Systems Administrator
Fugro EarthData Inc.
Tel. 301-948-8550 x219 Fax 301-963-2064 E-mail: aalas@xxxxxxxxxxxxx 
7320 Executive Way, Frederick, MD  21704
Website: http://www.fugroearthdata.com


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Matt Hope
Sent: Monday, October 26, 2009 12:53 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] How to create a Checkpoint Server for window spools

Checkpoint servers are really only relevant to the standard universe, where jobs are relinked against Condor's libraries to allow transparent checkpointing and migration.

This is Unix-only functionality (and even then restricted to certain flavours and combinations).

Theoretically you could achieve it via Unix-targeted binaries running under Cygwin on Windows. I would hate to even try it (though, who knows, someone else might have), but at this stage you're really better off just running Unix!

Windows does not support the standard universe and is unlikely ever to do so.

What may happen in the future is running your Windows jobs as virtual machines (themselves hosted on any OS, quite probably Windows in your case), with checkpointing done by saving the state of the entire VM and, possibly, transplanting a saved image from one box to another.
This is definitely something the Condor team wants to do (they have mentioned it in the past). For a guide to when such functionality might be available, at least for testing, you would have to ask the Condor team.

You may find that you can rework your Windows applications to self-checkpoint. This is a seldom-used feature, but if you have source-level access to your applications you can recode them to respond to the signals Condor sends when it wants to trigger a pre-emption. Condor waits a configurable interval before hard-killing the process, so if you respond (and exit) quickly enough you can use that time to write your restart state out to some remote location, allowing your code to restore its own state when it restarts.

This requires considerable work and is dependent on you being able to save your own state in a recoverable/restartable fashion.

If this sounds like something you would consider, then take a look at http://www.cs.wisc.edu/condor/manual/v7.2/6_2Microsoft_Windows.html#SECTION00721000000000000000, in particular the WM_CLOSE reference.
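
To make the self-checkpointing idea a little more concrete, here is a minimal Python sketch of the save/restore half only. Everything in it is an assumption for illustration: the function names, the JSON state file, and in particular the `stop_requested` callable, which stands in for whatever platform-specific hook tells the program that Condor wants it to exit (trapping WM_CLOSE is not shown). It also assumes that `os.replace` behaves atomically on the share where the state file lives.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write the restart state atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # a reader sees either the old file or the new one


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0, "total": 0}


def run_job(state_path, items, stop_requested):
    """Process items, resuming from the checkpoint; return True when finished."""
    state = load_state(state_path)
    while state["next_item"] < len(items):
        if stop_requested():  # hypothetical hook for the pre-emption request
            save_state(state_path, state)
            return False
        state["total"] += items[state["next_item"]]  # stand-in for real work
        state["next_item"] += 1
    save_state(state_path, state)
    return True
```

When the job is restarted after a pre-emption, calling `run_job` again with the same `state_path` picks up at the first unprocessed item rather than at the beginning.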

I found that a much better way to deal with this was to break our large jobs down into 'steps', where each step ends with an atomically committed file system operation (renaming a temporary directory) on a remote share. All visible side effects of a step are present in the stored files and are recoverable by the system that puts the steps together. Thus every step is itself a checkpoint, and steps may be composed in a tree-like fashion, with sequential and parallel composition well defined.
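
A rough Python sketch of the commit-by-rename idea, not the actual implementation described here. The names `run_step`, `share_root` and `work` are hypothetical, and the sketch assumes that a directory rename on the share is atomic, which should be checked for the file server in use.

```python
import os
import shutil
import uuid


def run_step(share_root, step_name, work):
    """Run work(tmp_dir) once and commit its outputs by renaming the directory."""
    final_dir = os.path.join(share_root, step_name)
    if os.path.isdir(final_dir):
        return final_dir  # already committed, so the step is not run again
    tmp_dir = os.path.join(share_root, f".tmp-{step_name}-{uuid.uuid4().hex}")
    os.makedirs(tmp_dir)
    try:
        work(tmp_dir)                  # all side effects stay inside tmp_dir
        os.rename(tmp_dir, final_dir)  # the commit point
    except Exception:
        shutil.rmtree(tmp_dir, ignore_errors=True)  # leave nothing half-done
        raise
    return final_dir
```

If the job is killed before the rename, only a temporary directory is left behind and the step simply runs again on restart; if it is killed after the rename, the restart finds the committed directory and skips the step.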

A Condor job is (or may be) spawned for a particular step, which may be a leaf node or a composed one. Partial checkpointing can therefore occur within a single job as it commits each sub-step, and there is no need to bother trapping WM_CLOSE messages.
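
The tree of steps could be modelled roughly as follows. This is an illustrative Python sketch under stated assumptions, not the F# code referred to in this message: the `Leaf`, `Seq` and `Par` types and the `run` function are invented names, and the `store` dictionary is a stand-in for the committed directories on the remote share.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Leaf:
    name: str
    work: Callable[[], int]


@dataclass
class Seq:
    name: str
    children: List["Step"]


@dataclass
class Par:
    name: str
    children: List["Step"]


Step = Union[Leaf, Seq, Par]


def run(step, store):
    """Run a step tree, skipping every node already committed in store."""
    if step.name in store:
        return store[step.name]
    if isinstance(step, Leaf):
        result = step.work()
    elif isinstance(step, Seq):
        result = [run(child, store) for child in step.children]
    else:  # Par: children are independent, so they may run concurrently
        with ThreadPoolExecutor() as pool:
            result = list(pool.map(lambda c: run(c, store), step.children))
    store[step.name] = result  # commit; stand-in for the atomic rename
    return result
```

If a leaf fails or the job is pre-empted part-way through, the sub-steps that already committed stay in `store`, so running the same tree again repeats only the unfinished work.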

Writing this setup and making all our code play nicely with it was a significant infrastructure investment, and it requires some discipline to write correctly, but it has already paid off. F# makes much of this far easier, as strong typing on the complex step hierarchy tends to be horrible in languages like C# that lack serious type inference.

Matt

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Alas, Alex [FEDI]
Sent: 23 October 2009 22:18
To: Condor-Users Mail List
Subject: Re: [Condor-users] How to create a Checkpoint Server for window spools

Erik,
Thanks a bunch for responding and for clearing up my misconception that checkpoint servers can be used in Windows pools; I now understand they are meant only for Unix pools. I experienced a few problems in the past weeks when the UPS attached to the execute nodes flipped during the weekend: all the jobs running on those nodes were restarted from the beginning, which delayed their completion by several hours. One of my users asked whether there is a way to configure a checkpoint server. I thought I had read somewhere that checkpoint servers could be configured in Windows pools, which is why I asked. Do you know of a way to work around that limitation?

Respectfully,
Alex Alas
Systems Administrator
Fugro EarthData Inc.

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
Sent: Friday, October 23, 2009 4:31 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] How to create a Checkpoint Server for windowspools

The checkpoint server only runs on Unix platforms.

There's nothing special about running a condor checkpoint server on a
VM - just follow any instruction on how to, say, install Linux under
VMWare, and then follow the Condor manual for how to install the
checkpoint server.

However, I have to ask, if you're only running Windows, why do you
want a checkpoint server? Condor for Windows cannot take advantage of
a checkpoint server.

-Erik

On Thu, Oct 22, 2009 at 11:53 AM, Alas, Alex [FEDI] <aalas@xxxxxxxxxxxxx> wrote:
> Hello to all,
>
> I need to create a checkpoint server for my Windows pool. I know
> this feature is built into Linux pools, but those pools are not an option
> for me since all the executables are meant for Windows systems. I heard it
> is possible to create a Windows checkpoint server using VMs; if anyone has
> done it, could you share some of your knowledge on how to do it?
>
> Thanks for your help in advance,
>
>
>
> Respectfully,
>
> Alex Alas
>
> Systems Administrator
> Fugro EarthData Inc.
>
> Tel. 301-948-8550 x219 Fax 301-963-2064 E-mail: aalas@xxxxxxxxxxxxx
>
> 7320 Executive Way, Frederick, MD  21704
>
> Website: http://www.fugroearthdata.com
>
>
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>
>

