Re: [Condor-users] quick question: is periodic vacate possible



That was indeed me.

We found it problematic: if a job happened to exit for some reason
not associated with the checkpoint process, the infrastructure treated
it as a successful checkpoint. This was infrequent but happened just
often enough to annoy, especially when the reason it exited was that it
had actually completed!
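
To illustrate the ambiguity, here is a minimal sketch of how a wrapper
could tell the three outcomes apart. It is not what we ran and not a
Condor feature: the reserved exit code, the checkpoint file path and the
function name are all assumptions made up for the example.

```python
import os

# Hypothetical: an exit code the job uses only after it has written a checkpoint.
CHECKPOINT_EXIT = 85


def classify_exit(exit_code, checkpoint_path, started_at):
    """Classify a job exit as 'completed', 'checkpointed' or 'failed'.

    An exit only counts as a checkpoint if the job used the reserved
    exit code AND left a checkpoint file written during this run.
    """
    if exit_code == 0:
        return "completed"
    if exit_code == CHECKPOINT_EXIT:
        try:
            fresh = os.path.getmtime(checkpoint_path) >= started_at
        except OSError:
            fresh = False  # no checkpoint file at all
        if fresh:
            return "checkpointed"
    return "failed"
```

The point is only that "the process exited" is never taken as evidence
of a checkpoint on its own.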

Instead we ended up implementing our own system whereby we can
specify, in code, how a job is broken down into steps, and which deals
with committing steps atomically, executing branches of it on the farm,
handling iteration, etc. That was a lot of effort (the best part of a
month, plus plenty of tuning, tweaking and bug fixes), but the payoff
afterwards is huge, and it actually ends up being easier to deal with
in terms of defining checkpoint-friendly subsets of a job.
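
As a rough idea of what "committing steps atomically" can look like,
here is a minimal sketch. It is not our actual system (which also
handles branches and iteration); the state file format and the names
load_done, commit and run_steps are invented for the example, and it
assumes the state file lives on a filesystem where rename is atomic.

```python
import json
import os
import tempfile


def load_done(state_path):
    """Return the set of step names already committed (empty if none)."""
    try:
        with open(state_path) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()


def commit(state_path, done):
    """Record completed steps by writing a temp file and renaming it over
    the old state, so a reader never sees a half-written file."""
    directory = os.path.dirname(os.path.abspath(state_path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(sorted(done), f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, state_path)


def run_steps(steps, state_path):
    """Run (name, func) steps in order, skipping ones already committed.

    Returns the names of the steps executed in this invocation.
    """
    done = load_done(state_path)
    executed = []
    for name, func in steps:
        if name in done:
            continue
        func()
        done.add(name)
        commit(state_path, done)  # a step counts only once this returns
        executed.append(name)
    return executed
```

If the job is evicted mid-step, rerunning it with the same state file
resumes at the first uncommitted step, and a job that has finished
simply has nothing left to run.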

If you wanted to implement the vanilla checkpointing then Ian's quick
start guide is the way to go.

Matt

On Monday, June 21, 2010, Ian Chesal <ian.chesal@xxxxxxxxx> wrote:
> Someone is doing this now: checkpointing vanilla jobs with a signal from Condor. I think it's Matt Hope. I recall posts to the list about this very topic maybe 2 years back. It's possible.
>
> - Ian