Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Checkpointing on Windows pool PCs: I need little help...

Date: Mon, 19 Sep 2011 15:49:24 -0500
From: "John (TJ) Knoeller" <johnkn@xxxxxxxxxxx>
Subject: Re: [Condor-users] Checkpointing on Windows pool PCs: I need little help...

Currently, for vanilla universe jobs, the checkpoint files are only moved off of the execute machine when the job is vacated or evicted.


In order to get this to work, the submit file for the job needs to have

    when_to_transfer_output = ON_EXIT_OR_EVICT

And the submit machine must still be running (i.e., the librarian needs to shut that machine down only after the execute machines).
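
For illustration, a complete submit file built around that setting might look roughly like the sketch below. The executable and file names are hypothetical placeholders, and should_transfer_files = YES is an assumption here: it turns on Condor's file transfer mechanism, which when_to_transfer_output depends on.

    # Sketch only: my_job.exe and the output/error/log names are placeholders.
    universe                = vanilla
    executable              = my_job.exe
    # Assumed: use Condor's file transfer rather than a shared file system.
    should_transfer_files   = YES
    # Transfer the job's files back on normal exit and also on eviction.
    when_to_transfer_output = ON_EXIT_OR_EVICT
    output                  = my_job.out
    error                   = my_job.err
    log                     = my_job.log
    queue

The job itself still has to write its checkpoint files into its working directory and know how to resume from them when it is restarted.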


That's about the best you can hope for with the current version of Condor.


On 9/16/2011 9:52 AM, Rob wrote:

Hi,

Here are my observations on checkpointing with Windows:

A running program does indeed receive the CTRL_SHUTDOWN_EVENT when Windows shuts down (and there is enough time to create checkpoint files on the local machine), but by then Condor and/or the network are apparently already in a "dead-enough" state, so that communication with the condor master is no longer possible.
Upon boot-up, the Windows computer cleans up the remains of previous jobs, so the job's history/checkpoint data is lost.

The only remedy here is to do regular checkpointing.

But how can I tell Condor to transfer the checkpoint files from the pool PC to the master, without evicting the job?

Thanks,

Rob

