
Re: [condor-users] restart after checkpointing



On Sun, Mar 14, 2004 at 04:38:23PM -0600, duncan brown wrote:

> I'm trying to make sure that our condor pool is correctly restarting
> code after checkpointing. We have set a checkpoint server up on each
> node of the beowulf and it looks like when code is preempted it
> correctly goes to the checkpoint server.
> 
> However, I can't tell from the StarterLog if it is correctly starting
> the checkpointed code again. I have attached a log file with an
> example. It looks like the node is correctly starting the
> checkpointed version, but the only thing that confused me was the
> execve() message. It seems that the checkpointed image is being
> started with the full command line options rather than the
> -_condor_restart option. (This is how I tested with standalone
> checkpointing).

Duncan,

As far as I know, your job is always started from the initial checkpoint
image with the "-_condor_cmd_fd" parameter and is then told to restore a
checkpoint image.  To see the checkpoint being loaded, look in the
"ShadowLog" on the submit machine.  For example:


3/14 12:49:02 (2194.176) (6697):Requesting Primary Starter
3/14 12:49:02 (2194.176) (6697):Shadow: Request to run a job was ACCEPTED
3/14 12:49:02 (2194.176) (6697):Shadow: RSC_SOCK connected, fd = 17
3/14 12:49:02 (2194.176) (6697):Shadow: CLIENT_LOG connected, fd = 18
3/14 12:49:02 (2194.176) (6697):My_Filesystem_Domain = "lmcg.wisc.edu"
3/14 12:49:02 (2194.176) (6697):My_UID_Domain = "lmcg.wisc.edu"
3/14 12:49:02 (2194.176) (6697):        Entering pseudo_get_file_stream
3/14 12:49:02 (2194.176) (6697):        file = "/home/condor/LINUX/hosts/condor/spool/cluster2194.ickpt.subproc0"
3/14 12:49:02 (2194.176) (6697):        144.92.101.149
3/14 12:49:02 (2194.176) (6697):        144.92.101.149
3/14 12:49:02 (2194.176) (6697):Reaped child status - pid 6698 exited with status 0
3/14 12:49:02 (2194.176) (6697):Read: condor_restart:
3/14 12:49:02 (2194.176) (6697):Read: Checkpoint file name is "/home/condor/LINUX/hosts/condor/spool/cluster2194.proc176.subproc0"
3/14 12:49:02 (2194.176) (6697):        Entering pseudo_get_file_stream
3/14 12:49:02 (2194.176) (6697):        file = "/home/condor/LINUX/hosts/condor/spool/cluster2194.proc176.subproc0"
3/14 12:49:02 (2194.176) (6697):        128.105.121.41
3/14 12:49:02 (2194.176) (6697):RestoreRequest returned 0 using port 39753
3/14 12:49:02 (2194.176) (6697):Read: Opened "/home/condor/LINUX/hosts/condor/spool/cluster2194.proc176.subproc0" via file stream
3/14 12:49:02 (2194.176) (6697):Read: Read headers OK
3/14 12:49:02 (2194.176) (6697):Read: Read SegMap[0](DATA) OK
3/14 12:49:02 (2194.176) (6697):Read: Read SegMap[1](STACK) OK
3/14 12:49:02 (2194.176) (6697):Read: Read all SegMaps OK
3/14 12:49:02 (2194.176) (6697):Read: Found a DATA block, increasing heap from 0xab0b000 to 0x4bdd3000
3/14 12:49:02 (2194.176) (6697):Read: About to overwrite 0x43c40000 bytes starting at 0x8193000(DATA)
3/14 12:51:14 (2194.176) (6697):Read: About to overwrite 0x4efff bytes starting at 0xbffb1000(STACK)
3/14 12:51:14 (2194.176) (6697):Read: USER PROC: CHECKPOINT IMAGE RECEIVED OK
3/14 12:51:14 (2194.176) (6697):Read: About to restore file state


Here you can find the checkpoint file name and see the memory being
overwritten by the checkpoint image.
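
If you would rather not read the ShadowLog by eye, the check can be
scripted.  What follows is only a rough sketch in Python, not part of
Condor: the function name summarize_restarts is made up for
illustration, and it assumes every line has the layout shown in the
excerpt above (date, time, "(cluster.proc)", "(pid):", message).  It
relies on the two messages that appear in that excerpt, 'Checkpoint
file name is "..."' and "CHECKPOINT IMAGE RECEIVED OK"; other Condor
versions may word them differently.

```python
import re

# Line layout assumed from the ShadowLog excerpt above, e.g.
#   3/14 12:49:02 (2194.176) (6697):Read: Read headers OK
LINE_RE = re.compile(
    r'^(?P<date>\d+/\d+) (?P<time>\d+:\d+:\d+) '
    r'\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\):(?P<msg>.*)$'
)
# Message texts taken verbatim from the excerpt.
CKPT_NAME_RE = re.compile(r'Checkpoint file name is "([^"]+)"')
RECEIVED_MARKER = "CHECKPOINT IMAGE RECEIVED OK"


def summarize_restarts(lines):
    """Map each job id (cluster.proc) to the checkpoint file it was told
    to restore, if any, and whether the image was reported received."""
    jobs = {}
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue  # ignore lines that do not follow the assumed layout
        entry = jobs.setdefault(
            m.group("job"), {"checkpoint": None, "restored": False}
        )
        msg = m.group("msg")
        name = CKPT_NAME_RE.search(msg)
        if name:
            entry["checkpoint"] = name.group(1)
        if RECEIVED_MARKER in msg:
            entry["restored"] = True
    return jobs
```

To use it, open the ShadowLog on the submit machine and pass the file
object to summarize_restarts().  A job that shows a checkpoint file name
but never reaches "restored" is the case worth looking at more closely.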

-- 
Daniel K. Forrest	Laboratory for Molecular and
forrest@xxxxxxxxxxxxx	Computational Genomics
			University of Wisconsin, Madison