
Re: [Condor-users] unable to restart from a "checkpointed" process



Thanks Daniel,

Here's the complete scenario with debug turned on. It seems to die while restoring the
shared library segments on the second restart.

Adrian
---
%>ckpt_test -_condor_D_ALL
User Job - $CondorPlatform: SUN4X-SOLARIS29 $
User Job - $CondorVersion: 6.6.11 Mar 23 2006 $
Condor: Notice: Will checkpoint to ckpt_test.ckpt
Condor: Notice: Remote system calls disabled.
i: 0
i: 1
i: 2
i: 3
i: 4
i: 5
i: 6
i: 7
Got SIGTSTP
Saved signal state.
About to save file state
CondorFileTable::checkpoint

OPEN FILE TABLE:
fd 0
   logical name: default stdin
   offset:       0
   dups:         1
   open flags:   0x0
   not currently bound to a url.
fd 1
   logical name: default stdout
   offset:       40
   dups:         1
   open flags:   0x1
   url:          fd:1
   size:         40
   opens:        1
fd 2
   logical name: default stderr
   offset:       0
   dups:         1
   open flags:   0x1
   not currently bound to a url.
working dir = /import/dais-data/members/at136596/lgf
Done saving file state
About to update MyImage
About to ask the OS for segments...
I should have 30 segments...
Image::AddSegment: name=[DATA], start=[0xb6000], end=[0x170000], len=[0xba000], prot=[0x3]
I just added the data segment
Skipping Text Segment
Don't add DATA segment again
Don't add DATA segment again
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xfef40000], end=[0xfef5c000], len=[0x1c000], prot=[0x5]
   len:[0x1c000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xfef6a000], end=[0xfef6e000], len=[0x4000], prot=[0x7]
   len:[0x4000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xfef80000], end=[0xff02c000], len=[0xac000], prot=[0x5]
   len:[0xac000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff03c000], end=[0xff042000], len=[0x6000], prot=[0x7]
   len:[0x6000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff042000], end=[0xff044000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff080000], end=[0xff10e000], len=[0x8e000], prot=[0x5]
   len:[0x8e000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff11e000], end=[0xff128000], len=[0xa000], prot=[0x7]
   len:[0xa000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff128000], end=[0xff12e000], len=[0x6000], prot=[0x7]
   len:[0x6000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff140000], end=[0xff142000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff150000], end=[0xff15a000], len=[0xa000], prot=[0x5]
   len:[0xa000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff16a000], end=[0xff16c000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff170000], end=[0xff172000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff180000], end=[0xff1b8000], len=[0x38000], prot=[0x5]
   len:[0x38000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff1c6000], end=[0xff1c8000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff1d0000], end=[0xff1dc000], len=[0xc000], prot=[0x5]
   len:[0xc000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff1ea000], end=[0xff1ec000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff1ec000], end=[0xff1f0000], len=[0x4000], prot=[0x7]
   len:[0x4000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff200000], end=[0xff36a000], len=[0x16a000], prot=[0x5]
   len:[0x16a000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff378000], end=[0xff382000], len=[0xa000], prot=[0x7]
   len:[0xa000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff390000], end=[0xff394000], len=[0x4000], prot=[0x5]
   len:[0x4000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff3a4000], end=[0xff3a6000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff3b0000], end=[0xff3b2000], len=[0x2000], prot=[0x5]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff3c0000], end=[0xff3f0000], len=[0x30000], prot=[0x5]
   len:[0x30000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff3f0000], end=[0xff3f2000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff3f2000], end=[0xff3f4000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff3fa000], end=[0xff3fc000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding Stack Segment
Image::AddSegment: name=[STACK], start=[0xffbe6000], end=[0xffc00000], len=[0x1a000], prot=[0x3]
stack start = 0xffbe6000, stack end = 0xffc00000
Pos: 763712
Pos: 878400
Pos: 894784
Pos: 1599296
Pos: 1623872
Pos: 1632064
Pos: 2213696
Pos: 2254656
Pos: 2279232
Pos: 2287424
Pos: 2328384
Pos: 2336576
Pos: 2344768
Pos: 2574144
Pos: 2582336
Pos: 2631488
Pos: 2639680
Pos: 2656064
Pos: 4138816
Pos: 4179776
Pos: 4196160
Pos: 4204352
Pos: 4212544
Pos: 4409152
Pos: 4417344
Pos: 4425536
Pos: 4433728
Pos: 4540224
Size of ckpt image = 4540224 bytes
About to write checkpoint
Image::Write(): fd -1 file_name ckpt_test.ckpt
Checkpoint name is "ckpt_test.ckpt"
Tmp name is "ckpt_test.ckpt.tmp"
Wrote headers OK
Wrote all SegMaps OK
Writing compressed segments...
Wrote Segment[0] of type DATA -> OK
Wrote Segment[1] of type SHARED LIB -> OK
Wrote Segment[2] of type SHARED LIB -> OK
Wrote Segment[3] of type SHARED LIB -> OK
Wrote Segment[4] of type SHARED LIB -> OK
Wrote Segment[5] of type SHARED LIB -> OK
Wrote Segment[6] of type SHARED LIB -> OK
Wrote Segment[7] of type SHARED LIB -> OK
Wrote Segment[8] of type SHARED LIB -> OK
Wrote Segment[9] of type SHARED LIB -> OK
Wrote Segment[10] of type SHARED LIB -> OK
Wrote Segment[11] of type SHARED LIB -> OK
Wrote Segment[12] of type SHARED LIB -> OK
Wrote Segment[13] of type SHARED LIB -> OK
Wrote Segment[14] of type SHARED LIB -> OK
Wrote Segment[15] of type SHARED LIB -> OK
Wrote Segment[16] of type SHARED LIB -> OK
Wrote Segment[17] of type SHARED LIB -> OK
Wrote Segment[18] of type SHARED LIB -> OK
Wrote Segment[19] of type SHARED LIB -> OK
Wrote Segment[20] of type SHARED LIB -> OK
Wrote Segment[21] of type SHARED LIB -> OK
Wrote Segment[22] of type SHARED LIB -> OK
Wrote Segment[23] of type SHARED LIB -> OK
Wrote Segment[24] of type SHARED LIB -> OK
Wrote Segment[25] of type SHARED LIB -> OK
Wrote Segment[26] of type SHARED LIB -> OK
Wrote Segment[27] of type STACK -> OK
Wrote all Segments OK
About to close ckpt fd (3)
Closed OK
About to rename "ckpt_test.ckpt.tmp" to "ckpt_test.ckpt"
Renamed OK
USER PROC: CHECKPOINT IMAGE SENT OK
Ckpt exit
User signal 2
%>ckpt_test -_condor_restart ckpt_test.ckpt
Condor: Notice: Will restart from ckpt_test.ckpt
Restoring a SHARED LIB segment
About to overwrite 0x1c000 bytes starting at 0xfef40000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x4000 bytes starting at 0xfef6a000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0xac000 bytes starting at 0xfef80000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x6000 bytes starting at 0xff03c000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x2000 bytes starting at 0xff042000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x8e000 bytes starting at 0xff080000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0xa000 bytes starting at 0xff11e000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x6000 bytes starting at 0xff128000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x2000 bytes starting at 0xff140000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0xa000 bytes starting at 0xff150000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x2000 bytes starting at 0xff16a000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x2000 bytes starting at 0xff170000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x38000 bytes starting at 0xff180000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x2000 bytes starting at 0xff1c6000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0xc000 bytes starting at 0xff1d0000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x2000 bytes starting at 0xff1ea000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x4000 bytes starting at 0xff1ec000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x16a000 bytes starting at 0xff200000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0xa000 bytes starting at 0xff378000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x4000 bytes starting at 0xff390000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x2000 bytes starting at 0xff3a4000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x2000 bytes starting at 0xff3b0000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x30000 bytes starting at 0xff3c0000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x2000 bytes starting at 0xff3f0000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x2000 bytes starting at 0xff3f2000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x2000 bytes starting at 0xff3fa000(SHARED LIB)
About to execute on TmpStk
About to execute on tmpstack.
Beginning Execution on TmpStack.
RestoreStack() Entrance!
Restoring a STACK segment
About to overwrite 0x1a000 bytes starting at 0xffbe6000(STACK)
RestoreStack() Exit!
About to restore file state
CondorFileTable::resume
working dir = /import/dais-data/members/at136596/lgf

OPEN FILE TABLE:
fd 0
   logical name: default stdin
   offset:       0
   dups:         1
   open flags:   0x0
   not currently bound to a url.
fd 1
   logical name: default stdout
   offset:       40
   dups:         1
   open flags:   0x1
   not currently bound to a url.
fd 2
   logical name: default stderr
   offset:       0
   dups:         1
   open flags:   0x1
   not currently bound to a url.
Done restoring file state
About to restore signal state
About to return to user code
i: 8
i: 9
i: 10
i: 11
i: 12
i: 13
i: 14
i: 15
Got SIGTSTP
Saved signal state.
About to save file state
CondorFileTable::checkpoint

OPEN FILE TABLE:
fd 0
   logical name: default stdin
   offset:       0
   dups:         1
   open flags:   0x0
   not currently bound to a url.
fd 1
   logical name: default stdout
   offset:       86
   dups:         1
   open flags:   0x1
   url:          fd:1
   size:         86
   opens:        1
fd 2
   logical name: default stderr
   offset:       0
   dups:         1
   open flags:   0x1
   not currently bound to a url.
working dir = /import/dais-data/members/at136596/lgf
Done saving file state
About to update MyImage
About to ask the OS for segments...
I should have 34 segments...
Image::AddSegment: name=[DATA], start=[0xb6000], end=[0x170000], len=[0xba000], prot=[0x3]
I just added the data segment
Skipping Text Segment
Don't add DATA segment again
Don't add DATA segment again
Adding SHARED LIB
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xfe542000], end=[0xfe544000], len=[0x2000], prot=[0x3]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xfe544000], end=[0xfe554000], len=[0x10000], prot=[0x3]
   len:[0x10000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xfe554000], end=[0xfe55e000], len=[0xa000], prot=[0x3]
   len:[0xa000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xfef40000], end=[0xfef5c000], len=[0x1c000], prot=[0x7]
   len:[0x1c000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xfef6a000], end=[0xfef6e000], len=[0x4000], prot=[0x7]
   len:[0x4000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xfef80000], end=[0xff02c000], len=[0xac000], prot=[0x7]
   len:[0xac000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff03c000], end=[0xff042000], len=[0x6000], prot=[0x7]
   len:[0x6000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff042000], end=[0xff044000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff080000], end=[0xff10e000], len=[0x8e000], prot=[0x7]
   len:[0x8e000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff11e000], end=[0xff128000], len=[0xa000], prot=[0x7]
   len:[0xa000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff128000], end=[0xff12e000], len=[0x6000], prot=[0x7]
   len:[0x6000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff140000], end=[0xff142000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff150000], end=[0xff15a000], len=[0xa000], prot=[0x7]
   len:[0xa000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff16a000], end=[0xff16c000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff170000], end=[0xff172000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff180000], end=[0xff1b8000], len=[0x38000], prot=[0x7]
   len:[0x38000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff1c6000], end=[0xff1c8000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff1d0000], end=[0xff1dc000], len=[0xc000], prot=[0x7]
   len:[0xc000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff1ea000], end=[0xff1ec000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff1ec000], end=[0xff1f0000], len=[0x4000], prot=[0x7]
   len:[0x4000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff200000], end=[0xff36a000], len=[0x16a000], prot=[0x7]
   len:[0x16a000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff378000], end=[0xff382000], len=[0xa000], prot=[0x7]
   len:[0xa000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff390000], end=[0xff394000], len=[0x4000], prot=[0x7]
   len:[0x4000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff3a4000], end=[0xff3a6000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff3b0000], end=[0xff3b2000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff3c0000], end=[0xff3f0000], len=[0x30000], prot=[0x7]
   len:[0x30000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff3f0000], end=[0xff3f2000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff3f2000], end=[0xff3f4000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding SHARED LIB
Image::AddSegment: name=[SHARED LIB], start=[0xff3fa000], end=[0xff3fc000], len=[0x2000], prot=[0x7]
   len:[0x2000]
Adding Stack Segment
Image::AddSegment: name=[STACK], start=[0xffbe6000], end=[0xffc00000], len=[0x1a000], prot=[0x3]
stack start = 0xffbe6000, stack end = 0xffc00000
Pos: 763808
Pos: 772000
Pos: 837536
Pos: 878496
Pos: 993184
Pos: 1009568
Pos: 1714080
Pos: 1738656
Pos: 1746848
Pos: 2328480
Pos: 2369440
Pos: 2394016
Pos: 2402208
Pos: 2443168
Pos: 2451360
Pos: 2459552
Pos: 2688928
Pos: 2697120
Pos: 2746272
Pos: 2754464
Pos: 2770848
Pos: 4253600
Pos: 4294560
Pos: 4310944
Pos: 4319136
Pos: 4327328
Pos: 4523936
Pos: 4532128
Pos: 4540320
Pos: 4548512
Pos: 4655008
Size of ckpt image = 4655008 bytes
About to write checkpoint
Image::Write(): fd -1 file_name ckpt_test.ckpt
Checkpoint name is "ckpt_test.ckpt"
Tmp name is "ckpt_test.ckpt.tmp"
Wrote headers OK
Wrote all SegMaps OK
Writing compressed segments...
Wrote Segment[0] of type DATA -> OK
Wrote Segment[1] of type SHARED LIB -> OK
Wrote Segment[2] of type SHARED LIB -> OK
Wrote Segment[3] of type SHARED LIB -> OK
Wrote Segment[4] of type SHARED LIB -> OK
Wrote Segment[5] of type SHARED LIB -> OK
Wrote Segment[6] of type SHARED LIB -> OK
Wrote Segment[7] of type SHARED LIB -> OK
Wrote Segment[8] of type SHARED LIB -> OK
Wrote Segment[9] of type SHARED LIB -> OK
Wrote Segment[10] of type SHARED LIB -> OK
Wrote Segment[11] of type SHARED LIB -> OK
Wrote Segment[12] of type SHARED LIB -> OK
Wrote Segment[13] of type SHARED LIB -> OK
Wrote Segment[14] of type SHARED LIB -> OK
Wrote Segment[15] of type SHARED LIB -> OK
Wrote Segment[16] of type SHARED LIB -> OK
Wrote Segment[17] of type SHARED LIB -> OK
Wrote Segment[18] of type SHARED LIB -> OK
Wrote Segment[19] of type SHARED LIB -> OK
Wrote Segment[20] of type SHARED LIB -> OK
Wrote Segment[21] of type SHARED LIB -> OK
Wrote Segment[22] of type SHARED LIB -> OK
Wrote Segment[23] of type SHARED LIB -> OK
Wrote Segment[24] of type SHARED LIB -> OK
Wrote Segment[25] of type SHARED LIB -> OK
Wrote Segment[26] of type SHARED LIB -> OK
Wrote Segment[27] of type SHARED LIB -> OK
Wrote Segment[28] of type SHARED LIB -> OK
Wrote Segment[29] of type SHARED LIB -> OK
Wrote Segment[30] of type STACK -> OK
Wrote all Segments OK
About to close ckpt fd (3)
Closed OK
About to rename "ckpt_test.ckpt.tmp" to "ckpt_test.ckpt"
Renamed OK
USER PROC: CHECKPOINT IMAGE SENT OK
Ckpt exit
User signal 2
%>ckpt_test -_condor_restart ckpt_test.ckpt
Condor: Notice: Will restart from ckpt_test.ckpt
Restoring a SHARED LIB segment
About to overwrite 0x2000 bytes starting at 0xfe542000(SHARED LIB)
Restoring a SHARED LIB segment
About to overwrite 0x10000 bytes starting at 0xfe544000(SHARED LIB)
zlib (inflate): unknown compression method
Killed
%>
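
For anyone comparing the two checkpoints by hand, here is a minimal sketch of a script that pulls the "Image::AddSegment" lines out of each checkpoint's debug output and lists the segments that appear only in the second one. It is not part of Condor: the names parse_segments and new_segments are hypothetical, the regular expression assumes the exact line format shown in the output above, and the two sample strings are short excerpts copied from that output rather than the full logs.

```python
import re

# Matches the "Image::AddSegment" debug lines exactly as they appear above.
SEGMENT_RE = re.compile(
    r"Image::AddSegment: name=\[(?P<name>[^\]]+)\], "
    r"start=\[(?P<start>0x[0-9a-f]+)\], end=\[(?P<end>0x[0-9a-f]+)\], "
    r"len=\[(?P<len>0x[0-9a-f]+)\], prot=\[(?P<prot>0x[0-9a-f]+)\]"
)


def parse_segments(log_text):
    """Return {start_address: (name, length, prot)} for each AddSegment line."""
    segments = {}
    for m in SEGMENT_RE.finditer(log_text):
        start = int(m["start"], 16)
        segments[start] = (m["name"], int(m["len"], 16), int(m["prot"], 16))
    return segments


def new_segments(first_log, second_log):
    """Return segments whose start address appears only in the second log."""
    before = parse_segments(first_log)
    after = parse_segments(second_log)
    return {addr: after[addr] for addr in sorted(after) if addr not in before}


# Excerpt from the first checkpoint's output (not the full log).
FIRST_CKPT = """\
Image::AddSegment: name=[DATA], start=[0xb6000], end=[0x170000], len=[0xba000], prot=[0x3]
Image::AddSegment: name=[SHARED LIB], start=[0xfef40000], end=[0xfef5c000], len=[0x1c000], prot=[0x5]
Image::AddSegment: name=[SHARED LIB], start=[0xfef6a000], end=[0xfef6e000], len=[0x4000], prot=[0x7]
"""

# Excerpt from the second checkpoint's output (not the full log).
SECOND_CKPT = """\
Image::AddSegment: name=[DATA], start=[0xb6000], end=[0x170000], len=[0xba000], prot=[0x3]
Image::AddSegment: name=[SHARED LIB], start=[0xfe542000], end=[0xfe544000], len=[0x2000], prot=[0x3]
Image::AddSegment: name=[SHARED LIB], start=[0xfe544000], end=[0xfe554000], len=[0x10000], prot=[0x3]
Image::AddSegment: name=[SHARED LIB], start=[0xfe554000], end=[0xfe55e000], len=[0xa000], prot=[0x3]
Image::AddSegment: name=[SHARED LIB], start=[0xfef40000], end=[0xfef5c000], len=[0x1c000], prot=[0x7]
Image::AddSegment: name=[SHARED LIB], start=[0xfef6a000], end=[0xfef6e000], len=[0x4000], prot=[0x7]
"""

if __name__ == "__main__":
    for addr, (name, length, prot) in new_segments(FIRST_CKPT, SECOND_CKPT).items():
        print(f"{name} at {addr:#x}: len={length:#x} prot={prot:#x}")
```

On these excerpts it reports the three segments starting at 0xfe542000, 0xfe544000 and 0xfe554000. The second of these is the one being restored when the zlib error appears. Their lengths sum to 0x1c000 (114688) bytes, which accounts for all but 96 bytes of the growth in the reported image size (4655008 - 4540224 = 114784).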

Daniel Forrest wrote:
Adrian,

Here's the scenario:

1. start program that prints a count every second.
2. checkpoint the process (send SIGSTOP)
3. restart the process
4. checkpoint the process (send SIGSTOP)
5. restart the process

I assume you actually mean SIGTSTP since that is the checkpoint and
exit signal.

% ckpt_test

I would suggest running it again with debug enabled:

% ckpt_test -_condor_D_ALL

...

% ckpt_test -_condor_restart ckpt_test.ckpt

You don't need the debug here because it was saved in the checkpoint.

This should provide a clue as to what is going on when it dies.