
[Condor-users] Errno=14, taking checkpoint does not complete



Hi, I have an application compiled with condor_compile. I am trying to run it in standalone mode using:
./executable input -_condor_D_ALL

Then, from another shell, I send the checkpoint signal: kill -USR2 pid
But this is what I get:

..............................................
Got SIGUSR2
Saved signal state.
About to save file state
CondorFileTable::checkpoint

OPEN FILE TABLE:
fd 0
        logical name: default stdin
        offset:       0
        dups:         1
        open flags:   0x0
        not currently bound to a url.
fd 1
        logical name: default stdout
        offset:       315
        dups:         1
        open flags:   0x1
        url:          fd:1
        size:         315
        opens:        1
fd 2
        logical name: default stderr
        offset:       0
        dups:         1
        open flags:   0x1
        not currently bound to a url.
working dir = /home/yara/sbagchi/tislam/condorExperiments/spec_429.mcf
Done saving file state
About to update MyImage
Adding a DATA segment: start[0x659000], end [0x694cd000]
Image::AddSegment: name=[DATA], start=[659000], end=[694cd000], length=[0x68e74000], prot=[0xffffffff00000000]
Adding a STACK segment: start[0x7fffbfa5d000], end [0x7fffbfa66fff]
Image::AddSegment: name=[STACK], start=[7fffbfa5d000], end=[7fffbfa66fff], length=[0x9fff], prot=[0x0]
Pos: 1759986720
Pos: 1760027679
Size of ckpt image = 1760027679 bytes
About to write checkpoint
Image::Write(): fd -1 file_name ./mcf.ckpt
Checkpoint name is "./mcf.ckpt"
Tmp name is "./mcf.ckpt.tmp"
Wrote headers OK
Wrote all SegMaps OK
write(fd=3,core_loc=0x659000,len=0x68e74000)
I wrote 745472 bytes with write...
I wrote -1 bytes with write...
in SegMap::Write(): fd = 3, write_size=1759240192
errno=14, core_loc=70f000
Write() Segment[0] of type DATA -> FAILED
errno = 14, nbytes = -1
Periodic Ckpt complete, doing a virtual restart...
About to restore file state
CondorFileTable::resume
working dir = /home/mcf

OPEN FILE TABLE:
fd 0
        logical name: default stdin
        offset:       0
        dups:         1
        open flags:   0x0
        not currently bound to a url.
fd 1
        logical name: default stdout
        offset:       315
        dups:         1
        open flags:   0x1
        not currently bound to a url.
fd 2
        logical name: default stderr
        offset:       0
        dups:         1
        open flags:   0x1
        not currently bound to a url.
Done restoring file state
About to restore signal state
About to return to user code

..............................................
The debug output clearly shows that an error occurred, so only mcf.ckpt.tmp is generated.
Any idea what errno=14 means? Could the size of the checkpoint be the reason?
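
As a starting point for both questions, here is a minimal Python sketch that looks up errno 14 on the local platform and cross-checks the numbers printed in the log. The constants are copied from the debug output above; the helper names (describe_errno, failed_offset) are made up for illustration and are not part of Condor.

```python
import errno
import os

# Values copied from the debug log above.
ERRNO_FROM_LOG = 14
DATA_START = 0x659000            # start of the DATA segment
DATA_LEN = 0x68e74000            # len passed to write()
BYTES_WRITTEN = 745472           # what the first write() returned
FAILED_CORE_LOC = 0x70f000       # core_loc reported next to errno=14
REMAINING_FROM_LOG = 1759240192  # write_size reported by SegMap::Write()
CKPT_IMAGE_SIZE = 1760027679     # "Size of ckpt image"


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def failed_offset(start, written):
    """Address at which the next write() call would have started."""
    return start + written


name, message = describe_errno(ERRNO_FROM_LOG)
print(f"errno {ERRNO_FROM_LOG}: {name} ({message})")

# The failing address should equal segment start + bytes already written.
print(hex(failed_offset(DATA_START, BYTES_WRITTEN)))

# The remaining size should equal segment length - bytes already written.
print(DATA_LEN - BYTES_WRITTEN)

# Is the image below the 2 GiB signed 32-bit limit?
print(CKPT_IMAGE_SIZE < 2**31)
```

On Linux, errno 14 is EFAULT ("Bad address"), which write() returns when the buffer it is given lies outside the accessible address space. The log's numbers are self-consistent: 0x659000 + 745472 = 0x70f000, and 0x68e74000 - 745472 = 1759240192, so the second write() failed on a buffer starting at 0x70f000. One possible reading, which is an assumption and not something the log confirms, is that part of the range recorded as the DATA segment is not readable memory. The image is about 1.64 GiB, below 2 GiB, so a signed 32-bit size overflow is not obviously the cause, though this check alone does not rule out size-related problems.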

--Tan