
[Condor-users] A Problem while restarting a checkpoint file

I have written a shell script that runs a helloWorld (C) program and then sends a kill -USR2 signal to that process. I used condor_compile to link my executable with Condor's checkpoint library.
The script also detects whether the process should restart from an existing ckpt file or start as a new application.
Both cases work fine, except that once the job has been restarted from a checkpoint, the next kill -USR2 signal from my script makes it terminate abnormally.
The debug output shows that the working dir is empty.
The debug output is as follows:

Test.sh: Sending checkpoint signal to process: 22037
Saved signal state.
About to save file state

fd 0
        logical name: default stdin
        offset:       0
        dups:         1
        open flags:   0x0
        not currently bound to a url.
fd 1
        logical name: default stdout
        offset:       820
        dups:         1
        open flags:   0x1
        url:          fd:1
        size:         820
        opens:        1
fd 2
        logical name: default stderr
        offset:       0
        dups:         1
        open flags:   0x1
        not currently bound to a url.
working dir =
Done saving file state
About to update MyImage
Adding a DATA segment: start[0xlx], end [0xlx]
Image::AddSegment: name=[DATA], start=[653000], end=[70b000], length=[0xlx], prot=[0xb8000]
Adding a STACK segment: start[0xlx], end [0xlx]
Image::AddSegment: name=[STACK], start=[7fbfff6000], end=[7fbfffffff], length=[0xlx], prot=[0x9fff]
Pos: 754720
Pos: 795679
Size of ckpt image = 795679 bytes
About to write checkpoint
Image::Write(): fd -1 file_name ./helloWorld.ckpt
Checkpoint name is "./helloWorld.ckpt"
Tmp name is "./helloWorld.ckpt.tmp"
Wrote headers OK
Wrote all SegMaps OK
I wrote 753664 bytes with write...
Wrote Segment[0] of type DATA -> OK
I wrote 40959 bytes with write...
Wrote Segment[1] of type STACK -> OK
Wrote all Segments OK
About to close ckpt fd (3)
Closed OK
About to rename "./helloWorld.ckpt.tmp" to "./helloWorld.ckpt"
Renamed OK
Periodic Ckpt complete, doing a virtual restart...
About to restore file state
working dir =
Condor: Error: Couldn't move to '��p' (No such file or directory).  Please fix it.

./job.sh: line 61: 22037 Killed                  ./helloWorld -_condor_restart helloWorld.ckpt

Note the "working dir =" line: why does it not show the working directory? I restarted the process with:
./helloWorld -_condor_restart helloWorld.ckpt
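
The wrapper logic described above can be sketched roughly as follows. This is only an illustration, not the actual job.sh: the names APP, CKPT, job_command and run_and_checkpoint are made up for the example, and the 5-second default delay before signalling is an arbitrary assumption. Only the -_condor_restart option and the USR2 signal come from the run shown here.

```shell
#!/bin/sh
# Sketch only: not the actual job.sh. APP, CKPT and both function names
# are illustrative assumptions.
APP=${APP:-./helloWorld}
CKPT=${CKPT:-helloWorld.ckpt}

# Print the command line to use: restart from the checkpoint file if one
# exists, otherwise start the application fresh.
job_command() {
    if [ -f "$CKPT" ]; then
        echo "$APP -_condor_restart $CKPT"
    else
        echo "$APP"
    fi
}

# Start the job in the background, wait a few seconds (first argument,
# default 5), then send the checkpoint signal and wait for the job.
run_and_checkpoint() {
    $(job_command) &
    pid=$!
    sleep "${1:-5}"
    echo "Sending checkpoint signal to process: $pid"
    kill -USR2 "$pid"
    wait "$pid"
}
```

Calling run_and_checkpoint once with no ckpt file present starts a fresh run; calling it again after helloWorld.ckpt has been written takes the restart branch, which is the case that fails on the second USR2.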

So the problem is: after a job has been restarted from its last checkpoint, it cannot be checkpointed again by sending USR2 or CTRL+Z.
Does anyone know of a remedy?

-- Tan