
Re: [Condor-users] A Problem while restarting a checkpoint file



Hi Tanzima,

From
http://www.cs.wisc.edu/condor/manual/v6.8/1_4Current_Limitations.html, see point 4 of the section "Limitations on Jobs which can Checkpointed":

4. Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. Condor reserves these signals for its own use. Sending or receiving all other signals /is/ allowed.
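As a rough illustration of that rule, a wrapper script could refuse to send the two reserved signals before calling kill. This is only a sketch: the function name is_reserved_signal is hypothetical and not part of Condor, and it accepts only upper-case names with or without the SIG prefix.

```shell
# Hypothetical helper (not part of Condor): succeeds if the given signal
# name is one of the two the manual says Condor reserves for its own use.
is_reserved_signal() {
    # Accept both "USR2" and "SIGUSR2" by stripping an optional SIG prefix.
    case "${1#SIG}" in
        USR2|TSTP) return 0 ;;
        *)         return 1 ;;
    esac
}
```

A script might then guard its own calls with something like: is_reserved_signal "$sig" || kill -s "$sig" "$pid"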

Cheers,
Mark

Tanzima Zerin Islam wrote:
Hi,
I have written a shell script that runs a helloworld (C) program and then sends a kill -USR2 signal to that process. I used condor_compile to link my executable with Condor's checkpoint library. The script also identifies whether the process should restart from an existing ckpt file or start as a new application. Both cases work fine until the job has been restarted: when my shell script then sends kill -USR2 again, the process terminates abnormally.
The debug output shows that working dir is null.
The debug output is as follows:

Test.sh: Sending checkpoint signal to process: 22037
Got SIGUSR2
Saved signal state.
About to save file state
CondorFileTable::checkpoint

OPEN FILE TABLE:
fd 0
        logical name: default stdin
        offset:       0
        dups:         1
        open flags:   0x0
        not currently bound to a url.
fd 1
        logical name: default stdout
        offset:       820
        dups:         1
        open flags:   0x1
        url:          fd:1
        size:         820
        opens:        1
fd 2
        logical name: default stderr
        offset:       0
        dups:         1
        open flags:   0x1
        not currently bound to a url.
working dir =
Done saving file state
About to update MyImage
Adding a DATA segment: start[0xlx], end [0xlx]
Image::AddSegment: name=[DATA], start=[653000], end=[70b000], length=[0xlx], prot=[0xb8000]
Adding a STACK segment: start[0xlx], end [0xlx]
Image::AddSegment: name=[STACK], start=[7fbfff6000], end=[7fbfffffff], length=[0xlx], prot=[0x9fff]
Pos: 754720
Pos: 795679
Size of ckpt image = 795679 bytes
About to write checkpoint
Image::Write(): fd -1 file_name ./helloWorld.ckpt
Checkpoint name is "./helloWorld.ckpt"
Tmp name is "./helloWorld.ckpt.tmp"
Wrote headers OK
Wrote all SegMaps OK
write(fd=3,core_loc=0xlx,len=0xlx)
I wrote 753664 bytes with write...
Wrote Segment[0] of type DATA -> OK
write(fd=3,core_loc=0xlx,len=0xlx)
I wrote 40959 bytes with write...
Wrote Segment[1] of type STACK -> OK
Wrote all Segments OK
About to close ckpt fd (3)
Closed OK
About to rename "./helloWorld.ckpt.tmp" to "./helloWorld.ckpt"
Renamed OK
USER PROC: CHECKPOINT IMAGE SENT OK
Periodic Ckpt complete, doing a virtual restart...
About to restore file state
CondorFileTable::resume
working dir =
Condor: Error: Couldn't move to '��p' (No such file or directory). Please fix it.

./job.sh: line 61: 22037 Killed ./helloWorld -_condor_restart helloWorld.ckpt


-----------------------
Now see the working dir line -- why does it not show the working directory? I have restarted the process as:
./helloWorld -_condor_restart helloWorld.ckpt
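The start-or-restart decision described above might look roughly like the following. This is a minimal sketch, not the original job.sh: the function name launch_command is hypothetical, and it only prints the command line rather than running it. The -_condor_restart argument is the one used in the restart command above.

```shell
# Hypothetical helper (not the original job.sh): print the command line
# used to launch the program, restarting from a checkpoint if one exists.
launch_command() {
    prog=$1
    ckpt=$2
    if [ -f "$ckpt" ]; then
        # A checkpoint file is present: resume from it.
        printf '%s -_condor_restart %s\n' "$prog" "$ckpt"
    else
        # No checkpoint yet: start the program fresh.
        printf '%s\n' "$prog"
    fi
}
```

For example, launch_command ./helloWorld helloWorld.ckpt prints the restart form only when helloWorld.ckpt exists in the current directory.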

So the problem is: after a job is restarted from its last checkpoint, it cannot be checkpointed again by sending the USR2 or CTRL+Z (SIGTSTP) signal.
Does anyone know of a remedy?

-- Tan

--
Cambridge eScience Centre, University of Cambridge
Centre for Mathematical Sciences, Wilberforce Road, Cambridge CB3 0WA
Tel. (+44/0) 1223 765317, Fax  (+44/0) 1223 765900
http://www.escience.cam.ac.uk/~mcal00