Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] A Problem while restarting a checkpoint file

Date: Fri, 28 Mar 2008 06:54:23 +0000
From: Mark Calleja <M.Calleja@xxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] A Problem while restarting a checkpoint file

Hi Tanzima,

From http://www.cs.wisc.edu/condor/manual/v6.8/1_4Current_Limitations.html, see point 4 of the section "Limitations on Jobs which can Checkpointed":

4. Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. Condor reserves these signals for its own use. Sending or receiving all other signals /is/ allowed.
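
As a rough illustration of that rule, a wrapper script could check a signal name against the two reserved ones before using it for its own purposes. This is only a sketch: the helper name is_reserved_signal is made up for this example and is not part of Condor.

```shell
#!/bin/sh
# Hypothetical helper (not a Condor tool): exit status 0 if the named
# signal is one of the two the manual lists as reserved (SIGUSR2, SIGTSTP).
is_reserved_signal() {
    # Accept both "USR2" and "SIGUSR2" spellings by stripping a leading "SIG".
    case "${1#SIG}" in
        USR2|TSTP) return 0 ;;
        *)         return 1 ;;
    esac
}

# Report which of a few candidate signals a job could use for itself.
for sig in USR1 SIGUSR2 TSTP TERM; do
    if is_reserved_signal "$sig"; then
        echo "$sig: reserved by Condor"
    else
        echo "$sig: available to the job"
    fi
done
```

The list of reserved names comes only from the manual text quoted above; other Condor versions may differ.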


Cheers,
Mark

Tanzima Zerin Islam wrote:

Hi,
I have written a shell script that runs a helloworld (C) program and then sends a kill -USR2 signal to that process. I have used condor_compile to link my executable with Condor's checkpoint library. The shell script also has a way to identify whether the process needs to restart from an existing ckpt file or is a new application. Both ways work fine until, after the job is restarted, my shell script again sends a kill -USR2 signal; the process then terminates abnormally.
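
The restart-or-start-fresh decision described above might look roughly like the sketch below. The function name build_command is hypothetical and the original job.sh is not shown in this message, so this is an assumption about its shape, not a copy of it. The binary name, checkpoint file name and -_condor_restart flag are taken from the log and command further down.

```shell
#!/bin/sh
# Hypothetical helper: print the command line to use for the job.
# If the checkpoint file exists, restart from it; otherwise start fresh.
build_command() {
    ckpt_file=$1
    if [ -f "$ckpt_file" ]; then
        printf '%s\n' "./helloWorld -_condor_restart $ckpt_file"
    else
        printf '%s\n' "./helloWorld"
    fi
}

# Show which command would be chosen for the checkpoint name used in this thread.
build_command helloWorld.ckpt
```

The sketch only prints the command; a real script would run it in the background and record its PID before sending any signal.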
The debug output shows that working dir is null.
The debug output is as follows:

Test.sh: Sending checkpoint signal to process: 22037
Got SIGUSR2
Saved signal state.
About to save file state
CondorFileTable::checkpoint

OPEN FILE TABLE:
fd 0
        logical name: default stdin
        offset:       0
        dups:         1
        open flags:   0x0
        not currently bound to a url.
fd 1
        logical name: default stdout
        offset:       820
        dups:         1
        open flags:   0x1
        url:          fd:1
        size:         820
        opens:        1
fd 2
        logical name: default stderr
        offset:       0
        dups:         1
        open flags:   0x1
        not currently bound to a url.
working dir =
Done saving file state
About to update MyImage
Adding a DATA segment: start[0xlx], end [0xlx]
Image::AddSegment: name=[DATA], start=[653000], end=[70b000],length=[0xlx], prot=[0xb8000]
Adding a STACK segment: start[0xlx], end [0xlx]
Image::AddSegment: name=[STACK], start=[7fbfff6000], end=[7fbfffffff],length=[0xlx], prot=[0x9fff]
Pos: 754720
Pos: 795679
Size of ckpt image = 795679 bytes
About to write checkpoint
Image::Write(): fd -1 file_name ./helloWorld.ckpt
Checkpoint name is "./helloWorld.ckpt"
Tmp name is "./helloWorld.ckpt.tmp"
Wrote headers OK
Wrote all SegMaps OK
write(fd=3,core_loc=0xlx,len=0xlx)
I wrote 753664 bytes with write...
Wrote Segment[0] of type DATA -> OK
write(fd=3,core_loc=0xlx,len=0xlx)
I wrote 40959 bytes with write...
Wrote Segment[1] of type STACK -> OK
Wrote all Segments OK
About to close ckpt fd (3)
Closed OK
About to rename "./helloWorld.ckpt.tmp" to "./helloWorld.ckpt"
Renamed OK
USER PROC: CHECKPOINT IMAGE SENT OK
Periodic Ckpt complete, doing a virtual restart...
About to restore file state
CondorFileTable::resume
working dir =
Condor: Error: Couldn't move to '��p' (No such file or directory).Please fix it.
./job.sh: line 61: 22037 Killed ./helloWorld -_condor_restart helloWorld.ckpt
-----------------------
Now see the working dir line -- why does it not show the working directory? I have restarted the process as:
./helloWorld -_condor_restart helloWorld.ckpt
So the problem is: after a job is restarted from its last checkpoint, it cannot be checkpointed again by sending the USR2 or CTRL+Z signal.
Does anyone know any remedy?

-- Tan


--
Cambridge eScience Centre, University of Cambridge
Centre for Mathematical Sciences, Wilberforce Road, Cambridge CB3 0WA
Tel. (+44/0) 1223 765317, Fax  (+44/0) 1223 765900
http://www.escience.cam.ac.uk/~mcal00

Follow-Ups:
- Re: [Condor-users] A Problem while restarting a checkpoint file
  - From: Daniel Forrest

References:
- [Condor-users] A Problem while restarting a checkpoint file
  - From: Tanzima Zerin Islam
