
Re: [Condor-users] Problem with Condor standalone library



Sure, here it is:

$ ./test.sh
User Job - $CondorPlatform: X86_64-LINUX_RHEL3 $
User Job - $CondorVersion: 7.0.0 Jan 22 2008 BuildID: 72173 $
Condor: Notice: Will checkpoint to ./helloWorld.ckpt
Condor: Notice: Remote system calls disabled.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Got SIGTSTP
Saved signal state.
About to save file state
CondorFileTable::checkpoint

OPEN FILE TABLE:
fd 0
        logical name: default stdin
        offset:       0
        dups:         1
        open flags:   0x0
        not currently bound to a url.
fd 1
        logical name: default stdout
        offset:       36
        dups:         1
        open flags:   0x1
        url:          fd:1
        size:         36
        opens:        1
fd 2
        logical name: default stderr
        offset:       0
        dups:         1
        open flags:   0x1
        not currently bound to a url.
working dir = /autohome/u102/tislam/helloWorld
Done saving file state
About to update MyImage
Adding a DATA segment: start[0xlx], end [0xlx]
Image::AddSegment: name=[DATA], start=[653000], end=[70b000], length=[0xlx], prot=[0xb8000]
Adding a STACK segment: start[0xlx], end [0xlx]
Image::AddSegment: name=[STACK], start=[7fbfff6000], end=[7fbfffffff], length=[0xlx], prot=[0x9fff]
Pos: 754720
Pos: 795679
Size of ckpt image = 795679 bytes
About to write checkpoint
Image::Write(): fd -1 file_name ./helloWorld.ckpt
Checkpoint name is "./helloWorld.ckpt"
Tmp name is "./helloWorld.ckpt.tmp"
Wrote headers OK
Wrote all SegMaps OK
write(fd=3,core_loc=0xlx,len=0xlx)
I wrote 753664 bytes with write...
Wrote Segment[0] of type DATA -> OK
write(fd=3,core_loc=0xlx,len=0xlx)
I wrote 40959 bytes with write...
Wrote Segment[1] of type STACK -> OK
Wrote all Segments OK
About to close ckpt fd (3)
Closed OK
About to rename "./helloWorld.ckpt.tmp" to "./helloWorld.ckpt"
Renamed OK
USER PROC: CHECKPOINT IMAGE SENT OK
Ckpt exit
User signal 2
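
The sizes in this output can be cross-checked with a little arithmetic. This is only a sketch: the addresses and byte counts are copied from the log above, the variable names are made up for illustration, and the header size is inferred by subtraction, not taken from any Condor documentation.

```shell
#!/bin/sh
# Segment boundaries copied from the Image::AddSegment lines.
data_len=$((0x70b000 - 0x653000))          # DATA:  end - start
stack_len=$((0x7fbfffffff - 0x7fbfff6000)) # STACK: end - start

# First "Pos:" line: file position after headers, SegMaps and the DATA segment.
pos_after_data=754720

# Assumption: whatever precedes the DATA bytes is header/SegMap overhead.
header_len=$((pos_after_data - data_len))

# Total image = overhead + DATA + STACK.
image_size=$((header_len + data_len + stack_len))

printf 'DATA length:  %d bytes\n' "$data_len"
printf 'STACK length: %d bytes\n' "$stack_len"
printf 'Header bytes: %d\n' "$header_len"
printf 'Image size:   %d bytes\n' "$image_size"
```

The DATA and STACK lengths agree with the two "I wrote ... bytes" lines, and the total agrees with the "Size of ckpt image" line. The DATA length, 0xb8000, is also the value the log prints in the prot=[...] field, and the literal "0xlx" looks like a mangled "0x%lx", which hints that the debug format string is consuming its arguments out of step. That is an inference from the numbers, not something confirmed here.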

---------------------------------
To restart, I ran:
./helloWorld -_condor_restart helloWorld.ckpt

It prints:
-------------------------------

$ ./helloWorld -_condor_restart helloWorld.ckpt
Condor: Notice: Will restart from helloWorld.ckpt
About to execute on TmpStk
About to execute on tmpstack.
Beginning Execution on TmpStack.
RestoreStack() Entrance!
Restoring a STACK segment
About to overwrite 40959 bytes starting at 0x7fbfff6000(STACK)
RestoreStack() Exit!
About to restore file state
CondorFileTable::resume
working dir = /autohome/u102/tislam/helloWorld

OPEN FILE TABLE:
fd 0
        logical name: default stdin
        offset:       0
        dups:         1
        open flags:   0x0
        not currently bound to a url.
fd 1
        logical name: default stdout
        offset:       36
        dups:         1
        open flags:   0x1
        not currently bound to a url.
fd 2
        logical name: default stderr
        offset:       0
        dups:         1
        open flags:   0x1
        not currently bound to a url.
Done restoring file state
About to restore signal state
About to return to user code
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
Got SIGTSTP
Saved signal state.
About to save file state
CondorFileTable::checkpoint

OPEN FILE TABLE:
fd 0
        logical name: default stdin
        offset:       0
        dups:         1
        open flags:   0x0
        not currently bound to a url.
fd 1
        logical name: default stdout
        offset:       93
        dups:         1
        open flags:   0x1
        url:          fd:1
        size:         93
        opens:        1
fd 2
        logical name: default stderr
        offset:       0
        dups:         1
        open flags:   0x1
        not currently bound to a url.
working dir =
Done saving file state
About to update MyImage
Adding a DATA segment: start[0xlx], end [0xlx]
Image::AddSegment: name=[DATA], start=[653000], end=[70b000], length=[0xlx], prot=[0xb8000]
Adding a STACK segment: start[0xlx], end [0xlx]
Image::AddSegment: name=[STACK], start=[7fbfff6000], end=[7fbfffffff], length=[0xlx], prot=[0x9fff]
Pos: 754720
Pos: 795679
Size of ckpt image = 795679 bytes
About to write checkpoint
Image::Write(): fd -1 file_name ./helloWorld.ckpt
Checkpoint name is "./helloWorld.ckpt"
Tmp name is "./helloWorld.ckpt.tmp"
Wrote headers OK
Wrote all SegMaps OK
write(fd=3,core_loc=0xlx,len=0xlx)
I wrote 753664 bytes with write...
Wrote Segment[0] of type DATA -> OK
write(fd=3,core_loc=0xlx,len=0xlx)
I wrote 40959 bytes with write...
Wrote Segment[1] of type STACK -> OK
Wrote all Segments OK
About to close ckpt fd (3)
Closed OK
About to rename "./helloWorld.ckpt.tmp" to "./helloWorld.ckpt"
Renamed OK
USER PROC: CHECKPOINT IMAGE SENT OK
Ckpt exit
User signal 2
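
The stdout offsets recorded in the file tables (36 at the first checkpoint, 93 at the second) are what one would expect if the restart resumed exactly where the first run stopped. A small sketch to check this, assuming helloWorld writes each number followed by a single newline and nothing else to stdout (the program source is not shown in this thread, and the function name is made up):

```shell
#!/bin/sh
# Count the bytes written when printing the integers $1..$2, one per line.
bytes_printed() {
    total=0
    i=$1
    while [ "$i" -le "$2" ]; do
        total=$((total + ${#i} + 1))   # digits of i plus one newline
        i=$((i + 1))
    done
    echo "$total"
}

# First run printed 1..15, the restarted run printed 16..34.
first_offset=$(bytes_printed 1 15)
second_offset=$((first_offset + $(bytes_printed 16 34)))

echo "stdout offset at first checkpoint:  $first_offset"
echo "stdout offset at second checkpoint: $second_offset"
```

The two values match the "offset:" fields for fd 1 in the first and last file tables, so the restored process carried its stdout position across the restart.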

----
So it looks like it is finally working. I have another question. When I run an application with the standalone checkpoint library,
does Condor execute it on the submitting machine? Or can Condor still send the application off to another machine, as it usually does when a job is submitted with the condor_submit command?

Thanks for all the help.


-- Tan



On Sat, Mar 22, 2008 at 8:58 PM, Daniel Forrest <forrest@xxxxxxxxxxxxx> wrote:
Hi Tan,

> For some reason, it is working now. I just wrote a shell script
> [test.sh] where I wrote:
>
> #!/bin/sh
> exec setarch i386 -R ./helloWorld -_condor_D_ALL
>
> Then, just running:
> ./test.sh
>
> Now pressing CTRL-Z takes a checkpoint, and then restarting with:
> ./helloWorld -_condor_restart helloWorld.ckpt
>
> restarts from the point where it left off.
>
> I have no explanation; perhaps one of you does. Thank you.

I would be interested in seeing how the debug output from this differs
from what you sent before.

--
Daniel K. Forrest       Laboratory for Molecular and
forrest@xxxxxxxxxxxxx   Computational Genomics
(608) 262 - 9479        University of Wisconsin, Madison



--
Tanzima Zerin Islam