
[Condor-users] Fwd: Segfault when resuming from checkpoint





Hi Rudolf,

I seem to be having a similar problem: jobs sometimes, but not always, get SIGSEGV when restarting from a checkpoint. This is an example:

4/4 19:25:03 (29.1) (18964):Requesting Primary Starter
4/4 19:25:03 (29.1) (18964):Shadow: Request to run a job was ACCEPTED
4/4 19:25:03 (29.1) (18964):Shadow: RSC_SOCK connected, fd = 17
4/4 19:25:03 (29.1) (18964):Shadow: CLIENT_LOG connected, fd = 18
4/4 19:25:03 (29.1) (18964):My_Filesystem_Domain = "rea.gfc.inifta.unlp.edu.ar"
4/4 19:25:03 (29.1) (18964):My_UID_Domain = "rea.gfc.inifta.unlp.edu.ar"
4/4 19:25:03 (29.1) (18964):    Entering pseudo_get_file_stream
4/4 19:25:03 (29.1) (18964):    file = "/home/condor/spool/cluster29.ickpt.subproc0"
4/4 19:25:05 (29.1) (18964):Reaped child status - pid 18965 exited with status 0
4/4 19:25:05 (29.1) (18964):Read: condor_restart:
4/4 19:25:05 (29.1) (18964):Read: Checkpoint file name is "/home/condor/spool/cluster29.proc1.subproc0"
4/4 19:25:05 (29.1) (18964):    Entering pseudo_get_file_stream
4/4 19:25:05 (29.1) (18964):    file = "/home/condor/spool/cluster29.proc1.subproc0"
4/4 19:25:05 (29.1) (18964):Read: Opened "/home/condor/spool/cluster29.proc1.subproc0" via file stream
4/4 19:25:05 (29.1) (18964):Read: Read headers OK
4/4 19:25:05 (29.1) (18964):Read: Read SegMap[0](DATA) OK
4/4 19:25:05 (29.1) (18964):Read: Read SegMap[1](STACK) OK
4/4 19:25:05 (29.1) (18964):Read: Read all SegMaps OK
4/4 19:25:05 (29.1) (18964):Read: Found a DATA block, increasing heap from 0x8618000 to 0x883d000
4/4 19:25:05 (29.1) (18964):Read: About to overwrite 7094272 bytes starting at 0x8179000(DATA)
4/4 19:25:05 (29.1) (18964):Reaped child status - pid 18966 exited with status 0
4/4 19:25:05 (29.1) (18964):Read: About to overwrite 28671 bytes starting at 0xbfff8000(STACK)
4/4 19:25:05 (29.1) (18964):Read: in Segmap::Read(): fd = 3, read_size=28671
4/4 19:25:05 (29.1) (18964):Shadow: Job 29.1 exited, termsig = 11, coredump = 0, retcode = 0
4/4 19:25:05 (29.1) (18964):Shadow: was killed by signal 11.
4/4 19:25:05 (29.1) (18964):user_time = 2 ticks
4/4 19:25:05 (29.1) (18964):sys_time = 95 ticks
4/4 19:25:05 (29.1) (18964):Static Policy: removing job because OnExitRemove has become true
4/4 19:25:05 (29.1) (18964):********** Shadow Exiting(102) **********
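To see how often this happens across many jobs, the "exited, termsig = ..." lines can be pulled out of the log automatically. The following is only a minimal sketch: the function name find_signal_exits is made up for illustration, and the regular expression is based solely on the line format shown in the excerpt above, so it may need adjusting for other Condor versions.

```python
import re

# Pattern derived from the excerpt above, e.g.
# "...:Shadow: Job 29.1 exited, termsig = 11, coredump = 0, retcode = 0"
# (assumption: other Condor versions may format this line differently)
EXIT_RE = re.compile(
    r"Shadow: Job (?P<job>\d+\.\d+) exited, "
    r"termsig = (?P<termsig>\d+), "
    r"coredump = (?P<coredump>\d+), "
    r"retcode = (?P<retcode>\d+)"
)


def find_signal_exits(lines):
    """Return (job_id, signal_number) for each job that was killed by a signal."""
    hits = []
    for line in lines:
        match = EXIT_RE.search(line)
        # termsig = 0 means the job exited normally, so skip those lines
        if match and int(match.group("termsig")) != 0:
            hits.append((match.group("job"), int(match.group("termsig"))))
    return hits
```

It accepts any iterable of lines, so an open log file can be passed in directly, for example find_signal_exits(open(path)) with path pointing at the log in question.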


I am running Debian Linux, kernel 2.6.19.1, gcc (GCC) 4.1.2,  $CondorVersion: 6.8.4 Feb  1 2007 $, $CondorPlatform: I386-LINUX_RHEL3 $.

Is anyone else having these problems? Do we need to go to an earlier version of gcc/libc?

Thanks,

Tomas


On 4/5/07, weeber@xxxxxxxxxxxxxxxxxxxx < weeber@xxxxxxxxxxxxxxxxxxxx> wrote:
Hi,
On Fri, Mar 30, 2007 at 04:43:59PM +0200, weeber@xxxxxxxxxxxxxxxxxxxx wrote:
> Hi,
> I have a problem with jobs that segfault when resuming from a checkpoint after they were evicted.
> As far as I can see from the ShadowLog, the last thing that happens is that the state of the "/dev/null" file handle is restored.
> That seems to mean that the segfault occurs before execution of the user code is resumed.
I did some more testing and found the following:
* The problem also occurs for stand-alone checkpointing. When I run the program with -_condor_restart and the image name, it sometimes(!) segfaults. If I try the same image again immediately, it works most of the time, but not always.
* It is independent of the specific program (I used a test program that only counts up an integer, and also one that just calls ckpt_and_exit() in an endless loop).
* The problem occurs on SUSE 10.0 with GCC 4.0 and on SUSE 10.2 with GCC 4.1.
* It does not occur on SUSE 9.x with GCC 3.4.
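
Since the crash is intermittent, it may help to restart the same image many times and count the failures, so that different systems can be compared by failure rate. This is only a rough sketch: count_segfaults is a made-up helper, and it assumes a POSIX system, where Python reports a child killed by signal N as return code -N. A run that is still alive after the timeout is counted as a successful restart.

```python
import signal
import subprocess


def count_segfaults(cmd, runs=20, timeout=10):
    """Run cmd `runs` times and return how many runs were killed by SIGSEGV."""
    crashes = 0
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # Still running after `timeout` seconds: the restart did not crash.
            # subprocess.run() kills the child before raising this exception.
            continue
        # On POSIX, a negative return code -N means the child died from signal N.
        if result.returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes
```

A call could look like count_segfaults(["./counter", "-_condor_restart", "counter.ckpt"], runs=50), where ./counter and counter.ckpt are placeholder names for the test program and its checkpoint image.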

Does anyone have any idea what to try next?

Thanks a lot in advance,
Rudolf



--
Tomas S. Grigera
INIFTA - Universidad Nacional de La Plata
c.c. 16, suc. 4, 1900 La Plata, ARGENTINA