
Re: [Condor-users] Problem restarting from standalone checkpointed programs.



Tomas Oppelstrup wrote:

> Hi,
> I have a problem restarting programs with standalone checkpointing.
> I hope this may be the correct forum for this type of question,
> otherwise I apologize and wish to be pointed to an appropriate
> place.
> 
> I have a workstation running 32-bit RedHat Enterprise Linux 5.
> (e.g.
> 
>      sh-3.2$ uname -a
>      Linux sinclaire.llnl.gov 2.6.18-128.el5 #1 SMP \
>      Wed Dec 17 11:42:39EST 2008 i686 i686 i386 GNU/Linux
> )

You are running a newer kernel, so...

> === RUNNING PROGRAM ===
> 
> sh-3.2$ setarch i686 -R ./cotest1 -_condor_D_ALL

This is the problem.  With this kernel you need "setarch i686 -R -L ..." instead.

The root problem is that on newer kernels, system calls go through a
page called the VDSO, and its location is randomized as well, which
"-R" alone does not prevent.

The checkpoint records the VDSO at one address, but when the program
is restarted the kernel maps the VDSO somewhere else, so the restored
program fails on its first system call after the checkpoint is
restored.

Specifying the "-L" option puts the VDSO at a standard location.

The drawback is that this limits the amount of memory that can be
malloc'd to around 1GB: the VDSO is mapped at 0x40000000-0x40001000,
which caps the top of the heap.  So it isn't the greatest solution.

-- 
Dan