
Re: [Condor-users] Does the standard universe really work on x86_64?



Hi Pete,

Many thanks for replying. My responses to your comments and questions are inlined below:

Peter Keller wrote:
Hello,

I'm guessing that this is a dynamic library incompatibility problem, but how do we get out of this pickle? What x86_64 linux distros are people successfully using with the standard universe? For what it's worth, we can get the standard universe to work on i386 platforms OK.

Debugging standard universe problems is pretty tough because the
checkpointing and remote i/o libraries live in the same vm space as the
user code itself. So, what we should do at this point is to have you check
your program with various memory checkers like valgrind (3.2.1 or later)
and whatnot to give it a clean bill of health. The standard universe
codebase really alters the memory usage patterns of applications and
has a tendency to flush out previously hidden bugs.

OK, I have no experience with valgrind but I'll have a go at it. An initial run with the image built using gcc v4.1.0 gives the following (though I'll try to come up with something more useful):

==29843== Use of uninitialised value of size 8
==29843==    at 0x4ADEEB: std::locale::operator=(std::locale const&) (in /home/Condor/mark/test/diffmc/diffmc-standard-x86_64-v4.1.0)
==29843==    by 0x4A6762: std::ios_base::_M_init() (in /home/Condor/mark/test/diffmc/diffmc-standard-x86_64-v4.1.0)
==29843==    by 0x4A4EFC: std::basic_ios<char, std::char_traits<char> >::init(std::basic_streambuf<char, std::char_traits<char> >*) (in /home/Condor/mark/test/diffmc/diffmc-standard-x86_64-v4.1.0)
==29843==    by 0x49B749: load_parameter_file(char*, unsigned&, unsigned short*, double&, double&, unsigned&, unsigned&, unsigned&, unsigned&, double&, double&, unsigned&, double&, double&, double&, Ensemble&, unsigned&, unsigned&) (fstream:445)
==29843==    by 0x49C616: main (main.cxx:103)
==29843==
==29843== Invalid read of size 4
==29843==    at 0x4ADEEB: std::locale::operator=(std::locale const&) (in /home/Condor/mark/test/diffmc/diffmc-standard-x86_64-v4.1.0)
==29843==    by 0x4A6762: std::ios_base::_M_init() (in /home/Condor/mark/test/diffmc/diffmc-standard-x86_64-v4.1.0)
==29843==    by 0x4A4EFC: std::basic_ios<char, std::char_traits<char> >::init(std::basic_streambuf<char, std::char_traits<char> >*) (in /home/Condor/mark/test/diffmc/diffmc-standard-x86_64-v4.1.0)
==29843==    by 0x49B749: load_parameter_file(char*, unsigned&, unsigned short*, double&, double&, unsigned&, unsigned&, unsigned&, unsigned&, double&, double&, unsigned&, double&, double&, double&, Ensemble&, unsigned&, unsigned&) (fstream:445)
==29843==    by 0x49C616: main (main.cxx:103)
==29843==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==29843==
==29843== Process terminating with default action of signal 11 (SIGSEGV)
==29843==  Access not within mapped region at address 0x0
==29843==    at 0x4ADEEB: std::locale::operator=(std::locale const&) (in /home/Condor/mark/test/diffmc/diffmc-standard-x86_64-v4.1.0)
==29843==    by 0x4A6762: std::ios_base::_M_init() (in /home/Condor/mark/test/diffmc/diffmc-standard-x86_64-v4.1.0)
==29843==    by 0x4A4EFC: std::basic_ios<char, std::char_traits<char> >::init(std::basic_streambuf<char, std::char_traits<char> >*) (in /home/Condor/mark/test/diffmc/diffmc-standard-x86_64-v4.1.0)
==29843==    by 0x49B749: load_parameter_file(char*, unsigned&, unsigned short*, double&, double&, unsigned&, unsigned&, unsigned&, unsigned&, double&, double&, unsigned&, double&, double&, double&, Ensemble&, unsigned&, unsigned&) (fstream:445)
==29843==    by 0x49C616: main (main.cxx:103)


In addition to the above, can you answer these few questions:

Is it threaded? If so by what method? (cooperative threading or NPTL with
pthreads?)

No

Are you using the STL?

No

When your application is compiled with -g, what is the full backtrace
of the error? Try it both with gcc 4.1.0 and gcc 3.4.6.

Here they are:

::::::::::::::
gcc v3.4.6
::::::::::::::
Program received signal SIGSEGV, Segmentation fault.
0x00000000004adb6b in std::locale::operator= ()
(gdb) bt
#0  0x00000000004adb6b in std::locale::operator= ()
#1  0x00000000004a63e3 in std::ios_base::_M_init ()
#2  0x00000000004a4b7d in std::basic_ios<char, std::char_traits<char> >::init ()
#3  0x000000000049b4eb in load_parameter_file (name=0x7fff22904507 "NVT_PARAM", N=@0x776dc4, seed16v=0x776e14, kT=@0x776de0, P=@0x776df0, count=@0x776dc0,
   total_steps=@0x776e10, equil_steps=@0x776e3c, save_steps=@0x776cec,
   max_disp=@0x776de8, max_squeeze=@0x776df8, press_ratio=@0x776ce8,
   Rc=@0x776e28, Kc=@0x776e20, kappa=@0x776e30, ensemble=@0x776e00,
   interaction_count=@0x776e38, ewald_in_use=@0x776db8)
   at /home/Condor/mark/lib/gcc/x86_64-unknown-linux-gnu/3.4.6/../../../../include/c++/3.4.6/fstream:526
#4  0x000000000049c73d in main (argc=3, argv=0x7fff22904158) at main.cxx:86

::::::::::::::
gcc v4.1.0
::::::::::::::
Program received signal SIGSEGV, Segmentation fault.
0x00000000004adeeb in std::locale::operator= ()
(gdb) bt
#0  0x00000000004adeeb in std::locale::operator= ()
#1  0x00000000004a6763 in std::ios_base::_M_init ()
#2  0x00000000004a4efd in std::basic_ios<char, std::char_traits<char> >::init ()
#3  0x000000000049b74a in load_parameter_file (name=0x7fff4611e507 "NVT_PARAM", N=@0x776d00, seed16v=0x776ec4, kT=@0x776e90, P=@0x776ee0, count=@0x776d04,
   total_steps=@0x776ef0, equil_steps=@0x776ef4, save_steps=@0x776de0,
   max_disp=@0x776e88, max_squeeze=@0x776ee8, press_ratio=@0x776de4,
   Rc=@0x776eb0, Kc=@0x776ea8, kappa=@0x776eb8, ensemble=@0x776ed8,
   interaction_count=@0x776ef8, ewald_in_use=@0x776d0c)
   at /usr/include/c++/4.1.0/fstream:445
#4  0x000000000049c617 in main (argc=3, argv=0x7fff4611cce8) at main.cxx:103

Although at first glance these seem to be referring to different lines in main.cxx, they actually refer to the same function call (one to the line where the call begins and the other to the line where it ends; the call spans many lines of text).

In both cases the error seems to be flagging the same line of the corresponding header file, fstream, which is in the constructor:

     explicit
     basic_ifstream(const char* __s, ios_base::openmode __mode = ios_base::in)
     : __istream_type(), _M_filebuf()
     {
       this->init(&_M_filebuf); // <- This line is being flagged
       this->open(__s, __mode);
     }

To reiterate, this code works fine with both compilers when not condor_compiled.

Thanks for your help. It's much appreciated!

Regards,
Mark