Re: [Condor-users] Standard universe woes (v7.4.2)



Hello,

On Tue, Jun 08, 2010 at 05:53:58PM +0200, Francesco.Prelz@xxxxxxxxxx wrote:
> I'm trying to help a local user (who happens to be my Director :-O)
> to run his physics code on the INFN Condor pool. We managed to coax the
> code (Fortran, C++, ROOT and other zoo) into a form suitable for
> condor_compiling that also efficiently produces the expected results. 
> 
> I'm hitting two problems, however:
> 
> 1) The __libc_start_main linked from 
>    /opt/condor/lib/libcondor_c.a(libc-start.o) (Condor version 7.4.2)
>    is apparently performing the 'default' glibc (DL_SYSDEP_OSCHECK) kernel
>    version check asking for kernels > 2.6.8 (carefully researched version)
>    at least according to the assembly code:
>    0x096244fc <__libc_start_main+252>:     cmp    $0x20608,%edx
>    Jobs get a fatal abort otherwise.
> 
>    From glibc documentation it seems that it is possible to build the library
>    so that it can run on older kernels. Unless there is some other good reason
>    preventing that, the glibc that's distributed with Condor for standard
>    universe jobs could benefit from being built like that.
>    
>    Not having the option to force people who generously donate to our pool
>    to upgrade their kernels, I am currently unable to find any easy workaround 
>    other than excluding 'old kernel' pool nodes as I find them: is there any 
>    trick to match on Linux kernel versions that doesn't require to tweak the 
>    startd configuration on all pool nodes ? 

I found the code in glibc which implements this and read the comment to
which you allude:

"This test is only performed if the library is not compiled to run on
all kernels."

Here is the option which allows one to adjust that feature:

---------------
--enable-kernel=version
    This option is currently only useful on GNU/Linux systems. The version
    parameter should have the form X.Y.Z and describes the smallest
    version of the Linux kernel the generated library is expected to
    support. The higher the version number is, the less compatibility
    code is added, and the faster the code gets.
---------------

In our own re-build of glibc, we set the kernel version to be 2.6.9 since
it seemed like a good version at the time (years ago). This explains
what you found when running it on earlier kernels.
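
For reference, the constant in the quoted disassembly can be read as a kernel version if one assumes X.Y.Z is packed as (X << 16) | (Y << 8) | Z. Under that assumption, a comparison against 0x20608 that only lets larger values through is consistent with a minimum of 2.6.9. A small Python sketch of the arithmetic (the function names are made up for illustration and do not come from glibc):

```python
def encode_kernel_version(version: str) -> int:
    """Pack 'X.Y.Z' into one integer as (X << 16) | (Y << 8) | Z."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major << 16) | (minor << 8) | patch


def decode_kernel_version(value: int) -> str:
    """Unpack an integer produced by encode_kernel_version back to 'X.Y.Z'."""
    return f"{value >> 16}.{(value >> 8) & 0xFF}.{value & 0xFF}"


if __name__ == "__main__":
    # The immediate operand from the quoted 'cmp $0x20608,%edx'.
    print(decode_kernel_version(0x20608))
    # A 2.6.9 minimum is the first version that compares greater than it.
    print(encode_kernel_version("2.6.9") > 0x20608)
```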

What revisions would you be interested in supporting? Modern kernel
versions are around 2.6.22 or so, with a smattering of older ones from
2.6.9 to roughly 2.6.18.

Sadly, Condor by default does not formally publish anything about the
kernel, distro, etc. of the machines it runs on. It is usually up to the
site admins to create a policy like that and disseminate it to the users.

There may be a _very nasty workaround_ where you use allow_startup_script
with your standard universe job: encode your executable as a shar archive,
then hand-edit the resulting script so that, when it runs, it checks the
kernel version and, if that is ok, extracts the executable and execs it
<- important to exec!

If the script doesn't like the kernel, it exits with a sentinel value that
you know the job can't exit with. You then use blackhole detection/avoidance
policies to auto-record and forbid running the job on that machine.
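
The decision logic of such a wrapper could be sketched roughly as follows. The real workaround would be the hand-edited shar shell script; Python is used here only to show the control flow. The sentinel value 213, the 2.6.9 minimum (taken from the glibc build mentioned above), and passing the extracted executable's path as the first argument are all assumptions for illustration.

```python
import os
import platform
import re
import sys

MIN_KERNEL = (2, 6, 9)   # assumed minimum, matching the glibc build above
SENTINEL_EXIT = 213      # hypothetical: a code the real job never returns


def parse_kernel_release(release: str) -> tuple:
    """Extract the leading X.Y[.Z] from a string like '2.6.18-194.el5'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(g) if g is not None else 0 for g in match.groups())


def kernel_ok(release: str, minimum: tuple = MIN_KERNEL) -> bool:
    """True if the running kernel is at least the required version."""
    return parse_kernel_release(release) >= minimum


def main(argv: list) -> None:
    # argv[1] is the already-extracted real executable (an assumption here).
    if not kernel_ok(platform.release()):
        sys.exit(SENTINEL_EXIT)      # tells the pool policy "bad machine"
    # exec replaces this process, so the job itself is what keeps running.
    os.execv(argv[1], argv[1:])


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Exec'ing rather than spawning a child matters because the wrapper process is replaced by the job itself instead of sitting in front of it.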

I won't be able to help set up such a workaround, if you actually deem
it to be necessary, since I'll be on vacation from June 18th to June
25th. Thinking of workarounds like this takes a toll on one's sanity. :)

Dan Bradley or Todd Tannenbaum would be good candidates for setting up
the (artificial) black hole avoidance policy.

> 2) In the same time-dishonored INFN pool I'm also hitting condor_starters that 
>    apparently do not set ADDR_NO_RANDOMIZE in the process personality, 
>    resulting in crashes while attempting to write the randomized virtual memory
>    'hole' to the checkpoint file. In which version(s) of Condor was the
>    ADDR_NO_RANDOMIZE flag added for standard universe jobs ? Knowing this
>    would help me in selecting friendly installations.

As far as I can determine from looking through a cvs -> git
transformation of our repository, and other major code history phase
changes, Condor 6.8 and later are known to use ADDR_NO_RANDOMIZE in
the starter. It was introduced in the 6.7.x series, but I can't find
a good record of exactly when that feature was added to the codebase.
You can use the machine attribute CondorVersion (and classad regexes if
your submit machine is a very recent version of Condor) to select the
versions you want to exclude or utilize.
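
As an illustration of the version test itself (not of the ClassAd syntax), here is a small Python sketch that pulls X.Y.Z out of a CondorVersion string and compares it against 6.8.0. The sample strings and the assumption that the attribute looks like "$CondorVersion: 7.4.2 ... $" are mine, so check what your machines actually advertise before relying on it.

```python
import re

ADDR_NO_RANDOMIZE_SINCE = (6, 8, 0)   # "6.8 and later", per the text above


def parse_condor_version(attr: str) -> tuple:
    """Return (major, minor, patch) from the first X.Y.Z found in attr."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", attr)
    if match is None:
        raise ValueError(f"no X.Y.Z version found in {attr!r}")
    return tuple(int(g) for g in match.groups())


def starter_sets_addr_no_randomize(attr: str) -> bool:
    """True if the advertised version is 6.8.0 or later."""
    return parse_condor_version(attr) >= ADDR_NO_RANDOMIZE_SINCE


if __name__ == "__main__":
    # Hypothetical sample values of the CondorVersion machine attribute.
    for sample in ("$CondorVersion: 7.4.2 Mar 29 2010 $",
                   "$CondorVersion: 6.7.20 Aug 1 2006 $"):
        print(sample, starter_sets_addr_no_randomize(sample))
```

Since the exact 6.7.x release is unknown, this treats all of 6.7.x as not having the flag, which is the conservative choice.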

Thank you.

-pete