
Re: [Condor-users] ntdll.dll access_violation on windows xp sp2 + dataexecution prevention



Looking at the stack trace you gave below, it certainly looks like the kerberos library that is linked into the Condor daemons is to blame.

On Windows, Condor links with unmodified MIT krb5 version 1.4.3 libraries.  Perhaps someone at MIT could comment on why the Kerberos code is triggering DEP faults (we could point them at this thread).  Or perhaps Condor should link with a more recent version of Kerberos in the hope that this bug has already been squashed.
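
For anyone who wants to repeat that reading of the trace, here is a minimal Python sketch. It parses the call stack quoted further down and reports the first frame that lies outside the Windows system directory. The names (parse_frames, first_non_system_frame) are hypothetical helpers made up for this illustration, not part of Condor, and treating C:\WINDOWS\system32 as "system code" is an assumption. The first non-system frame is only a hint at where to look, not proof of the cause.

```python
import re
from collections import namedtuple

# Call stack copied verbatim from the core.MASTER.win32 dump in this thread.
CALL_STACK = r"""
7C93426D  00CDF374  0001:0003326D C:\WINDOWS\system32\ntdll.dll
77C2C3C9  00CDF3B4  0001:0001B3C9 C:\WINDOWS\system32\msvcrt.dll
77C2C3E7  00CDF3C0  0001:0001B3E7 C:\WINDOWS\system32\msvcrt.dll
77C2C42E  00CDF3D0  0001:0001B42E C:\WINDOWS\system32\msvcrt.dll
0034E510  00CDF3EC  0001:0000D510 C:\condor\bin\krb5_32.dll
00342FD2  00CDF420  0001:00001FD2 C:\condor\bin\krb5_32.dll
003824C7  00CDF638  0001:000414C7 C:\condor\bin\krb5_32.dll
00464C76  00CDF684  0001:00063C76 C:\condor\bin\condor_master.exe
0046440F  00CDF698  0001:0006340F C:\condor\bin\condor_master.exe
0045FDCC  00CDF70C  0001:0005EDCC C:\condor\bin\condor_master.exe
0045FB45  00CDF728  0001:0005EB45 C:\condor\bin\condor_master.exe
00458951  00CDF758  0001:00057951 C:\condor\bin\condor_master.exe
004589A7  00CDF9F8  0001:000579A7 C:\condor\bin\condor_master.exe
00439747  00CDFE58  0001:00038747 C:\condor\bin\condor_master.exe
"""

Frame = namedtuple("Frame", "address frame_ptr section offset module")

# Columns: Address, Frame, Logical addr (section:offset), Module path.
FRAME_RE = re.compile(
    r"^\s*([0-9A-Fa-f]{8})\s+([0-9A-Fa-f]{8})\s+"
    r"([0-9A-Fa-f]{4}):([0-9A-Fa-f]{8})\s+(\S.*?)\s*$"
)


def parse_frames(text):
    """Turn the 'Call stack' lines of the dump into Frame tuples."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            addr, fp, section, offset, module = match.groups()
            frames.append(Frame(int(addr, 16), int(fp, 16),
                                int(section, 16), int(offset, 16), module))
    return frames


def first_non_system_frame(frames, system_dir=r"C:\WINDOWS\system32"):
    """Return the innermost frame whose module is not under system_dir."""
    prefix = system_dir.lower() + "\\"
    for frame in frames:
        if not frame.module.lower().startswith(prefix):
            return frame
    return None


if __name__ == "__main__":
    culprit = first_non_system_frame(parse_frames(CALL_STACK))
    print("First non-system frame: %s at %08X"
          % (culprit.module, culprit.address))
```

Run against the dump below, this points at the C:\condor\bin\krb5_32.dll frame at 0034E510, which called into msvcrt.dll just before the fault in ntdll.dll.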

---
Todd Tannenbaum
University of Wisconsin-Madison
<-- Sent from a Palm Treo 680 phone -->

-----Original Message-----

From:  Rob de Graaf <rob@xxxxxxxxxxxxxxxxxx>
Subj:  [Condor-users] ntdll.dll access_violation on windows xp sp2 + dataexecution prevention
Date:  Fri May 11, 2007 12:30 pm
Size:  2K
To:  condor-users@xxxxxxxxxxx

Hello,

I've set up a test environment including a central manager running 
Slackware linux and an execute only node running Windows XP service pack 
2, both using condor 6.8.4. Both machines authenticate using kerberos 
and I have successfully run some test jobs.

However, on the Windows execute node, the condor daemons don't always 
start properly. Sometimes the condor_master will fail, sometimes the 
condor_startd will fail, sometimes they both fail. When they go down 
they leave a core file in the log directory. This one is from when the 
condor_master failed to start:

    core.MASTER.win32
    //=====================================================
    Exception code: C0000005 ACCESS_VIOLATION
    Fault address:  7C93426D 01:0003326D C:\WINDOWS\system32\ntdll.dll

    Registers:
    EAX:FFFFFFFF
    EBX:00000362
    ECX:000320F0
    EDX:00000000
    ESI:000303A8
    EDI:000305D8
    CS:EIP:001B:7C93426D
    SS:ESP:0023:00CDF154  EBP:00CDF374
    DS:0023  ES:0023  FS:003B  GS:0000
    Flags:00010246

    Call stack:
    Address   Frame     Logical addr  Module
    7C93426D  00CDF374  0001:0003326D C:\WINDOWS\system32\ntdll.dll
    77C2C3C9  00CDF3B4  0001:0001B3C9 C:\WINDOWS\system32\msvcrt.dll
    77C2C3E7  00CDF3C0  0001:0001B3E7 C:\WINDOWS\system32\msvcrt.dll
    77C2C42E  00CDF3D0  0001:0001B42E C:\WINDOWS\system32\msvcrt.dll
    0034E510  00CDF3EC  0001:0000D510 C:\condor\bin\krb5_32.dll
    00342FD2  00CDF420  0001:00001FD2 C:\condor\bin\krb5_32.dll
    003824C7  00CDF638  0001:000414C7 C:\condor\bin\krb5_32.dll
    00464C76  00CDF684  0001:00063C76 C:\condor\bin\condor_master.exe
    0046440F  00CDF698  0001:0006340F C:\condor\bin\condor_master.exe
    0045FDCC  00CDF70C  0001:0005EDCC C:\condor\bin\condor_master.exe
    0045FB45  00CDF728  0001:0005EB45 C:\condor\bin\condor_master.exe
    00458951  00CDF758  0001:00057951 C:\condor\bin\condor_master.exe
    004589A7  00CDF9F8  0001:000579A7 C:\condor\bin\condor_master.exe
    00439747  00CDFE58  0001:00038747 C:\condor\bin\condor_master.exe

The core file the condor_startd creates on failure has the same entries. 
The condor log files show nothing unusual.

After some digging I learned about the existence of something called 
Data Execution Prevention (system properties -> advanced -> performance 
settings -> data execution prevention) which was apparently added to 
Windows XP with service pack 2. Adding the condor daemons to the 
exception list appears to solve the problem, however I would prefer not 
to have to do that.
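
To see which DEP mode a machine is actually running in, a small Python sketch along these lines may help. It calls the Win32 function GetSystemDEPPolicy through ctypes. One caveat: Microsoft documents that function as available from Windows XP SP3 onwards, so on an SP2 machine it may be missing, in which case the sketch returns None. The helper names (describe_dep_policy, query_system_dep_policy) are hypothetical and chosen only for this illustration.

```python
import sys

# Values of the DEP_SYSTEM_POLICY_TYPE enumeration returned by
# GetSystemDEPPolicy.
DEP_POLICIES = {
    0: "AlwaysOff",  # DEP disabled for everything
    1: "AlwaysOn",   # DEP enabled for everything, no exceptions allowed
    2: "OptIn",      # DEP only for Windows system components and services
    3: "OptOut",     # DEP for all processes except those on the exception list
}


def describe_dep_policy(value):
    """Map a numeric policy value to its symbolic name."""
    return DEP_POLICIES.get(value, "unknown (%d)" % value)


def query_system_dep_policy():
    """Return the system DEP policy as an int, or None if unavailable."""
    if sys.platform != "win32":
        return None
    import ctypes
    try:
        # The attribute lookup fails if kernel32 does not export the function.
        get_policy = ctypes.windll.kernel32.GetSystemDEPPolicy
    except AttributeError:
        return None
    get_policy.restype = ctypes.c_int
    return get_policy()


if __name__ == "__main__":
    policy = query_system_dep_policy()
    if policy is None:
        print("DEP policy could not be queried on this system")
    else:
        print("System DEP policy: %s" % describe_dep_policy(policy))
```

The exception list described above is the one used in OptOut mode, where DEP applies to every process that is not explicitly listed.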

As far as I can tell, the problem only occurs when condor is configured 
to use kerberos authentication. The central manager is, for testing 
purposes, also the KDC, using MIT kerberos version krb5-1.6.1. The 
execute node runs a fully patched Windows XP SP2 and MIT kerberos for 
windows version kfw-3.2.0. Both use condor-6.8.4.

Has anyone encountered this problem? What could be triggering windows' 
data execution prevention? How can I avoid adding the condor daemons to 
--- message truncated ---