
Re: [Condor-users] core.MASTER.WIN32 and core.CRED.WIN32



The interesting bit seems to be this, which suggests a memory-corruption bug: the access violation happens inside HeapFree, called from the Regex destructor. Could you send the core dump for the CRED as well? Also, you say this is repeatable. Are the core dumps for MASTER and CRED always the same?

Address   Frame
77427F1A  01AFE2E0  RtlAnsiStringToUnicodeString+171
7742730A  01AFE3D8  RtlEnumerateGenericTableWithoutSplaying+548
77427545  01AFE3F4  RtlEnumerateGenericTableWithoutSplaying+783
770B9A26  01AFE408  HeapFree+14
1001E4E6  01AFE448  pcre_version+2396
0048473C  01AFE47C  Regex::~Regex (c:\condor\execute\dir_560\userdir\src\condor_c++_util\regex.cpp:72)
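
To check whether two of these files show the same crash, something like the following Python sketch might help. It assumes the files use the plain-text layout quoted in this thread (one "Address  Frame  symbol" line per stack frame). The function names are made up for illustration and are not part of Condor.

```python
import re

# One stack frame per line: 8-hex-digit address, 8-hex-digit frame, then the symbol.
FRAME_RE = re.compile(r"^([0-9A-Fa-f]{8})\s+([0-9A-Fa-f]{8})\s+(.+?)\s*$")


def parse_call_stack(text):
    """Return the symbol names from every stack-frame line, in order."""
    symbols = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue  # header, register dump, separators, etc.
        location = match.group(3)
        # Drop a trailing "(source-file:line)" annotation, if present.
        location = re.sub(r"\s+\(.*\)$", "", location)
        # Drop a trailing "+offset" so builds with small offset shifts still compare equal.
        location = re.sub(r"\+[0-9A-Fa-f]+$", "", location)
        symbols.append(location)
    return symbols


def same_top_frames(text_a, text_b, depth=6):
    """True if both dumps have frames and their first `depth` symbols match."""
    top_a = parse_call_stack(text_a)[:depth]
    top_b = parse_call_stack(text_b)[:depth]
    return bool(top_a) and top_a == top_b
```

Reading core.MASTER.WIN32 and core.CRED.WIN32 into strings and passing them to same_top_frames() would show whether both daemons die in the same place. Comparing only the top few frames is a deliberate choice, since the callers further down will differ between daemons.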

 

On 2/10/2011 4:52 PM, Michael O'Donnell wrote:

Here is what the core.MASTER.WIN32 file states. I do not know enough about interpreting these, but the other files seem similar.

thanks

//=====================================================
PID: 4052
Exception code: C0000005 ACCESS_VIOLATION
Fault address:  77427F1A 01:00066F1A C:\Windows\system32\ntdll.dll

Registers:
EAX:0126CCB8
EBX:01340000
ECX:00000000
EDX:00000000
ESI:0126CCB0
EDI:0127A100
CS:EIP:001B:77427F1A
SS:ESP:0023:01AFE2B8  EBP:01AFE2E0
DS:0023  ES:0023  FS:003B  GS:0000
Flags:00010246

Call stack:
Address   Frame
77427F1A  01AFE2E0  RtlAnsiStringToUnicodeString+171
7742730A  01AFE3D8  RtlEnumerateGenericTableWithoutSplaying+548
77427545  01AFE3F4  RtlEnumerateGenericTableWithoutSplaying+783
770B9A26  01AFE408  HeapFree+14
1001E4E6  01AFE448  pcre_version+2396
0048473C  01AFE47C  Regex::~Regex (c:\condor\execute\dir_560\userdir\src\condor_c++_util\regex.cpp:72)
0048F9E7  01AFE4B8  __ArrayUnwind (f:\dd\vctools\crt_bld\self_x86\crt\prebuild\eh\ehvecdtr.cpp:128)
0048FA86  01AFE4CC  `eh vector destructor iterator' (f:\dd\vctools\crt_bld\self_x86\crt\prebuild\eh\ehvecdtr.cpp:134)
004BB986  01AFE4D0  __NLG_Return2+0
004B0E60  01AFE4FC  _local_unwind4+80
004B0F2C  01AFE510  @_EH4_LocalUnwind@16+10
0049F227  01AFE544  _except_handler4+187
77425F79  01AFE568  RtlRaiseStatus+B4
77425F4B  01AFE930  RtlRaiseStatus+86
773C9C0F  01AFE954  WinSqmStartSession+490
773C4081  01AFE97C  RtlGetLengthWithoutTrailingPathSeperators+431
77425F79  01AFE9A0  RtlRaiseStatus+B4
77425F4B  01AFEA50  RtlRaiseStatus+86
77425DD7  01AFED78  KiUserExceptionDispatcher+F
7742730A  01AFEE70  RtlEnumerateGenericTableWithoutSplaying+548
77427545  01AFEE8C  RtlEnumerateGenericTableWithoutSplaying+783
770B9A26  01AFEEA0  HeapFree+14
1001E4E6  01AFEEE0  pcre_version+2396
0048473C  01AFEF14  Regex::~Regex (c:\condor\execute\dir_560\userdir\src\condor_c++_util\regex.cpp:72)
0048FA52  01AFEF48  `eh vector destructor iterator' (f:\dd\vctools\crt_bld\self_x86\crt\prebuild\eh\ehvecdtr.cpp:134)
00478523  01AFEF98  MapFile::CanonicalMapEntry::`vector deleting destructor'+20
00478B06  01AFF044  ExtArray<MapFile::CanonicalMapEntry>::operator[] (c:\condor\execute\dir_560\userdir\src\condor_c++_util\extarray.h:152)
0041FA9F  01AFF0E8  Authentication::map_authentication_name_to_canonical_name (c:\condor\execute\dir_560\userdir\src\condor_io\authentication.cpp:408)
004205AE  01AFF16C  Authentication::authenticate_inner (c:\condor\execute\dir_560\userdir\src\condor_io\authentication.cpp:358)
0042066D  01AFF190  Authentication::authenticate (c:\condor\execute\dir_560\userdir\src\condor_io\authentication.cpp:113)
0042069F  01AFF1B0  Authentication::authenticate (c:\condor\execute\dir_560\userdir\src\condor_io\authentication.cpp:86)
0041AEC8  01AFF1FC  ReliSock::perform_authenticate (c:\condor\execute\dir_560\userdir\src\condor_io\reli_sock.cpp:973)
0041AF5D  01AFF21C  ReliSock::authenticate (c:\condor\execute\dir_560\userdir\src\condor_io\reli_sock.cpp:1001)
00422F56  01AFF264  SecManStartCommand::authenticate_inner (c:\condor\execute\dir_560\userdir\src\condor_io\condor_secman.cpp:1797)
00426497  01AFF2C0  SecManStartCommand::startCommand_inner (c:\condor\execute\dir_560\userdir\src\condor_io\condor_secman.cpp:1164)
00422941  01AFF2E8  SecManStartCommand::startCommand (c:\condor\execute\dir_560\userdir\src\condor_io\condor_secman.cpp:1095)
00425638  01AFF32C  SecManStartCommand::DoTCPAuth_inner (c:\condor\execute\dir_560\userdir\src\condor_io\condor_secman.cpp:2114)
00425C3A  01AFF3F0  SecManStartCommand::sendAuthInfo_inner (c:\condor\execute\dir_560\userdir\src\condor_io\condor_secman.cpp:1375)
00498934  01AFF44C  vfprintf_helper (f:\dd\vctools\crt_bld\self_x86\crt\src\vfprintf.c:79)
00416EE4  01AFF478  _condor_dprintf_va (c:\condor\execute\dir_560\userdir\src\condor_util_lib\dprintf.c:385)
00413F27  01AFF50C  dprintf (c:\condor\execute\dir_560\userdir\src\condor_util_lib\dprintf_common.c:76)
00422941  01AFF534  SecManStartCommand::startCommand (c:\condor\execute\dir_560\userdir\src\condor_io\condor_secman.cpp:1095)
00423BAD  01AFF55C  SecMan::startCommand (c:\condor\execute\dir_560\userdir\src\condor_io\condor_secman.cpp:984)
0046A09F  01AFF5D8  Daemon::startCommand (c:\condor\execute\dir_560\userdir\src\condor_daemon_client\daemon.cpp:581)
0046BA1F  01AFF61C  Daemon::startCommand (c:\condor\execute\dir_560\userdir\src\condor_daemon_client\daemon.cpp:634)
0046BA55  01AFF65C  Daemon::startCommand (c:\condor\execute\dir_560\userdir\src\condor_daemon_client\daemon.cpp:643)
0047F29A  01AFF6B0  DCMessenger::sendBlockingMsg (c:\condor\execute\dir_560\userdir\src\condor_daemon_client\dc_message.cpp:352)
0046A9AE  01AFF6E0  Daemon::sendBlockingMsg (c:\condor\execute\dir_560\userdir\src\condor_daemon_client\daemon.cpp:2307)
0043B950  01AFF718  DaemonCore::Send_Signal (c:\condor\execute\dir_560\userdir\src\condor_daemon_core.v6\daemon_core.cpp:5367)
0043CB3A  01AFF748  DaemonCore::Send_Signal (c:\condor\execute\dir_560\userdir\src\condor_daemon_core.v6\daemon_core.cpp:5116)
0040520A  01AFF784  daemon::Kill (c:\condor\execute\dir_560\userdir\src\condor_master.v6\masterdaemon.cpp:1187)
00407970  01AFF7D0  daemon::Reconfig (c:\condor\execute\dir_560\userdir\src\condor_master.v6\masterdaemon.cpp:1226)
00438C74  01AFF9C8  DaemonCore::HandleReq (c:\condor\execute\dir_560\userdir\src\condor_daemon_core.v6\daemon_core.cpp:4894)
00438EAD  01AFF9D8  DaemonCore::HandleReq (c:\condor\execute\dir_560\userdir\src\condor_daemon_core.v6\daemon_core.cpp:3772)
00439067  01AFFA0C  DaemonCore::CallSocketHandler_worker (c:\condor\execute\dir_560\userdir\src\condor_daemon_core.v6\daemon_core.cpp:3468)
0043938E  01AFFA2C  DaemonCore::CallSocketHandler_worker_demarshall (c:\condor\execute\dir_560\userdir\src\condor_daemon_core.v6\daemon_core.cpp:3424)
00439663  01AFFA54  DaemonCore::CallSocketHandler (c:\condor\execute\dir_560\userdir\src\condor_daemon_core.v6\daemon_core.cpp:3412)
0043B664  01AFFAF0  DaemonCore::Driver (c:\condor\execute\dir_560\userdir\src\condor_daemon_core.v6\daemon_core.cpp:3325)
0048F1A1  01AFFAFC  free (f:\dd\vctools\crt_bld\self_x86\crt\src\free.c:110)

//=====================================================


- - - - - - - - - - - - - - - - - - - - - - - - - -
Michael O'Donnell
ADP Software Specialist, ASRC Management Services
USGS Fort Collins Science Center
2150 Centre Ave., Bldg C
Fort Collins, CO 80526

Phone: 970.226.9407
Fax: 970.226.9230
Email: odonnellm@xxxxxxxx




From: Ziliang Guo <ziliang@xxxxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 02/10/2011 03:01 PM
Subject: Re: [Condor-users] core.MASTER.WIN32 and core.CRED.WIN32
Sent by: condor-users-bounces@xxxxxxxxxxx





The file being generated should be a core dump file.  You should be able to look inside it to see where Condor is crashing, or send it our way for us to investigate.
 
Z
 
Condor Project

On Thu, Feb 10, 2011 at 2:23 PM, Michael O'Donnell <odonnellm@xxxxxxxx> wrote:

While trying to figure this out I am noticing a couple things. First, my cred service is dying on the central manager, which throws the core.CRED.WIN32 file. If I delete this file the service will generally restart, but sometimes I have to restart the Condor service to get the cred service to start again.


I am also noticing that on my submit machine a core.STARTD.WIN32 file is created and this might be related to why jobs are remaining in idle.


However, I do not know what any of this means. The load on the CM averages about 30%, with spikes as high as 70%. This seems a little high since we are not running any other services on the server. The collector is usually at about 25%, and the spikes are caused by the other Condor services (mainly the negotiator).


My Google searches for access violations in C:\Windows\system32\ntdll.dll and memory problems turn up plenty of results, but because they vary and because we were not having problems before, I am not making much progress figuring this out. It does seem like these files are related to jobs failing to match even though I know that machines are available.

thanks,

mike



From: "Michael O'Donnell" <odonnellm@xxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 02/09/2011 03:41 PM
Subject: [Condor-users] core.MASTER.WIN32 and core.CRED.WIN32
Sent by: condor-users-bounces@xxxxxxxxxxx








I have noticed on our central manager that two files are created. These files include:

core.MASTER.WIN32 and core.CRED.WIN32



The header content of the files includes:

PID: 660

Exception code: C0000005 ACCESS_VIOLATION

Fault address:  77427F1A 01:00066F1A C:\Windows\system32\ntdll.dll



If I delete the files they are re-created, and I do not recall seeing the files in the past. Does anyone know what this access violation is about? Could there be a problem with antivirus or something? Our pool is functioning, except that all jobs remain idle, which started after expanding our pool from 100 cores to 200 cores (posted earlier today: "[Condor-users] Job remains in idle (worked until I increased pool size)"). I don't think this is related, but I am trying to troubleshoot this.


Thank you for your help,


Mike
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to
condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting

https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:

https://lists.cs.wisc.edu/archive/condor-users/


