[Condor-users] condor_collector died (11) or exited (4)



Hi All
 
Two of our Central Managers have recently started restarting
condor_collector periodically (every few hours?) after the daemon either
dies with signal 11 or exits with status 4. All six of our CMs are running
condor-7.2.3. Five run as VMs on ESX servers and one is a physical
Dell PowerEdge 750. Three of the VMs run 64-bit SLES10 and two run
32-bit RHES4; the physical machine also runs 32-bit RHES4.
All RH and SLES machines were cloned from the original setups (with
Condor already installed).
 
Only two of the machines have this problem: one VM running 64-bit SLES10
and one VM running 32-bit RHES4. We have restarted Condor on these
machines and rebooted the machines themselves, all to no avail.
 
Extracts from logs follow:
 
This is from the SLES10 machine, which died with signal 11:
 
7/10 11:21:16 StartdAd     : Inserting ** "< CLW-FZ6JY1S-GW.nexus.csiro.au , 144.110.17.30 >"
7/10 11:21:16 StartdPvtAd  : Inserting ** "< CLW-FZ6JY1S-GW.nexus.csiro.au , 144.110.17.30 >"
7/10 11:21:32 Got INVALIDATE_STARTD_ADS
7/10 11:21:32           **** Removing stale ad: "< 210087-NT.nexus.csiro.au , 130.155.34.194 >"
7/10 11:21:32           **** Removing stale ad: "< PORTER-BE.nexus.csiro.au , 152.83.192.199 >"
7/10 11:21:32 (Invalidated 1 ads)
Stack dump for process 1862 at timestamp 1247196092 (17 frames)
condor_collector(dprintf_dump_stack+0xb3)[0x5096e7]
condor_collector(_Z18linux_sig_coredumpi+0x28)[0x4ff36c]
/lib64/libc.so.6[0x2b95518c8e20]
condor_collector(_ZNK9HashTableI10YourStringP12AttrListElemE6lookupERKS0_RS2_+0x18)[0x570f30]
condor_collector(_ZNK8AttrList6LookupEPKc+0x7b)[0x56d60d]
condor_collector(_ZNK8AttrList13LookupIntegerEPKcRi+0x21)[0x56db31]
condor_collector(_ZN15CollectorEngine14cleanHashTableER9HashTableI13AdNameHashKeyP7ClassAdElPFbRS1_S3_P11sockaddr_inE+0x67)[0x4e424d]
condor_collector(_ZN15CollectorEngine17invokeHousekeeperE7AdTypes+0xf2)[0x4e1a22]
condor_collector(_ZN15CollectorDaemon20process_invalidationE7AdTypesR7ClassAdP6Stream+0x7c)[0x4cf940]
condor_collector(_ZN15CollectorDaemon20receive_invalidationEP7ServiceiP6Stream+0x32e)[0x4cf0c2]
condor_collector(_ZN10DaemonCore9HandleReqEP6Stream+0x36db)[0x4f2f1d]
condor_collector(_ZN10DaemonCore9HandleReqEi+0x36)[0x4ef840]
condor_collector(_ZN10DaemonCore17CallSocketHandlerERib+0x2b3)[0x4ef2e9]
condor_collector(_ZN10DaemonCore6DriverEv+0x1463)[0x4eef21]
condor_collector(main+0x183f)[0x501d2f]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x2b95518b6164]
condor_collector(__strtoll_internal+0x5a)[0x4b716a]
7/10 11:21:42 ******************************************************
7/10 11:21:42 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
7/10 11:21:42 ** /usr/local/condor/sbin/condor_collector
7/10 11:21:42 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(3) class=DAEMON(1)
7/10 11:21:42 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON
7/10 11:21:42 ** $CondorVersion: 7.2.3 May 11 2009 BuildID: 151729 $
7/10 11:21:42 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
7/10 11:21:42 ** PID = 22874
7/10 11:21:42 ** Log last touched 7/10 11:21:32
7/10 11:21:42 ******************************************************
7/10 11:21:42 Using config source: /home/condor/condor_config
7/10 11:21:42 Using local config sources:
7/10 11:21:42    /home/condor/condor_config.local
 
 
This is from the RHES4 machine, which died with signal 11:
 
7/10 13:28:30 (Sending 876 ads in response to query)
7/10 13:28:31 Got QUERY_STARTD_PVT_ADS
7/10 13:28:31 (Sending 435 ads in response to query)
7/10 13:28:37 NegotiatorAd  : Inserting ** "< condor-nsw.riverside.csiro.au >"
7/10 13:28:51 StartdAd     : Inserting ** "< MILFORD-LN.tip.csiro.au , 192.168.0.1 >"
Stack dump for process 3209 at timestamp 1247196531 (17 frames)
condor_collector(dprintf_dump_stack+0xda)[0x81314b7]
condor_collector(_Z18linux_sig_coredumpi+0x23)[0x81275e3]
/lib/tls/libc.so.6[0x9c5918]
condor_collector(_ZNK8AttrList6LookupEPKc+0x70)[0x8190446]
condor_collector(_ZNK8AttrList13LookupIntegerEPKcRi+0x14)[0x819093a]
condor_collector(_ZN14CollectorStats6updateEPKcP7ClassAdS3_+0xa5)[0x8107073]
condor_collector(_ZN15CollectorEngine13updateClassAdER9HashTableI13AdNameHashKeyP7ClassAdEPKcS7_S3_RS1_RK8MyStringRiPK11sockaddr_in+0x19b)[0x810db0f]
condor_collector(_ZN15CollectorEngine7collectEiP7ClassAdP11sockaddr_inRiP4Sock+0x396)[0x810c8e4]
condor_collector(_ZN15CollectorEngine7collectEiP4SockP11sockaddr_inRi+0x13d)[0x810c115]
condor_collector(_ZN15CollectorDaemon14receive_updateEP7ServiceiP6Stream+0x79)[0x80f9a99]
condor_collector(_ZN10DaemonCore9HandleReqEP6Stream+0x37f7)[0x811bcb9]
condor_collector(_ZN10DaemonCore9HandleReqEi+0x2d)[0x81184bd]
condor_collector(_ZN10DaemonCore17CallSocketHandlerERib+0x280)[0x8117f5c]
condor_collector(_ZN10DaemonCore6DriverEv+0x1352)[0x8117bce]
condor_collector(main+0x1829)[0x812a03d]
/lib/tls/libc.so.6(__libc_start_main+0xd3)[0x9b2df3]
condor_collector(ldexp+0x59)[0x80e4db1]
7/10 13:29:02 ******************************************************
7/10 13:29:02 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
7/10 13:29:02 ** /usr/local/condor/sbin/condor_collector
7/10 13:29:02 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(3) class=DAEMON(1)
7/10 13:29:02 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON
7/10 13:29:02 ** $CondorVersion: 7.2.3 May 11 2009 BuildID: 151729 $
7/10 13:29:02 ** $CondorPlatform: I386-LINUX_RHEL3 $
7/10 13:29:02 ** PID = 10448
7/10 13:29:02 ** Log last touched 7/10 13:28:51
7/10 13:29:02 ******************************************************
7/10 13:29:02 Using config source: /home/condor/condor_config
7/10 13:29:02 Using local config sources:
7/10 13:29:02    /home/condor/condor_config.local

When the collector "exits" rather than "dies", both machines log a line like the one below (exit status 4):

7/9 16:36:35 ERROR "Assertion ERROR on (hash)" at line 1073 in file attrlist.cpp
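
To see which ad was being handled right before each failure, the collector log could be scanned with something like the following. This is only a minimal sketch in Python, not part of Condor: it assumes the log format shown in the extracts above ("M/D HH:MM:SS" prefixes, a "Stack dump for process" header, and the 'ERROR "Assertion ERROR' text), and the function name lines_before_crashes is hypothetical.

```python
import re

# Matches the "M/D HH:MM:SS " prefix seen in the log extracts above.
TIMESTAMP = re.compile(r"^\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2} ")


def lines_before_crashes(log_lines):
    """Return (previous_timestamped_line, crash_line) pairs.

    A crash line is either a "Stack dump for process" header or a
    timestamped line containing 'ERROR "Assertion ERROR'. The previous
    line is None when the crash line is the first line seen.
    Wrapped continuation lines (no timestamp) are ignored.
    """
    results = []
    previous = None
    for raw in log_lines:
        line = raw.rstrip("\n")
        if line.startswith("Stack dump for process"):
            results.append((previous, line))
        elif TIMESTAMP.match(line):
            if 'ERROR "Assertion ERROR' in line:
                results.append((previous, line))
            previous = line
    return results
```

Passing an open file object for the collector log to lines_before_crashes and printing each pair should show whether the same machine's ad precedes every crash, which would point at a malformed ad rather than at the two Central Managers themselves.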

As mentioned, this is only happening on 2 of the 6 servers, all of which
"should" be identical.
 
Thanks for any help
 
Cheers
 
Greg