Re: [Condor-users] Core dumps from startd

This problem has been fixed in 6.8.8 (already released) and 7.0.0 (to be released in a day or two). It is indeed caused by the use of integrity checking and/or encryption.
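
Until an upgrade is possible, one workaround to consider is relaxing the settings that exercise this code path. The fragment below is only a sketch. It assumes the settings live in a local configuration file, that encryption may also have been enabled at your site, and that the network is trusted enough to run without these checks for a while. Whether this fully avoids the crash was not confirmed in this thread.

```
# Local Condor configuration file (location is an assumption; adjust for your site)
# Temporary workaround sketch until 6.8.8 or 7.0.0 is installed.
# Previously: SEC_DEFAULT_INTEGRITY = REQUIRED
SEC_DEFAULT_INTEGRITY = NEVER
# Only relevant if encryption was enabled at your site (assumption):
SEC_DEFAULT_ENCRYPTION = NEVER
```

The change would need to be applied consistently across the pool, since a machine still set to REQUIRED will refuse to talk to one set to NEVER. The daemons then have to pick up the new values, for example via condor_reconfig or a restart.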

--Dan

Miskell, Craig wrote:
Hi,
I'm seeing core dumps from startd (once I made my log dirs
writable by root ;-)).  The stack traces out of gdb are all the same:
#0  0x0813c046 in WriteCoreDump ()
(gdb) bt
#0  0x0813c046 in WriteCoreDump ()
#1  0x0812d634 in linux_sig_coredump ()
#2  <signal handler called>
#3  0x081c35bd in _condorInMsg::peek ()
#4  0x081bbbc7 in SafeSock::peek ()
#5  0x081bc83c in SafeSock::isIncomingDataMD5ed ()
#6  0x08123c37 in DaemonCore::HandleReq ()
#7  0x0812392d in DaemonCore::HandleReq ()
#8  0x081233d1 in DaemonCore::Driver ()
#9  0x0812fcfe in main ()

The crashes are intermittent (per node), but a large job will trigger
a deluge of them, which has the side effect of hammering throughput
on our shorter-running jobs, as startd dies, restarts, and jobs hang
around in limbo.  (I don't think the scheduler does particularly well
out of this either, but that's speculation.)

Does anyone want the actual core files to debug from?  If so, I'm happy
to send them off list (300K - 1.5M zipped).
Meanwhile, are there any suggestions for avoiding whatever code path is
causing this?  The isIncomingDataMD5ed frame makes me wonder whether
having SEC_DEFAULT_INTEGRITY = REQUIRED
is the cause.  Turning that on was a "belt-and-braces" approach when I
first configured our compute farm, and our network is secure enough to
turn it off for a while (at least for testing).  Any comments from those
who know the code as to whether this is likely to help?

Thanks,

Craig Miskell,
Technical Support,
AgResearch Invermay
03 489-9279
"There are no problems that cannot be solved by the judicious use of
high explosives" -- British Commando quote, circa WWII.
