
Re: [Condor-users] [Condor] Problem fcdfcaf1709.fnal.gov: condor_startd died (11)



The core dump occurs while Condor is doing a fast shutdown, just before the node reboots.
In the StarterLogs I see the following:


-----Original Message-----
From: Steven C Timm 
Sent: Wednesday, May 02, 2012 8:55 PM
To: 'condor-users@xxxxxxxxxxx'
Subject: FW: [Condor] Problem fcdfcaf1709.fnal.gov: condor_startd died (11)

This cluster was upgraded to Condor 7.6.6 just yesterday.  condor_startd also crashed occasionally before the 7.4 -> 7.6 upgrade, but never with a crash dump that looked anything like this.  Any ideas?

Steve Timm

05/03/12 08:11:24 DaemonCore: pid 30476 exited with status 9, invoking reaper 1 <Reaper>
05/03/12 08:11:24 Process exited, pid=30476, signal=9
05/03/12 08:11:24 passwd_cache::cache_uid(): getpwnam("minosgli") failed: user not found
05/03/12 08:11:24 passwd_cache: initgroups() failed! errno=Operation not permitted
05/03/12 08:11:24 passwd_cache: num_groups( minosgli ) returned 0
Stack dump for process 30475 at timestamp 1336050684 (29 frames)
condor_starter(dprintf_dump_stack+0x56)[0x5d3a56]
condor_starter(_Z18linux_sig_coredumpi+0x4d)[0x51b9ad]
/lib64/libpthread.so.0[0x3a4180ebe0]
/lib64/libc.so.6(gsignal+0x35)[0x3a40c30265]
/lib64/libc.so.6(abort+0x110)[0x3a40c31d10]
/lib64/libc.so.6[0x3a40c690eb]
/lib64/libc.so.6[0x3a40c70baf]
/lib64/libc.so.6(cfree+0x4b)[0x3a40c7100b]
condor_starter(_ZN12passwd_cache12cache_groupsEPKc+0xba)[0x6031ba]
condor_starter(_ZN12passwd_cache12lookup_groupEPKcRP11group_entry+0xa4)[0x6033b4]
condor_starter(_ZN12passwd_cache10num_groupsEPKc+0x26)[0x603556]
condor_starter(_ZN12passwd_cache11init_groupsEPKcj+0x30)[0x6035e0]
condor_starter(_set_priv+0x210)[0x610ff0]
condor_starter(_condor_dprintf_va+0x33d)[0x5d558d]
condor_starter(dprintf+0x86)[0x5d5766]
condor_starter(_ZN12passwd_cache11init_groupsEPKcj+0xee)[0x60369e]
condor_starter(_set_priv+0x210)[0x610ff0]
condor_starter(_ZN6OsProc14renameCoreFileEPKcS1_+0xbd)[0x4edb6d]
condor_starter(_ZN6OsProc13checkCoreFileEv+0x173)[0x4ef3d3]
condor_starter(_ZN6OsProc9JobReaperEii+0x66)[0x4ef496]
condor_starter(_ZN11VanillaProc9JobReaperEii+0x46)[0x4f5e26]
condor_starter(_ZN8CStarter6ReaperEii+0xc8)[0x4d9718]
condor_starter(_ZN10DaemonCore10CallReaperEiPKcii+0x11a)[0x502f0a]
condor_starter(_ZN10DaemonCore17HandleProcessExitEii+0x1a7)[0x515757]
condor_starter(_ZN10DaemonCore24HandleDC_SERVICEWAITPIDSEi+0x2e)[0x5158be]
condor_starter(_ZN10DaemonCore6DriverEv+0x1ec)[0x50e49c]
condor_starter(main+0xe47)[0x51e127]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3a40c1d994]
condor_starter(__gxx_personality_v0+0x401)[0x4cf339]
---------------------------------------------------------------------------------
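In the starter stack dump above, the init_groups and _set_priv frames each appear twice, which may hint at re-entrancy (dprintf called from inside init_groups appears to lead back into _set_priv). A minimal sketch for spotting such repeats in a dump, assuming each frame has the module(symbol+offset)[address] form shown above; the names parse_frame and repeated_symbols are hypothetical helpers, not part of Condor:

```python
import re
from collections import Counter

# Matches frames such as: condor_starter(_set_priv+0x210)[0x610ff0]
# and frames with no symbol: /lib64/libc.so.6[0x3a40c690eb]
FRAME_RE = re.compile(
    r"^(?P<module>[^\s(\[]+)"
    r"(?:\((?P<symbol>[^+)]*)(?:\+(?P<offset>0x[0-9a-fA-F]+))?\))?"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)

def parse_frame(line):
    """Split one stack-dump line into module, symbol, offset and address.

    Returns None for lines that are not stack frames (e.g. timestamps).
    """
    m = FRAME_RE.match(line.strip())
    if m is None:
        return None
    return m.groupdict()

def repeated_symbols(lines):
    """Return the sorted list of symbols that occur in more than one frame."""
    frames = [f for f in map(parse_frame, lines) if f is not None]
    counts = Counter(f["symbol"] for f in frames if f["symbol"])
    return sorted(sym for sym, n in counts.items() if n > 1)

# A few frames copied from the dump above, used as sample input.
sample = [
    "condor_starter(_ZN12passwd_cache11init_groupsEPKcj+0x30)[0x6035e0]",
    "condor_starter(_set_priv+0x210)[0x610ff0]",
    "condor_starter(dprintf+0x86)[0x5d5766]",
    "condor_starter(_ZN12passwd_cache11init_groupsEPKcj+0xee)[0x60369e]",
    "condor_starter(_set_priv+0x210)[0x610ff0]",
    "/lib64/libc.so.6[0x3a40c690eb]",
]
print(repeated_symbols(sample))
```

A repeated symbol is only a hint; ordinary recursion produces the same pattern, so the surrounding frames still need to be read by hand.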

Any ideas?

It appears that this core dump happens every time we shut down Condor with a fast shutdown and reboot a node.

-----Original Message-----
From: root e-mail messages from fermigrid [mailto:FERMIGRID-ROOT@xxxxxxxxxxxxxxxxx] On Behalf Of postmaster@xxxxxxxxxxxxxxxxxxxx
Sent: Wednesday, May 02, 2012 8:14 PM
To: FERMIGRID-ROOT@xxxxxxxxxxxxxxxxx
Subject: [Condor] Problem fcdfcaf1709.fnal.gov: condor_startd died (11)

This is an automated email from the Condor system on machine "fcdfcaf1709.fnal.gov".  Do not reply.

"/usr/sbin/condor_startd" on "fcdfcaf1709.fnal.gov" died due to signal 11 (Segmentation fault).
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file /var/log/condor/StartLog:
condor_startd(_ZN7CronJob15StartJobProcessEv+0x157)[0x5c74c7]
condor_startd(_ZN17CondorCronJobList17StartOnDemandJobsEv+0x4b)[0x5c8b4b]
condor_startd(_ZN10CronJobMgr17StartOnDemandJobsEv+0xd)[0x5c93cd]
condor_startd(_ZN17StartdBenchJobMgr15StartBenchmarksEP8ResourceRi+0x6e)[0x4f35be]
condor_startd(_ZN14MachAttributes16start_benchmarksEP8ResourceRi+0xc8)[0x4ce5c8]
condor_startd(_ZN8ResState4evalEv+0x482)[0x4d7352]
condor_startd(_ZN8ResState12enter_actionE5State8Activitybb+0x281)[0x4d6c21]
condor_startd(_ZN8ResState6changeE5State8Activity+0xe7)[0x4d62e7]
condor_startd(_ZN8Resource22leave_preempting_stateEv+0xf8)[0x4d8a28]
condor_startd(_ZN8ResState12enter_actionE5State8Activitybb+0x3b6)[0x4d6d56]
condor_startd(_ZN8ResState6changeE5State8Activity+0xe7)[0x4d62e7]
condor_startd(_ZN8ResState4evalEv+0x3a7)[0x4d7277]
condor_startd(_Z6reaperP7Serviceii+0x69)[0x4f56a9]
condor_startd(_ZN10DaemonCore10CallReaperEiPKcii+0xa7)[0x500c97]
condor_startd(_ZN10DaemonCore17HandleProcessExitEii+0x1a7)[0x513557]
condor_startd(_ZN10DaemonCore24HandleDC_SERVICEWAITPIDSEi+0x2e)[0x5136be]
condor_startd(_ZN10DaemonCore6DriverEv+0x1ec)[0x50c29c]
condor_startd(main+0xe47)[0x51bff7]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3edac1d994]
condor_startd(__gxx_personality_v0+0x3f9)[0x4cb339]
*** End of file StartLog



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: fermigrid-root@xxxxxxxx
The Official Condor Homepage is http://www.cs.wisc.edu/condor