
[HTCondor-users] starter segfaults



Hi,

I'm running some tests with Condor, and on many of the startd machines the following shows up in the StarterLog.slotX logs:

12/04/13 21:06:04 condor_read(): timeout reading 5 bytes from <128.142.153.183:48357>.
12/04/13 21:06:04 IO: Failed to read packet header
12/04/13 21:06:04 ERROR "Assertion ERROR on (result)" at line 163 in file /slots/12/dir_29027/userdir/src/condor_starter.V6.1/NTsenders.cpp
12/04/13 21:06:04 ShutdownFast all jobs.
Stack dump for process 15117 at timestamp 1386187564 (19 frames)
/usr/lib64/condor/libcondor_utils_8_1_2.so(dprintf_dump_stack+0x58)[0x2b369de216f8]
/usr/lib64/condor/libcondor_utils_8_1_2.so(_Z18linux_sig_coredumpi+0x4d)[0x2b369df5f8bd]
/lib64/libpthread.so.0[0x307620eca0]
/usr/lib64/condor/libcondor_utils_8_1_2.so(_ZN8DCShadow13updateJobInfoEPN14compat_classad7ClassAdEb+0x3f)[0x2b369df3242f]
condor_starter(_ZN9JICShadow12updateShadowEPN14compat_classad7ClassAdEb+0x98)[0x41f2d8]
condor_starter(_ZN9JICShadow11allJobsDoneEv+0x8b)[0x42104b]
condor_starter(_ZN8CStarter11allJobsDoneEv+0xa7)[0x42ce17]
condor_starter(_ZN8CStarter12ShutdownFastEv+0xb1)[0x42a501]
condor_starter(_ZN8CStarter18RemoteShutdownFastEi+0x3a)[0x4292ea]
condor_starter(exception_cleanup+0x42)[0x445d82]
/usr/lib64/condor/libcondor_utils_8_1_2.so(_EXCEPT_+0x121)[0x2b369dea7ee1]
condor_starter(REMOTE_CONDOR_get_job_info+0x133)[0x454de3]
condor_starter(_ZN9JICShadow18getJobAdFromShadowEv+0x39)[0x41e689]
condor_starter(_ZN9JICShadow4initEv+0x13)[0x4220c3]
condor_starter(_ZN8CStarter4InitEP19JobInfoCommunicatorPKcbiii+0x55c)[0x42b36c]
condor_starter(_Z9main_initiPPc+0x70)[0x446fc0]
/usr/lib64/condor/libcondor_utils_8_1_2.so(_Z7dc_mainiPPc+0x135f)[0x2b369df6292f]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x307521d9c4]
condor_starter[0x41dcb9]

In the StartLog:

12/04/13 21:06:04 Starter pid 15117 died on signal 11 (signal 11 (Segmentation fault))

I'm not sure, but it might be a bug (normally nothing should ever exit with a segfault).

The submit file is extremely simple: it runs the /bin/sleep executable with the argument "600".
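
For reference, a submit description of that kind would look roughly like the sketch below. This is not the exact file used in the test: only the executable and its argument come from the description above, while the default universe and the single queued job are assumptions.

```
# Sketch of a minimal submit description; not the exact file used in the test.
# Only the executable and its argument are taken from the description above.
executable = /bin/sleep
arguments  = 600

# Assumed: default (vanilla) universe and a single queued job.
queue
```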