
[HTCondor-users] HTCondor 8.1.1 Sched daemon crashes when submitting a job....known issue?



Hi,

The HTCondor master is running Fedora 20, which ships with HTCondor 8.1.1.

As soon as I submit a job, the schedd daemon (condor_schedd) crashes (see below).
Is this a known bug, or even one that has already been fixed?
An email is also sent with the following content:

---------------------------------------------------
This is an automated email from the Condor system
on machine "condor.su.com".  Do not reply.

"/usr/sbin/condor_schedd" on "condor.su.com" died due to signal 6 (Aborted).
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file /var/log/condor/SchedLog:
/usr/sbin/../lib/libstdc++.so.6(+0x4a4f4)[0xb6dc84f4]
/usr/sbin/../lib/libstdc++.so.6(+0x4a530)[0xb6dc8530]
/usr/sbin/../lib/libstdc++.so.6(__cxa_rethrow+0x0)[0xb6dc87a0]
/usr/sbin/../lib/libstdc++.so.6(_ZSt19__throw_logic_errorPKc+0x8f)[0xb6e23fff]
/usr/sbin/../lib/libstdc++.so.6(_ZNSs12_S_constructIPKcEEPcT_S3_RKSaIcESt20forward_iterator_tag+0xea)[0xb6e3165a]
/usr/sbin/../lib/libstdc++.so.6(_ZNSsC1EPKcRKSaIcE+0x41)[0xb6e31c61]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN11DCCollector21getBlacklistTimesliceEv+0x48)[0xb768c218]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN11DCCollector13isBlacklistedEv+0x1c)[0xb768c4fc]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN13CollectorList5queryER11CondorQueryRN14compat_classad11ClassAdListEP11CondorError+0x1ed)[0xb76a1dad]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN6Daemon13getDaemonInfoE7AdTypesb+0x773)[0xb769e633]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN6Daemon6locateEv+0x2ca)[0xb769efca]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN6Daemon17hasUDPCommandPortEv+0x17)[0xb769ab17]
condor_schedd(_ZN9Scheduler14sendRescheduleEv+0x1d8)[0x80883d8]
condor_schedd(_ZN9Scheduler7timeoutEv+0x228)[0x80b6858]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN12TimerManager7TimeoutEPiPd+0x177)[0xb76cfe27]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN10DaemonCore6DriverEv+0x472)[0xb76c3952]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_Z7dc_mainiPPc+0x1779)[0xb76b11a9]
condor_schedd(main+0x58)[0x80691a8]
/usr/sbin/../lib/libc.so.6(__libc_start_main+0xf3)[0xb6b59b73]
---------------------------------------------------
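
The frames in the trace are mangled C++ symbols, which makes the call chain hard to read. As a reading aid, here is a minimal sketch of a demangling helper, assuming a GCC or Clang toolchain that provides <cxxabi.h>. The function name demangle is my own and is not part of HTCondor.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (GCC/Clang C++ ABI support)
#include <cstdlib>    // std::free
#include <string>

// Hypothetical helper: turn a mangled symbol taken from the trace into a
// readable name. If demangling fails, the input is returned unchanged.
std::string demangle(const char* mangled) {
    if (mangled == nullptr) {
        return std::string();  // never build a std::string from a null pointer
    }
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result with malloc.
    char* readable = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || readable == nullptr) {
        return std::string(mangled);
    }
    std::string result(readable);
    std::free(readable);
    return result;
}
```

For example, demangle("_ZN9Scheduler14sendRescheduleEv") gives Scheduler::sendReschedule(), and demangle("_ZN11DCCollector21getBlacklistTimesliceEv") gives DCCollector::getBlacklistTimeslice().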


The contents of the SchedLog at the time of the crash:
---------------------------------------------------
02/04/14 18:22:34 (pid:28810) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
02/04/14 18:22:34 (pid:28810) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
02/04/14 18:22:34 (pid:28810) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
02/04/14 18:22:35 (pid:28810) Failed to start non-blocking update to unknown.
02/04/14 18:22:35 (pid:28810) attempt to connect to <215.145.137.136:9618> failed: Connection refused (connect errno = 111).
02/04/14 18:22:35 (pid:28810) ERROR: SECMAN:2004:Failed to create security session to <215.145.137.136:9618> with TCP.|SECMAN:2003:TCP connection to <215.145.137.136:9618> failed.
02/04/14 18:22:35 (pid:28810) Failed to start non-blocking update to <215.145.137.136:9618>.
02/04/14 18:22:55 (pid:28810) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
02/04/14 18:22:55 (pid:28810) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
02/04/14 18:22:55 (pid:28810) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
02/04/14 18:22:55 (pid:28810) Failed to start non-blocking update to unknown.
02/04/14 18:22:55 (pid:28810) Sent ad to central manager for peter@xxxxxxxxxxxxx
02/04/14 18:22:55 (pid:28810) Sent ad to 1 collectors for peter@xxxxxxxxxxxxx
Stack dump for process 28810 at timestamp 1391505775 (27 frames)
/usr/sbin/../lib/libcondor_utils_8_1_1.so(dprintf_dump_stack+0x66)[0xb7576ce6]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(+0x17e106)[0xb7607106]
[0xb77cb400]
[0xb77cb424]
/usr/sbin/../lib/libc.so.6(gsignal+0x46)[0xb6b6eb96]
/usr/sbin/../lib/libc.so.6(abort+0x143)[0xb6b703d3]
/usr/sbin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x1a5)[0xb6dcaab5]
/usr/sbin/../lib/libstdc++.so.6(+0x4a4f4)[0xb6dc84f4]
/usr/sbin/../lib/libstdc++.so.6(+0x4a530)[0xb6dc8530]
/usr/sbin/../lib/libstdc++.so.6(__cxa_rethrow+0x0)[0xb6dc87a0]
/usr/sbin/../lib/libstdc++.so.6(_ZSt19__throw_logic_errorPKc+0x8f)[0xb6e23fff]
/usr/sbin/../lib/libstdc++.so.6(_ZNSs12_S_constructIPKcEEPcT_S3_RKSaIcESt20forward_iterator_tag+0xea)[0xb6e3165a]
/usr/sbin/../lib/libstdc++.so.6(_ZNSsC1EPKcRKSaIcE+0x41)[0xb6e31c61]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN11DCCollector21getBlacklistTimesliceEv+0x48)[0xb768c218]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN11DCCollector13isBlacklistedEv+0x1c)[0xb768c4fc]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN13CollectorList5queryER11CondorQueryRN14compat_classad11ClassAdListEP11CondorError+0x1ed)[0xb76a1dad]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN6Daemon13getDaemonInfoE7AdTypesb+0x773)[0xb769e633]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN6Daemon6locateEv+0x2ca)[0xb769efca]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN6Daemon17hasUDPCommandPortEv+0x17)[0xb769ab17]
condor_schedd(_ZN9Scheduler14sendRescheduleEv+0x1d8)[0x80883d8]
condor_schedd(_ZN9Scheduler7timeoutEv+0x228)[0x80b6858]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN12TimerManager7TimeoutEPiPd+0x177)[0xb76cfe27]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN10DaemonCore6DriverEv+0x472)[0xb76c3952]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_Z7dc_mainiPPc+0x1779)[0xb76b11a9]
condor_schedd(main+0x58)[0x80691a8]
/usr/sbin/../lib/libc.so.6(__libc_start_main+0xf3)[0xb6b59b73]
condor_schedd[0x80693d5]
02/04/14 18:23:06 (pid:6255) Setting maximum file descriptors to 4096.
02/04/14 18:23:06 (pid:6255) ******************************************************
02/04/14 18:23:06 (pid:6255) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
02/04/14 18:23:06 (pid:6255) ** /usr/sbin/condor_schedd
02/04/14 18:23:06 (pid:6255) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
02/04/14 18:23:06 (pid:6255) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
02/04/14 18:23:06 (pid:6255) ** $CondorVersion: 8.1.1 Oct 25 2013 BuildID: RH-8.1.1-0.3.fc20 $
02/04/14 18:23:06 (pid:6255) ** $CondorPlatform: I686-Fedora_20 $
02/04/14 18:23:06 (pid:6255) ** PID = 6255
02/04/14 18:23:06 (pid:6255) ** Log last touched 2/4 18:22:55
02/04/14 18:23:06 (pid:6255) ******************************************************
---------------------------------------------------
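
My reading of the trace, which may be wrong: the frames _ZNSsC1EPKcRKSaIcE and _ZNSs12_S_constructIPKcEEPcT_S3_RKSaIcESt20forward_iterator_tag are the std::string constructor taking a const char*, called from DCCollector::getBlacklistTimeslice(), and it ends in std::__throw_logic_error. libstdc++ raises that error when the pointer passed in is null, so my guess is that some lookup returned NULL there. I have not looked at the HTCondor source, so the sketch below is not HTCondor code. It only illustrates the kind of guard that avoids this class of abort, and lookup_setting and string_or_default are hypothetical names.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a lookup that returns NULL when the key is
// missing (an assumption about the failure, not taken from HTCondor).
const char* lookup_setting(const std::map<std::string, std::string>& settings,
                           const std::string& key) {
    auto it = settings.find(key);
    return it == settings.end() ? nullptr : it->second.c_str();
}

// Build a std::string only from a non-null pointer; otherwise use a fallback.
// Constructing std::string directly from a null const char* is what makes
// libstdc++ throw std::logic_error.
std::string string_or_default(const char* value, const std::string& fallback) {
    return value != nullptr ? std::string(value) : fallback;
}
```

With this guard, a missing setting yields the fallback string instead of an uncaught exception that ends in abort().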

R.L.