
Re: [HTCondor-users] Fedora 20 / Condor 8.1.1 : condor_schedd crashes upon condor_submit.



Meanwhile I have found more on this issue:

There is a Fedora configuration issue in /etc/tmpfiles.d/condor.conf, which is caused by the default setting of LOCAL_DISK_LOCK_DIR in the condor config:

/etc/tmpfiles.d/condor.conf:

#################
d /var/run/condor 0775 condor condor -
d /var/lock/condor 0775 condor condor -
#d /var/lock/condor/local 0775 condor condor -
# https://bugzilla.redhat.com/show_bug.cgi?id=741653
# https://bugzilla.redhat.com/show_bug.cgi?id=1029365
# Locks are written by the daemon condor_shadow,
# which runs NOT as user 'condor' but as the user
# who submitted the job. The lock directory is determined by
# the config variable LOCAL_DISK_LOCK_DIR, which defaults
# to /var/lock/condor/local.
d /var/lock/condor/local 1775 condor condor -
#################
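
To see whether the lock directory is actually usable by the submitting user, something like the following could be run as that user. This is only a rough sketch: the function name describe_lock_dir is made up for illustration, and the path is the default named in the comments above, so it has to be adjusted if LOCAL_DISK_LOCK_DIR is set to something else.

```python
import os
import stat


def describe_lock_dir(path):
    """Report mode bits and writability of `path` for the current user.

    Hypothetical helper for illustration; it is not part of HTCondor.
    """
    try:
        st = os.stat(path)
    except OSError as exc:
        return {"path": path, "exists": False, "error": exc.strerror}
    return {
        "path": path,
        "exists": True,
        "is_dir": stat.S_ISDIR(st.st_mode),
        # Permission bits including setuid/setgid/sticky, e.g. '0o1775'
        "mode": oct(stat.S_IMODE(st.st_mode)),
        "sticky": bool(st.st_mode & stat.S_ISVTX),
        # Creating entries in a directory needs both write and search permission
        "writable_by_me": os.access(path, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    # Assumed path: the default mentioned in the tmpfiles.d comments above.
    print(describe_lock_dir("/var/lock/condor/local"))
```

If "writable_by_me" comes back False for the user who submits jobs, that would match the "Permission denied" message in the SchedLog.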

Isn't it a problem that the 'hard-coded' system file here depends on a variable configuration setting in HTCondor?


I've noticed another problem:
HTCondor falls back to /tmp/condorLocks when /var/lock/condor/local is not available. I think /tmp/condorLocks is not cleaned up properly, as it still contains files and directories even though no jobs are running or queued anymore.
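
A quick way to see what is left behind in the fallback directory once the queue is empty is sketched below. The helper name list_leftovers is invented for this example, and it only lists entries; it does not decide whether any of them are still in use.

```python
import os


def list_leftovers(root):
    """Return sorted paths (relative to `root`) of everything below it.

    Hypothetical helper for illustration; returns an empty list if
    `root` does not exist.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, root))
    return sorted(found)


if __name__ == "__main__":
    # Fallback directory named in the message above.
    for entry in list_leftovers("/tmp/condorLocks"):
        print(entry)
```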

R.


On Sunday, January 26, 2014 10:26 PM, Stub <spamrefuse@xxxxxxxxx> wrote:
Hi,

I recently upgraded from Fedora 18 to 20, which also comes with a newer version of condor: 8.1.1

I am aware of a problem with Fedora and Condor 8.x.x (https://bugzilla.redhat.com/show_bug.cgi?id=1000106), for which I set the following in the condor config:

USE_CLONE_TO_CREATE_PROCESSES = False


Each time I submit a job, condor_schedd crashes; see below.

After that, the job appears in the queue and is later scheduled as usual.

Has this issue been solved in a later release of condor?
I haven't seen anything related in the release notes of 8.1.2 and 8.1.2...
Or is this a Fedora package error?

Regards,
Rob.

===========
Contents of /var/log/condor/SchedLog, at the moment of the crash:

01/26/14 22:12:44 (pid:7805) directory_util::rec_touch_file: Directory /var/lock/condor/local//19 cannot be created (Permission denied) 
01/26/14 22:12:44 (pid:7805) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
01/26/14 22:12:44 (pid:7805) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
01/26/14 22:12:44 (pid:7805) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
01/26/14 22:12:44 (pid:7805) Failed to start non-blocking update to unknown.
01/26/14 22:12:44 (pid:7805) Sent ad to central manager for lahaye@xxxxxxxxxxxxxxx
01/26/14 22:12:44 (pid:7805) Sent ad to 1 collectors for lahaye@xxxxxxxxxxxxxxx
Stack dump for process 7805 at timestamp 1390741964 (27 frames)
/usr/sbin/../lib/libcondor_utils_8_1_1.so(dprintf_dump_stack+0x66)[0xb74d2ce6]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(+0x17e106)[0xb7563106]
[0xb7727400]
[0xb7727424]
/usr/sbin/../lib/libc.so.6(gsignal+0x46)[0xb6acaba6]
/usr/sbin/../lib/libc.so.6(abort+0x143)[0xb6acc3e3]
/usr/sbin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x1a5)[0xb6d26ab5]
/usr/sbin/../lib/libstdc++.so.6(+0x4a4f4)[0xb6d244f4]
/usr/sbin/../lib/libstdc++.so.6(+0x4a530)[0xb6d24530]
/usr/sbin/../lib/libstdc++.so.6(__cxa_rethrow+0x0)[0xb6d247a0]
/usr/sbin/../lib/libstdc++.so.6(_ZSt19__throw_logic_errorPKc+0x8f)[0xb6d7ffff]
/usr/sbin/../lib/libstdc++.so.6(_ZNSs12_S_constructIPKcEEPcT_S3_RKSaIcESt20forward_iterator_tag+0xea)[0xb6d8d65a]
/usr/sbin/../lib/libstdc++.so.6(_ZNSsC1EPKcRKSaIcE+0x41)[0xb6d8dc61]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN11DCCollector21getBlacklistTimesliceEv+0x48)[0xb75e8218]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN11DCCollector13isBlacklistedEv+0x1c)[0xb75e84fc]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN13CollectorList5queryER11CondorQueryRN14compat_classad11ClassAdListEP11CondorError+0x1ed)[0xb75fddad]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN6Daemon13getDaemonInfoE7AdTypesb+0x773)[0xb75fa633]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN6Daemon6locateEv+0x2ca)[0xb75fafca]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN6Daemon17hasUDPCommandPortEv+0x17)[0xb75f6b17]
condor_schedd(_ZN9Scheduler14sendRescheduleEv+0x1d8)[0x80883d8]
condor_schedd(_ZN9Scheduler7timeoutEv+0x228)[0x80b6858]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN12TimerManager7TimeoutEPiPd+0x177)[0xb762be27]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_ZN10DaemonCore6DriverEv+0x472)[0xb761f952]
/usr/sbin/../lib/libcondor_utils_8_1_1.so(_Z7dc_mainiPPc+0x1779)[0xb760d1a9]
condor_schedd(main+0x58)[0x80691a8]
/usr/sbin/../lib/libc.so.6(__libc_start_main+0xf3)[0xb6ab5b83]
condor_schedd[0x80693d5]