
[HTCondor-users] Accounting STATE_FILE not synced between CMs



Hi guys,

We have two central managers running 8.7.2-1 on SL6 bare-metal machines [1]. The pool is configured as a multi-negotiator setup (2 negotiators run on the primary and fall back to the secondary if the primary goes down). Accordingly, we have two HAD daemons and two REPLICATION daemons.

We noticed that the TransferLog on the primary central manager complains about connections being refused when talking to the secondary (snippet from the transferer log [2]). On the secondary side, we just see the transferer daemon start up and then crash, leaving a core dump file (snippet from the transferer log [3]).

From the replication logs, on the secondary central manager, I see that the replication daemon is running the following command:
# /usr/libexec/condor/condor_transferer -f down <131.225.152.22:41450?addrs=131.225.152.22-41450> /var/lib/condor/spool/Version 1 /var/lib/condor/spool/Accountantnew.log

If I try to run this manually, I get an error [4]. We are running the standard configuration for the TRANSFERER_* knobs. We do not see this on the central managers of our other pool. One difference is that the other pool runs with just one negotiator.

Since the transferer daemon is crashing, the accounting information on the two machines is out of sync. I am not sure whether this even matters now, given that the accounting ClassAds are available in the collector too.

Do you know what might cause these crashes? If you need more details (log files/configuration dumps/core dumps) please let me know.

Best regards,
Farrukh

[1]
# condor_version
$CondorVersion: 8.7.2 Jun 21 2017 BuildID: 408717 $
$CondorPlatform: x86_64_RedHat6 $

[2]
09/12/17 10:28:55 Daemon Log is logging: D_ALWAYS D_ERROR D_COMMAND
09/12/17 10:28:55 Daemoncore: Listening at <0.0.0.0:28035> on TCP (ReliSock) and UDP (SafeSock).
09/12/17 10:28:55 DaemonCore: command socket at <131.225.152.22:28035?addrs=131.225.152.22-28035>
09/12/17 10:28:55 DaemonCore: private command socket at <131.225.152.22:28035?addrs=131.225.152.22-28035>
09/12/17 10:28:55 BaseReplicaTransferer::reinitialize started
09/12/17 10:28:55 attempt to connect to <131.225.152.24:16032> failed: Connection refused (connect errno = 111).
09/12/17 10:28:55 UploadReplicaTransferer::initialize cannot connect to <131.225.152.24:16032>
09/12/17 10:28:55 **** condor_transferer (condor_TRANSFERER) pid 1917761 EXITING WITH STATUS 1

[3]
09/12/17 17:39:48 DaemonCore: command socket at <131.225.152.24:27473?addrs=131.225.152.24-27473>
09/12/17 17:39:48 DaemonCore: private command socket at <131.225.152.24:27473?addrs=131.225.152.24-27473>
09/12/17 17:39:48 Daemoncore: Listening at <0.0.0.0:28591> on TCP (ReliSock) and UDP (SafeSock).
09/12/17 17:39:48 DaemonCore: command socket at <131.225.152.24:28591?addrs=131.225.152.24-28591>
09/12/17 17:39:48 DaemonCore: private command socket at <131.225.152.24:28591?addrs=131.225.152.24-28591>
09/12/17 17:39:48 BaseReplicaTransferer::reinitialize started
09/12/17 17:39:48 DownloadReplicaTransferer::transferFileCommand to <131.225.152.22:41451?addrs=131.225.152.22-41451> started
09/12/17 17:39:48 BaseReplicaTransferer::reinitialize started
09/12/17 17:39:48 DownloadReplicaTransferer::transferFileCommand to <131.225.152.22:41450?addrs=131.225.152.22-41450> started
09/12/17 17:39:48 DownloadReplicaTransferer::transferFileCommand sinful string <131.225.152.24:29262> coded successfully
09/12/17 17:39:48 DownloadReplicaTransferer::transferFileCommand sinful string <131.225.152.24:13853> coded successfully
Stack dump for process 1885408 at timestamp 1505256138 (9 frames)
/usr/lib64/libcondor_utils_8_7_2.so(dprintf_dump_stack+0x12d)[0x348ad794dd]
/usr/lib64/libcondor_utils_8_7_2.so(_Z18linux_sig_coredumpi+0x40)[0x348af1d930]
/lib64/libpthread.so.0[0x3c3ce0f7e0]
/usr/libexec/condor/condor_transferer(_ZN25DownloadReplicaTransferer19transferFileCommandEv+0x2dc)[0x404adc]
/usr/libexec/condor/condor_transferer(_ZN25DownloadReplicaTransferer10initializeEv+0x11)[0x404b81]
/usr/libexec/condor/condor_transferer(_Z9main_initiPPc+0x1a2)[0x405dc2]
/usr/lib64/libcondor_utils_8_7_2.so(_Z7dc_mainiPPc+0x17d5)[0x348af1fb05]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3c3ca1ed1d]
/usr/libexec/condor/condor_transferer[0x403cf9]
Stack dump for process 1885409 at timestamp 1505256138 (9 frames)
/usr/lib64/libcondor_utils_8_7_2.so(dprintf_dump_stack+0x12d)[0x348ad794dd]
/usr/lib64/libcondor_utils_8_7_2.so(_Z18linux_sig_coredumpi+0x40)[0x348af1d930]
/lib64/libpthread.so.0[0x3c3ce0f7e0]
/usr/libexec/condor/condor_transferer(_ZN25DownloadReplicaTransferer19transferFileCommandEv+0x2dc)[0x404adc]
/usr/libexec/condor/condor_transferer(_ZN25DownloadReplicaTransferer10initializeEv+0x11)[0x404b81]
/usr/libexec/condor/condor_transferer(_Z9main_initiPPc+0x1a2)[0x405dc2]
/usr/lib64/libcondor_utils_8_7_2.so(_Z7dc_mainiPPc+0x17d5)[0x348af1fb05]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3c3ca1ed1d]
/usr/libexec/condor/condor_transferer[0x403cf9]

[4]
# /usr/libexec/condor/condor_transferer -f down <131.225.152.22:41450?addrs=131.225.152.22-41450> /var/lib/condor/spool/Version 1 /var/lib/condor/spool/Accountantnew.log
-bash: 131.225.152.22:41450?addrs=131.225.152.22-41450: No such file or directory
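
A note on [4]: the manual run most likely never reached condor_transferer. The error message comes from bash, which treats the unquoted < and > around the address as redirection operators and tries to open the address as an input file. Below is a minimal sketch of the quoting, assuming nothing beyond standard bash behaviour. show_args is a hypothetical stand-in for condor_transferer that only prints the arguments it receives.

```shell
# Hypothetical stand-in for condor_transferer: prints each argument on its own line.
show_args() {
    printf 'arg: %s\n' "$@"
}

# Address copied from the replication log. Single quotes stop bash from
# interpreting < and > as input/output redirection.
sinful='<131.225.152.22:41450?addrs=131.225.152.22-41450>'

# Same argument list as the replication daemon's command, with the address quoted.
show_args -f down "$sinful" /var/lib/condor/spool/Version 1 /var/lib/condor/spool/Accountantnew.log
```

With the address quoted like this, the same argument list could be passed to /usr/libexec/condor/condor_transferer. Whether the transferer then reproduces the crash from [3] is not something this sketch shows.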