
[HTCondor-users] [HTCondor-Users] condor_job_queue.log corruption making condor_schedd die.



Hello Experts,

We are seeing corruption of the condor_job_queue.log file. We keep this file in /dev/shm for performance reasons.

Yesterday the following message was reported in the schedd log at the same time our system ran out of memory (OOM). Errno 28 (no space left on device) is expected in that situation.

07/21/20 12:20:28 (pid:959121) ERROR "Failed to write real job queue log: fflush failed (errno 28); no local backup available." at line 553 in file /slots/06/dir_3214211/userdir/.tmpE5TmSx/BUILD/condor-8.5.8/src/condor_utils/log_transaction.cpp
07/21/20 12:20:28 (pid:959121) Cron: Killing all jobs
07/21/20 12:20:28 (pid:959121) CronJobList: Deleting all jobs
07/21/20 12:20:28 (pid:959121) Cron: Killing all jobs
07/21/20 12:20:28 (pid:959121) CronJobList: Deleting all jobs
07/21/20 12:20:54 (pid:3862465) Setting maximum file descriptors to 10240.
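
As a rough aid for confirming the errno 28 reading above, the sketch below decodes an errno number and reports how much of a filesystem is still free. It uses only the Python standard library; the helper names `describe_errno` and `free_fraction` are made up for this illustration and are not part of HTCondor.

```python
import errno
import os
import shutil


def describe_errno(code: int) -> str:
    """Return the symbolic name and OS message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


def free_fraction(path: str) -> float:
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a zero size; treat as nothing free.
        return 0.0
    return usage.free / usage.total
```

On the submit host one would call `describe_errno(28)` and `free_fraction("/dev/shm")`. Because /dev/shm is a memory-backed tmpfs, its free space shrinks as the machine runs low on memory, which is consistent with the fflush failure appearing during the OOM event.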

The schedd tried to come up many times but kept failing because of corruption in /dev/shm/condor_job_queue.log:

07/21/20 12:21:37 (pid:3862465) WARNING: Encountered corrupt log record 7273612 (byte offset 271118318)
07/21/20 12:21:37 (pid:3862465)   103 95760.510 RecentSt105
07/21/20 12:21:37 (pid:3862465) Lines following corrupt log record 7273612 (up to 3):
07/21/20 12:21:37 (pid:3862465)   103 95765.727 BlockReadKbytes 0
07/21/20 12:21:37 (pid:3862465)   103 95765.727 BlockReads 0
07/21/20 12:21:37 (pid:3862465)   103 95765.727 BlockWriteKbytes 0
--
07/21/20 12:22:30 (pid:3862995) WARNING: Encountered corrupt log record 7273612 (byte offset 271118318)
07/21/20 12:22:30 (pid:3862995)   103 95760.510 RecentSt105
07/21/20 12:22:30 (pid:3862995) Lines following corrupt log record 7273612 (up to 3):
07/21/20 12:22:30 (pid:3862995)   103 95765.727 BlockReadKbytes 0
07/21/20 12:22:30 (pid:3862995)   103 95765.727 BlockReads 0
07/21/20 12:22:30 (pid:3862995)   103 95765.727 BlockWriteKbytes 0
--
07/21/20 12:23:26 (pid:3863350) WARNING: Encountered corrupt log record 7273612 (byte offset 271118318)
07/21/20 12:23:26 (pid:3863350)   103 95760.510 RecentSt105
07/21/20 12:23:26 (pid:3863350) Lines following corrupt log record 7273612 (up to 3):
07/21/20 12:23:26 (pid:3863350)   103 95765.727 BlockReadKbytes 0
07/21/20 12:23:26 (pid:3863350)   103 95765.727 BlockReads 0
07/21/20 12:23:26 (pid:3863350)   103 95765.727 BlockWriteKbytes 0
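
To check whether every restart attempt trips over the same record, the warnings can be tallied per (record, byte offset) pair. The following is a minimal sketch that assumes the warning lines keep exactly the format shown above; the regular expression and the function name `corrupt_records` are illustrative, not an HTCondor tool.

```python
import re
from collections import Counter

# Matches lines such as:
# "... (pid:3862465) WARNING: Encountered corrupt log record 7273612 (byte offset 271118318)"
CORRUPT_RE = re.compile(
    r"WARNING: Encountered corrupt log record "
    r"(?P<record>\d+) \(byte offset (?P<offset>\d+)\)"
)


def corrupt_records(lines):
    """Count how often each (record number, byte offset) pair is reported."""
    hits = Counter()
    for line in lines:
        match = CORRUPT_RE.search(line)
        if match:
            hits[(int(match["record"]), int(match["offset"]))] += 1
    return hits
```

Feeding it the lines of the schedd log gives one counter entry per distinct corrupt record. A single entry with a count equal to the number of restarts suggests one damaged record is blocking startup, rather than fresh corruption on each attempt.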

A manual condor restart did not help either; ideally it should have rotated the file. We moved the file aside manually so that the condor restart would succeed, and lost all queued jobs :(

Since then we have hit the same issue a couple more times.

Example, again today:

07/22/20 02:13:42 (pid:1325065) WARNING: Encountered corrupt log record 3032919 (byte offset 106815447)
07/22/20 02:13:42 (pid:1325065)   103 20.2507 BlockWr105
07/22/20 02:13:42 (pid:1325065) Lines following corrupt log record 3032919 (up to 3):
07/22/20 02:13:42 (pid:1325065)   103 20.3944 BlockReadKbytes 0
07/22/20 02:13:42 (pid:1325065)   103 20.3944 BlockReads 0
07/22/20 02:13:42 (pid:1325065)   103 20.3944 BlockWriteKbytes 0
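
One way to look at the damaged region directly is to read the bytes around the reported offset from a copy of the job queue log. This is a sketch only: it assumes the "byte offset" in the warning points at the start of the corrupt record, which the log message does not state, and `lines_around` is a hypothetical helper name.

```python
def lines_around(path, offset, before=200, after=200):
    """Return the text lines found in a byte window around `offset`.

    Reads up to `before` bytes ahead of the offset and `after` bytes from
    the offset onward. Undecodable bytes are replaced so that binary junk
    left by a truncated write stays visible instead of raising an error.
    """
    start = max(0, offset - before)
    with open(path, "rb") as handle:
        handle.seek(start)
        chunk = handle.read((offset - start) + after)
    return chunk.decode("utf-8", errors="replace").splitlines()
```

For the case above this would be called as `lines_around("/path/to/copy/of/condor_job_queue.log", 106815447)`, where the path is a placeholder. Working on a copy avoids touching the file the schedd is trying to open.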

Sometimes it logs the following message instead:

07/21/20 22:10:51 (pid:561539) WARNING: Encountered corrupt log record 361650 (byte offset 12713949)
07/21/20 22:10:51 (pid:561539)   103 8.1142 LastJobLeaseRenewal
07/21/20 22:10:51 (pid:561539) Lines following corrupt log record 361650 (up to 3):
07/21/20 22:10:51 (pid:561539) ClassAdLog /dev/shm/condor_job_queue.log has the following issues: Detected unterminated log entry

07/21/20 22:10:51 (pid:561539) About to rotate ClassAd log /dev/shm/condor_job_queue.log


Configuration settings:

Changed QUEUE_CLEAN_INTERVAL from 1 hour to 10 minutes.

# condor_config_val -dump | grep -i _QUEUE_

JOB_QUEUE_LOG = /dev/shm/condor_job_queue.log
MAX_JOB_QUEUE_LOG_ROTATIONS = 1
PANDA_QUEUE_GRACE = 3
PANDA_QUEUE_SIZE = 131072
SCHEDD_JOB_QUEUE_LOG_FLUSH_DELAY = 5
SHADOW_LAZY_QUEUE_UPDATE = true
SHADOW_QUEUE_UPDATE_INTERVAL = 60
TRANSFER_QUEUE_USER_EXPR = strcat("Owner_",Owner)

condor_version is 8.5.8. We never encountered this issue before with the same version.

Can anyone please provide some help on troubleshooting it?

Regards,
Vikrant Aggarwal