
[HTCondor-users] Scheduler not starting



Hi,

Sometimes the scheduler (condor_schedd) no longer starts. This has happened on different machines and with different versions of HTCondor (we have used 8.0 and 8.6).
The SchedLog shows an error about a corrupt log record.
I can fix it by deleting job-queue.log in the spool folder.
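
A gentler variant of this workaround is to rename the file instead of deleting it, so the corrupt copy stays available for later inspection. The following is only a minimal sketch: the function name `set_aside_job_queue` and the backup naming scheme are made up for illustration, and the spool path in the usage comment is an assumption based on the C:\condor install location.

```python
import shutil
import time
from pathlib import Path


def set_aside_job_queue(spool_dir, name="job-queue.log"):
    """Rename the job queue log to a timestamped backup.

    Returns the new path, or None if the file does not exist.
    The helper and its naming scheme are hypothetical, not part of HTCondor.
    """
    src = Path(spool_dir) / name
    if not src.exists():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"{name}.corrupt-{stamp}")
    shutil.move(str(src), str(dst))
    return dst


# Example call (assumed spool location, adjust to the local SPOOL setting):
# set_aside_job_queue(r"C:\condor\spool")
```

This should be run while the schedd is stopped. As with deleting the file, the jobs that were in the queue are lost; the only difference is that the damaged log is kept.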

I have now tested with HTCondor 8.8.0 and got the same error:

02/19/19 09:09:24 (pid:6712) ******************************************************
02/19/19 09:09:24 (pid:6712) ** condor_schedd.exe (CONDOR_SCHEDD) STARTING UP
02/19/19 09:09:24 (pid:6712) ** C:\condor\bin\condor_schedd.exe
02/19/19 09:09:24 (pid:6712) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
02/19/19 09:09:24 (pid:6712) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
02/19/19 09:09:24 (pid:6712) ** $CondorVersion: 8.8.0 Jan 03 2019 BuildID: 457757 $
02/19/19 09:09:24 (pid:6712) ** $CondorPlatform: x86_64_Windows10 $
02/19/19 09:09:24 (pid:6712) ** PID = 6712
02/19/19 09:09:24 (pid:6712) ** Log last touched 2/19 08:35:06
02/19/19 09:09:24 (pid:6712) ******************************************************
02/19/19 09:09:24 (pid:6712) Using config source: C:\condor\condor_config
02/19/19 09:09:24 (pid:6712) Using local config sources:
02/19/19 09:09:24 (pid:6712)    C:\condor/condor_config.local
02/19/19 09:09:24 (pid:6712) config Macros = 187, Sorted = 187, StringBytes = 5237, TablesBytes = 6780
02/19/19 09:09:24 (pid:6712) CLASSAD_CACHING is ENABLED
02/19/19 09:09:24 (pid:6712) Daemon Log is logging: D_ALWAYS D_ERROR
02/19/19 09:09:24 (pid:6712) DaemonCore: non-shared command socket at <192.168.0.27:1134>
02/19/19 09:09:24 (pid:6712) Daemoncore: Listening at <0.0.0.0:1134> on TCP (ReliSock) and UDP (SafeSock).
02/19/19 09:09:24 (pid:6712) DaemonCore: command socket at <192.168.56.1:9618?addrs=192.168.56.1-9618&noUDP&sock=6712_dfde>
02/19/19 09:09:24 (pid:6712) DaemonCore: private command socket at <192.168.56.1:9618?addrs=192.168.56.1-9618&noUDP&sock=6712_dfde>
02/19/19 09:09:24 (pid:6712) History file rotation is enabled.
02/19/19 09:09:24 (pid:6712)   Maximum history file size is: 20971520 bytes
02/19/19 09:09:24 (pid:6712)   Number of rotated history files is: 2
02/19/19 09:09:24 (pid:6712) NOTE: QUEUE_ALL_USERS_TRUSTED=TRUE - all queue access checks disabled!
02/19/19 09:09:24 (pid:6712) WARNING: Encountered corrupt log record 211 (byte offset 8374)
02/19/19 09:09:24 (pid:6712)     999  
02/19/19 09:09:24 (pid:6712) Lines following corrupt log record 211 (up to 3):
02/19/19 09:09:24 (pid:6712)     103 7.0 MachineAttrSlotWeight0 2
02/19/19 09:09:24 (pid:6712)     103 7.0 StartdPrincipal "execute-side@matchsession/192.168.0.27"
02/19/19 09:09:24 (pid:6712)     103 7.0 ShadowBday 1550250864
02/19/19 09:09:24 (pid:6712) ERROR "Error: corrupt log record 211 (byte offset 8374) occurred inside closed transaction, recovery failed" at line 1114 in file C:\condor\execute\dir_6124\sources\src\condor_utils\classad_log.cpp
02/19/19 09:09:24 (pid:6712) Cron: Killing all jobs
02/19/19 09:09:24 (pid:6712) CronJobList: Deleting all jobs
02/19/19 09:09:24 (pid:6712) Cron: Killing all jobs
02/19/19 09:09:24 (pid:6712) CronJobList: Deleting all jobs
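
Since the error reports a byte offset (8374), it may help to look at the raw bytes around that position in job-queue.log before removing the file. The sketch below is one possible way to do that; `dump_around` is a hypothetical helper, and the path in the usage comment assumes the spool folder is C:\condor\spool.

```python
def dump_around(path, offset, context=200):
    """Return the raw bytes from `context` bytes before `offset`
    to `context` bytes after it (clamped at the start of the file)."""
    start = max(0, offset - context)
    length = (offset + context) - start
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length)


# Example call (assumed path; the offset is the one from the SchedLog):
# for line in dump_around(r"C:\condor\spool\job-queue.log", 8374).splitlines():
#     print(repr(line))
```

Printing each line with `repr` makes NUL bytes or a truncated record visible, which could hint at whether the file was cut off during a write (for example by a crash or power loss). That is a guess about the cause, not something the log above confirms.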


What could be causing this?
Is there a way to avoid this error?

Best regards,
Werner