
Re: [HTCondor-users] Scheduler not starting



We don’t know of any specific HTCondor bug that causes this.

Could you send us the corrupted job_queue.log file so we can take a look?

You can email it to me at johnkn@xxxxxxxxxxx

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Werner Koppelstätter
Sent: Tuesday, February 19, 2019 9:11 AM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Scheduler not starting

 

Hi,

 

Sometimes the scheduler (condor_schedd) no longer starts. This happens on different machines and with different versions of HTCondor (we have used 8.0 and 8.6).

In the SchedLog I get an error about a corrupt log record.

I can fix it by deleting job_queue.log in the spool folder.
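
Since the corrupted file was requested for inspection, it may be better to move it aside than to delete it. The following is only a minimal sketch of that idea in Python, not an HTCondor tool: the function name `set_aside_job_queue`, the `.corrupt-<timestamp>` suffix, and the example spool path are assumptions made for illustration, and it assumes the schedd is stopped while the file is moved.

```python
import shutil
import time
from pathlib import Path


def set_aside_job_queue(spool_dir, suffix=None):
    """Rename job_queue.log inside spool_dir instead of deleting it.

    Hypothetical helper: returns the new path, or None if the spool
    directory contains no job_queue.log.
    """
    src = Path(spool_dir) / "job_queue.log"
    if not src.exists():
        return None
    if suffix is None:
        # Timestamp keeps repeated incidents from overwriting each other.
        suffix = time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"job_queue.log.corrupt-{suffix}")
    shutil.move(str(src), str(dst))
    return dst


if __name__ == "__main__":
    # Example spool path is an assumption; use the SPOOL directory of your install.
    print(set_aside_job_queue(r"C:\condor\spool"))
```

As with deleting the file, the schedd then starts without the old file, but the renamed copy is still available to send to the developers.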

 

I have now tested with HTCondor 8.8.0 and got the same error.

 

02/19/19 09:09:24 (pid:6712) ******************************************************
02/19/19 09:09:24 (pid:6712) ** condor_schedd.exe (CONDOR_SCHEDD) STARTING UP
02/19/19 09:09:24 (pid:6712) ** C:\condor\bin\condor_schedd.exe
02/19/19 09:09:24 (pid:6712) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
02/19/19 09:09:24 (pid:6712) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
02/19/19 09:09:24 (pid:6712) ** $CondorVersion: 8.8.0 Jan 03 2019 BuildID: 457757 $
02/19/19 09:09:24 (pid:6712) ** $CondorPlatform: x86_64_Windows10 $
02/19/19 09:09:24 (pid:6712) ** PID = 6712
02/19/19 09:09:24 (pid:6712) ** Log last touched 2/19 08:35:06
02/19/19 09:09:24 (pid:6712) ******************************************************
02/19/19 09:09:24 (pid:6712) Using config source: C:\condor\condor_config
02/19/19 09:09:24 (pid:6712) Using local config sources:
02/19/19 09:09:24 (pid:6712)    C:\condor/condor_config.local
02/19/19 09:09:24 (pid:6712) config Macros = 187, Sorted = 187, StringBytes = 5237, TablesBytes = 6780
02/19/19 09:09:24 (pid:6712) CLASSAD_CACHING is ENABLED
02/19/19 09:09:24 (pid:6712) Daemon Log is logging: D_ALWAYS D_ERROR
02/19/19 09:09:24 (pid:6712) DaemonCore: non-shared command socket at <192.168.0.27:1134>
02/19/19 09:09:24 (pid:6712) Daemoncore: Listening at <0.0.0.0:1134> on TCP (ReliSock) and UDP (SafeSock).
02/19/19 09:09:24 (pid:6712) DaemonCore: command socket at <192.168.56.1:9618?addrs=192.168.56.1-9618&noUDP&sock=6712_dfde>
02/19/19 09:09:24 (pid:6712) DaemonCore: private command socket at <192.168.56.1:9618?addrs=192.168.56.1-9618&noUDP&sock=6712_dfde>
02/19/19 09:09:24 (pid:6712) History file rotation is enabled.
02/19/19 09:09:24 (pid:6712)   Maximum history file size is: 20971520 bytes
02/19/19 09:09:24 (pid:6712)   Number of rotated history files is: 2
02/19/19 09:09:24 (pid:6712) NOTE: QUEUE_ALL_USERS_TRUSTED=TRUE - all queue access checks disabled!
02/19/19 09:09:24 (pid:6712) WARNING: Encountered corrupt log record 211 (byte offset 8374)
02/19/19 09:09:24 (pid:6712)     999  
02/19/19 09:09:24 (pid:6712) Lines following corrupt log record 211 (up to 3):
02/19/19 09:09:24 (pid:6712)     103 7.0 MachineAttrSlotWeight0 2
02/19/19 09:09:24 (pid:6712)     103 7.0 StartdPrincipal "execute-side@matchsession/192.168.0.27"
02/19/19 09:09:24 (pid:6712)     103 7.0 ShadowBday 1550250864
02/19/19 09:09:24 (pid:6712) ERROR "Error: corrupt log record 211 (byte offset 8374) occurred inside closed transaction, recovery failed" at line 1114 in file C:\condor\execute\dir_6124\sources\src\condor_utils\classad_log.cpp
02/19/19 09:09:24 (pid:6712) Cron: Killing all jobs
02/19/19 09:09:24 (pid:6712) CronJobList: Deleting all jobs
02/19/19 09:09:24 (pid:6712) Cron: Killing all jobs
02/19/19 09:09:24 (pid:6712) CronJobList: Deleting all jobs
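
The log above names the corrupt record and its byte offset, which makes it possible to look at the raw bytes around that position in a copy of job_queue.log. Below is a rough Python sketch of that inspection, assuming only the message wording shown in this SchedLog excerpt; the helper names `find_corrupt_records` and `bytes_around` are made up for illustration and are not part of HTCondor.

```python
import re

# Matches the wording seen in the SchedLog excerpt above, e.g.
# "corrupt log record 211 (byte offset 8374)".
CORRUPT_RE = re.compile(r"corrupt log record (\d+) \(byte offset (\d+)\)")


def find_corrupt_records(sched_log_lines):
    """Return sorted, unique (record_number, byte_offset) pairs."""
    found = set()
    for line in sched_log_lines:
        for match in CORRUPT_RE.finditer(line):
            found.add((int(match.group(1)), int(match.group(2))))
    return sorted(found)


def bytes_around(path, offset, context=200):
    """Read up to `context` bytes before and after `offset` from a file."""
    start = max(0, offset - context)
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(offset - start + context)
```

For example, feeding the lines of the SchedLog to `find_corrupt_records` and then calling `bytes_around` on a copy of job_queue.log with the reported offset shows what surrounds the damaged record, without modifying the file.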

 

 

What could be the reason for this?

Is there any way to avoid this error?

 

Best regards,

Werner