
[HTCondor-users] Error from SCHEDD on Windows 10



$CondorVersion: 8.6.13 Oct 30 2018 BuildID: 453497 $

I am having a lot of issues on a Windows "grid" of machines (all fast
Windows 10 machines with >32 GB of RAM and 16 cores each, for a total
of 192 slots) when submitting 800 jobs in a single .sub file. I use a
single .sub file so that the jobs can be monitored by the Condor Perl
script. Typically there seems to be only about a 1 in 4 chance that the
jobs will all complete.

Perhaps the following error is a clue?

"The SCHEDD's child process with pid 21156 has spent 100.0% of its
time waiting for a lock to its log file.  This could indicate a
scalability limit that could cause system stability problems"

The latest mystery is 2 zombie jobs (out of 800) that Condor says are
running on a particular machine but that cannot be found in any of the
Starter logs on that machine. Restarting Condor on both machines
produced the following in the ShadowLog. However, although the log says
"Reconnect SUCCESS", the jobs do not start on the 192.168.0.123 machine
and remain in the zombie state.



07/29/19 08:13:58 (153.81) (25524): Attempting to locate disconnected starter
07/29/19 08:13:58 DaemonCore: command socket at
<192.168.0.105:9618?addrs=192.168.0.105-9618&noUDP&sock=24052_65ec_1>
07/29/19 08:13:58 DaemonCore: private command socket at
<192.168.0.105:9618?addrs=192.168.0.105-9618&noUDP&sock=24052_65ec_1>
07/29/19 08:13:58 Initializing a VANILLA shadow for job 153.79
07/29/19 08:13:58 (153.81) (25524): Found starter:
<192.168.0.123:9618?addrs=192.168.0.123-9618&noUDP&sock=5352_88d8_734>
07/29/19 08:13:58 (153.81) (25524): Attempting to reconnect to starter
<192.168.0.123:9618?addrs=192.168.0.123-9618&noUDP&sock=5352_88d8_734>
07/29/19 08:13:58 (153.79) (17264): Trying to reconnect to disconnected job
07/29/19 08:13:58 (153.79) (17264): LastJobLeaseRenewal: 1564412982
Mon Jul 29 08:09:42 2019
07/29/19 08:13:58 (153.79) (17264): JobLeaseDuration: 2400 seconds
07/29/19 08:13:58 (153.79) (17264): JobLeaseDuration remaining: 2144
07/29/19 08:13:58 (153.79) (17264): Attempting to locate disconnected starter
07/29/19 08:13:58 (153.79) (17264): Found starter:
<192.168.0.123:9618?addrs=192.168.0.123-9618&noUDP&sock=5352_88d8_732>
07/29/19 08:13:58 (153.79) (17264): Attempting to reconnect to starter
<192.168.0.123:9618?addrs=192.168.0.123-9618&noUDP&sock=5352_88d8_732>
07/29/19 08:13:58 (153.81) (25524): Reconnect SUCCESS: connection re-established
07/29/19 08:13:58 (153.79) (17264): Reconnect SUCCESS: connection re-established
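
As a sanity check on the lease values logged for job 153.79, here is a
minimal Python sketch of the arithmetic. It assumes the log timestamp and
the human-readable LastJobLeaseRenewal time are in the same local time
zone; the function name lease_remaining is made up for illustration and
is not part of Condor.

```python
from datetime import datetime

def lease_remaining(last_renewal, now, lease_duration):
    """Seconds left on a job lease, given the last renewal time,
    the current time, and the lease duration in seconds."""
    elapsed = int((now - last_renewal).total_seconds())
    return lease_duration - elapsed

# Values copied from the ShadowLog excerpt for job 153.79.
last_renewal = datetime(2019, 7, 29, 8, 9, 42)   # LastJobLeaseRenewal
reconnect_at = datetime(2019, 7, 29, 8, 13, 58)  # timestamp of the log line
lease_duration = 2400                            # JobLeaseDuration

print(lease_remaining(last_renewal, reconnect_at, lease_duration))  # 2144
```

The result matches the "JobLeaseDuration remaining: 2144" line, so the
lease had not expired when the shadow reconnected.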