
Re: [HTCondor-users] Error from SCHEDD on Windows 10



The message about "waiting for a lock" is indeed a problem. What do you have SHADOW_DEBUG and/or ALL_DEBUG set to?

Is 153.79 one of the zombie jobs? Or is that a message about a job that is/was actually running?

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Andrew Cunningham
Sent: Monday, July 29, 2019 10:51 AM
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Error from SCHEDD on Windows 10

$CondorVersion: 8.6.13 Oct 30 2018 BuildID: 453497 $

I am having a lot of issues on a Windows "grid" of machines (all fast
Windows 10 machines with >32GB of RAM and 16 cores, for a total of 192
slots) when submitting 800 jobs in a single .sub file. I use a
single .sub file so the jobs can be monitored by the Condor PERL
script. Typically I seem to have about a 1 in 4 chance that the jobs
will all complete.

Perhaps the following error is a clue?

"The SCHEDD's child process with pid 21156 has spent 100.0% of its
time waiting for a lock to its log file.  This could indicate a
scalability limit that could cause system stability problems"

The latest mystery is 2 (out of 800) zombie jobs that Condor says are
running on a particular machine but that cannot be found on that machine
in any of the Starter logs. Restarting Condor on both machines resulted
in the following in the ShadowLog. However, although the log says
"Reconnect SUCCESS", the jobs do not start on the 192.168.0.123 machine
and remain in the zombie state.



07/29/19 08:13:58 (153.81) (25524): Attempting to locate disconnected starter
07/29/19 08:13:58 DaemonCore: command socket at <192.168.0.105:9618?addrs=192.168.0.105-9618&noUDP&sock=24052_65ec_1>
07/29/19 08:13:58 DaemonCore: private command socket at <192.168.0.105:9618?addrs=192.168.0.105-9618&noUDP&sock=24052_65ec_1>
07/29/19 08:13:58 Initializing a VANILLA shadow for job 153.79
07/29/19 08:13:58 (153.81) (25524): Found starter: <192.168.0.123:9618?addrs=192.168.0.123-9618&noUDP&sock=5352_88d8_734>
07/29/19 08:13:58 (153.81) (25524): Attempting to reconnect to starter <192.168.0.123:9618?addrs=192.168.0.123-9618&noUDP&sock=5352_88d8_734>
07/29/19 08:13:58 (153.79) (17264): Trying to reconnect to disconnected job
07/29/19 08:13:58 (153.79) (17264): LastJobLeaseRenewal: 1564412982 Mon Jul 29 08:09:42 2019
07/29/19 08:13:58 (153.79) (17264): JobLeaseDuration: 2400 seconds
07/29/19 08:13:58 (153.79) (17264): JobLeaseDuration remaining: 2144
07/29/19 08:13:58 (153.79) (17264): Attempting to locate disconnected starter
07/29/19 08:13:58 (153.79) (17264): Found starter: <192.168.0.123:9618?addrs=192.168.0.123-9618&noUDP&sock=5352_88d8_732>
07/29/19 08:13:58 (153.79) (17264): Attempting to reconnect to starter <192.168.0.123:9618?addrs=192.168.0.123-9618&noUDP&sock=5352_88d8_732>
07/29/19 08:13:58 (153.81) (25524): Reconnect SUCCESS: connection re-established
07/29/19 08:13:58 (153.79) (17264): Reconnect SUCCESS: connection re-established
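
As a sanity check on the lease figures for job 153.79 in the log above, the "JobLeaseDuration remaining" value appears to be the lease duration minus the time elapsed since the last renewal. The short Python sketch below reproduces that arithmetic from the wall-clock times printed in the log. The function name `lease_remaining` is hypothetical and is not part of HTCondor, and the sketch assumes both times fall on the same day in the same time zone.

```python
from datetime import datetime

def lease_remaining(log_time: str, renewal_time: str, lease_duration: int) -> int:
    """Seconds left on a job lease, given two same-day HH:MM:SS wall-clock times.

    Hypothetical helper for illustration only; not an HTCondor API.
    """
    fmt = "%H:%M:%S"
    elapsed = datetime.strptime(log_time, fmt) - datetime.strptime(renewal_time, fmt)
    return lease_duration - int(elapsed.total_seconds())

# Values taken from the ShadowLog excerpt for job 153.79:
# log line time 08:13:58, last renewal 08:09:42, JobLeaseDuration 2400 seconds.
print(lease_remaining("08:13:58", "08:09:42", 2400))  # prints 2144
```

The result matches the 2144 seconds reported in the log, so at the time of the reconnect the lease for 153.79 had not yet expired.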