[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Condor running slow (q,status,submit)



I just looked at the latest log files, and it looks like the current 30 second delay is in the condor_starter thatâs launched by the condor_startd. The starter inherits the network socket made by the shadow from the startd. So now we should look in the StarterLog.slot1.

 - Jaime

On Mar 26, 2019, at 3:48 PM, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

There was a problem with the size of the logs you posted to this list, but I do have them now. 
 
This is getting beyond my knowledge base.
 
I see a 30 second gap in the ShadowLog between 2 very small reads from the same socket here. 
 
03/26/19 09:39:24 (D_NETWORK) (6.0) (874): condor_read(fd=6 <192.168.2.1:56841>,,size=5,timeout=0,flags=0,non_blocking=0)
03/26/19 09:39:54 (D_NETWORK) (6.0) (874): condor_read(fd=6 <192.168.2.1:56841>,,size=35,timeout=0,flags=0,non_blocking=0)
 
I donât think this is a connection to the STARTD, because I donât see port 56841 mentioned in the StartLog. Perhaps this is in the StarterLog? or StarterLog.slotN?
 
At the same time we see this in the StartLog.  This is a gap of the same amount of time, but does not appear to be the other side of that conversation.
 
03/26/19 09:38:56 (D_ALWAYS:2) slot1: Sent update to 1 collector(s)
03/26/19 09:39:23 (D_NETWORK) condor_read(fd=15 <127.0.0.1:56837>,,size=5,timeout=10,flags=0,non_blocking=0)
03/26/19 09:39:23 (D_NETWORK) condor_read(fd=15 <127.0.0.1:56837>,,size=137,timeout=10,flags=0,non_blocking=0)
03/26/19 09:39:23 (D_ALWAYS:2) slot1[6.0]: Received job ClassAd update from starter.
03/26/19 09:39:54 (D_NETWORK) ACCEPT bound to <192.168.2.1:51092> fd=10 peer=<192.168.2.1:56843>
03/26/19 09:39:54 (D_SECURITY) DaemonCommandProtocol: Not enough bytes are ready for read.
03/26/19 09:39:54 (D_NETWORK) condor_read(fd=10 <192.168.2.1:56843>,,size=5,timeout=1,flags=2,non_blocking=0)
03/26/19 09:39:54 (D_NETWORK) condor_read(fd=10 <192.168.2.1:56843>,,size=5,timeout=0,flags=0,non_blocking=1)
03/26/19 09:39:54 (D_NETWORK) condor_read(fd=10 <192.168.2.1:56843>,,size=747,timeout=0,flags=0,non_blocking=1)
 
 
 
-tj
 
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller
Sent: Tuesday, March 26, 2019 11:23 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
Still seeing a timeout.   in the ShadowLog, a 30 second gap
 
03/26/19 08:55:37 (D_NETWORK) condor_read(fd=7 <192.168.2.1:56589>,,size=5,timeout=20,flags=0,non_blocking=0)
03/26/19 08:55:37 (D_NETWORK) condor_read(): fd=7
03/26/19 08:55:57 (D_NETWORK) condor_read(): select returned 0
03/26/19 08:55:57 (D_ALWAYS) condor_read(): timeout reading 5 bytes from <192.168.2.1:56589>.
 
And a 20 second gap in the StarterLog, followed by a timeout.
 
03/26/19 08:55:37 (D_NETWORK) condor_read(fd=7 <192.168.2.1:56589>,,size=5,timeout=20,flags=0,non_blocking=0)
03/26/19 08:55:37 (D_NETWORK) condor_read(): fd=7
03/26/19 08:55:57 (D_NETWORK) condor_read(): select returned 0
 
Lets try turning the logging up even further.
 
please change your config to this, reconfig and retry
 
SHADOW_DEBUG = $(SHADOW_DEBUG) D_CAT D_FULLDEBUG D_NETWORK D_HOSTNAME D_SECURITY:1
STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_FULLDEBUG D_NETWORK D_HOSTNAME D_SECURITY:1
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 11:08 AM
To: HTCondor-Users Mail List <
htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <
mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
TJ,
 
Ok, added those commands to my condor_config.local. I also noticed that when I installed the new version, the condor_config file and the condor directory didnât have full permissions, so I changed that to user condor.
 
Attached are the ShadowLog and StartLog after the reconfig.
 
steve
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "
htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 11:39 AM
To: "
htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <
mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
This is starting to look like something broken in your network stack that is outside of HTCondor.   The Sched starts the Shadow at 06:58:03
but the shadow doesnât actually begin logging until 06:58:33.  The shadow doesnât do much on startup before it begins logging, but one of the things it does is to read the HTCondor configuration files.  Are those files on a shared file system perhaps?
 
The shadow then contacts the startd at 06:58:33, but then times out 30 seconds later.  During this time we see this in the StarterLog
 
03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed
03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.
 
A timeout after 20 seconds.   This looks a lot like your network is occasionally stalling for periods of 30 seconds, and this is leading HTCondor to time out a lot.
 
Letâs turn on some more logging and maybe get more insight  the problem.
 
please add this to your configuration.
 
SHADOW_DEBUG = $(SHADOW_DEBUG) D_CAT D_FULLDEBUG D_NETWORK
STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_FULLDEBUG D_NETWORK
 
Then run condor_reconfig on both submit and execute machines and try again to run a job.
 
-tj
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 10:17 AM
To: HTCondor-Users Mail List <
htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <
mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
Hereâs a bit more time:
 
03/26/19 06:58:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>
03/26/19 06:58:03 (pid:51270) Using negotiation protocol: NEGOTIATE
03/26/19 06:58:03 (pid:51270) Negotiating for owner:condor@xxxxxxxxxxxxxxx
03/26/19 06:58:03 (pid:51270) Lost priority - 1 jobs matched
03/26/19 06:58:03 (pid:51270) Finished negotiating for condor in local pool: 1 matched, 0 rejected
03/26/19 06:58:03 (pid:51270) Starting add_shadow_birthdate(3.0)
03/26/19 06:58:03 (pid:51270) Started shadow for job 3.0 onslot1@xxxxxxxxxxxxxxxxxx <192.168.2.1:51092?addrs=192.168.2.1-51092> for condor, (shadow pid = 92037)
03/26/19 06:58:10 (pid:51270) Number of Active Workers 0
03/26/19 06:59:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>
03/26/19 06:59:03 (pid:51270) Using negotiation protocol: NEGOTIATE
03/26/19 06:59:03 (pid:51270) Negotiating for owner:condor@xxxxxxxxxxxxxxx
03/26/19 06:59:03 (pid:51270) Finished negotiating for condor in local pool: 0 matched, 0 rejected
03/26/19 06:59:03 (pid:51270) Shadow pid 92037 for job 3.0 exited with status 108
 
steve
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "
htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 11:05 AM
To: "
htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <
mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
What is happening in the SchedLog during this 20 second time period?
 
03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed
03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.
 
-tj
 
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 9:42 AM
To: HTCondor-Users Mail List <
htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <
mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
Follow-up: firewall on the Mac is turned off.
 
steve
 
From: Stephen Upton <scupton@xxxxxxx>
Date: Tuesday, March 26, 2019 at 10:38 AM
To: "
htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <
mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
Here you go:
 
03/26/19 06:58:03 slot1: Request accepted.
03/26/19 06:58:33 slot1: Remote owner is condor@xxxxxxxxxxxxxxx
03/26/19 06:58:33 slot1: State change: claiming protocol successful
03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed
03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.
03/26/19 06:58:53 IO: Failed to read packet header
03/26/19 06:58:53 Can't read ClaimId
03/26/19 06:59:03 slot1: Called deactivate_claim()
03/26/19 06:59:03 slot1: State change: received RELEASE_CLAIM command
03/26/19 06:59:03 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
03/26/19 06:59:03 slot1: State change: No preempting claim, returning to owner
03/26/19 06:59:03 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
03/26/19 06:59:03 slot1: State change: IS_OWNER is false
03/26/19 06:59:03 slot1: Changing state: Owner -> Unclaimed
 
I also have the condor directory shared (or at least with group rw). Iâm also running condor as user condor and not as root, if that helps.
 
steve
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "
htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 10:34 AM
To: "
htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
Yes. this looks a lot like a failure to authorize the connection, or possibly a firewall.
 
03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5
03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed
03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd
 
What does the StartLog and/or StarterLog show at 6:58/6:59 on reaper.ern.nps.edu?
 
-tj
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 9:09 AM
To: HTCondor-Users Mail List <
htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <
mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
TJ,
 
Thanks for responding!
 
Are you running 8.8.1 on all of the nodes? No, but Iâm only testing on my central manager now. I want to get that correct before updating my Windows nodes. The job(s) Iâm submitting are targeted to the central manager.
Does the output of  condor_status, show all of your execute nodes? It shows all the slots on my central manager.
Does condor_q show your jobs? Yes.
 
I did notice that it was taking exactly 30 seconds for condor to respond to a condor_q, condor_status, and condor_submit, so I think that helps confirm my suspicion that there is a setting somewhere I donât have correct.
 
Hereâs the (partial) output from the ShadowLog:
03/26/19 06:58:33 Using local config sources:
03/26/19 06:58:33    /usr/local/condor/local.reaper/condor_config.local
03/26/19 06:58:33 config Macros = 70, Sorted = 70, StringBytes = 2008, TablesBytes = 1168
03/26/19 06:58:33 CLASSAD_CACHING is OFF
03/26/19 06:58:33 Daemon Log is logging: D_ALWAYS D_ERROR
03/26/19 06:58:33 Daemoncore: Listening at <0.0.0.0:56067> on TCP (ReliSock).
03/26/19 06:58:33 DaemonCore: command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067&noUDP>
03/26/19 06:58:33 DaemonCore: private command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067>
03/26/19 06:58:33 Initializing a VANILLA shadow for job 3.0
03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5
03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed
03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd
03/26/19 06:59:03 (3.0) (92037): Job 3.0 is being evicted fromslot1@xxxxxxxxxxxxxxxxxx
03/26/19 06:59:03 (3.0) (92037): logEvictEvent with unknown reason (108), not logging.
03/26/19 06:59:03 (3.0) (92037): **** condor_shadow (condor_SHADOW) pid 92037 EXITING WITH STATUS 108
 
 
Here is a copy of my condor_config.local as well.
 
##  Where have you installed the bin, sbin and lib condor directories?
 
RELEASE_DIR = /usr/local/condor
ENABLE_IPV6 = False
 
USE_SHARED_PORT = False
 
##  Where is the local condor directory for each host?  This is where the local config file(s), logs and
##  spool/execute directories are located. this is the default for Linux and Unix systems.
##  this is the default on Windows sytems
NETWORK_HOSTNAME = reaper.ern.nps.edu
 
NETWORK_INTERFACE = 192.168.2.1
 
LOCAL_DIR = /usr/local/condor/local.reaper
 
UID_DOMAIN = reaper.localnet
 
JAVA = /usr/bin/java
 
 
MAIL = /usr/bin/mail
 
FILESYSTEM_DOMAIN = reaper.localnet
 
LOCK = /tmp/condor-lock.0.21512031190014
 
JAVA_MAXHEAP_ARGUMENT = -Xmx1024m
 
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
 
CONDOR_HOST = 192.168.2.1
 
CONDOR_IDS = 503.20
 
NUM_CPUS = 15
 
SEC_DEFAULT_NEGOTIATION = OPTIONAL
 
HOSTALLOW_WRITE = 192.168.2.*
 
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "
htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 9:40 AM
To: "
htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <
mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
Ok.  We need a bit more information in order to figure out what is happening.  Lets start with the basics.
 
Are you running 8.8.1 on all of the nodes?
Does the output of  condor_status, show all of your execute nodes?
Does condor_q show your jobs?
 
if the jobs are getting matches, but failing to start, then the place to look is in the ShadowLog
on the submit machine.   run
 
condor_config_val shadow_log
 
on the submit node to find out where that is.  You should expect to see messages indicating that a condor_shadow
has started up, and then it will identify what job it is attempting to run.
 
-tj
 
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Monday, March 25, 2019 4:30 PM
To: 
htcondor-users@xxxxxxxxxxx
Cc: McDonald, Mary (CIV) <
mlmcdona@xxxxxxx>
Subject: [HTCondor-users] Condor running slow (q,status,submit)
 
Hi all,
 
HTCondor is running really slow, and now itâs not accepting jobs, i.e., when I do a better-analyze, it matches, then subsequently rejects. I had 8.6.1 installed, removed that, and installed 8.8.1, hoping that was the problem. The central manager is on Mac OS10, with several Windows execute nodes. I also have the Mac as an execute and submit node. Iâm sure this is a configuration issue somewhere, but I canât figure out where. I do get a âinit_local_hostname_impl: ipv6_getaddrinfo() could not look upâ in my MasterLog, but I have ENABLE_IPV6 disabled.
 
Thanx
steve
 
Stephen C. Upton
Faculty Associate - Research
SEED (Simulation Experiments & Efficient Designs) Center for Data Farming
Operations Research Department
Naval Postgraduate School
Mobile: 804-994-4257
SEED Center website: https://harvest.nps.edu
 
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/