[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Condor running slow (q,status,submit)



The 30 second delay in the shadow log happens right when it evaluates the security settings to decide if it should be talking to the remote machine. This can trigger a number of DNS queries to determine how any hostnames set in the configuration resolve to IP addresses. This is probably a DNS issue.

 - Jaime

On Mar 26, 2019, at 11:23 AM, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

Still seeing a timeout.   in the ShadowLog, a 30 second gap
 
03/26/19 08:55:37 (D_NETWORK) condor_read(fd=7 <192.168.2.1:56589>,,size=5,timeout=20,flags=0,non_blocking=0)
03/26/19 08:55:37 (D_NETWORK) condor_read(): fd=7
03/26/19 08:55:57 (D_NETWORK) condor_read(): select returned 0
03/26/19 08:55:57 (D_ALWAYS) condor_read(): timeout reading 5 bytes from <192.168.2.1:56589>.
 
And a 20 second gap in the StarterLog, followed by a timeout.
 
03/26/19 08:55:37 (D_NETWORK) condor_read(fd=7 <192.168.2.1:56589>,,size=5,timeout=20,flags=0,non_blocking=0)
03/26/19 08:55:37 (D_NETWORK) condor_read(): fd=7
03/26/19 08:55:57 (D_NETWORK) condor_read(): select returned 0
 
Lets try turning the logging up even further.
 
please change your config to this, reconfig and retry
 
SHADOW_DEBUG = $(SHADOW_DEBUG) D_CAT D_FULLDEBUG D_NETWORK D_HOSTNAME D_SECURITY:1
STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_FULLDEBUG D_NETWORK D_HOSTNAME D_SECURITY:1
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 11:08 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
TJ,
 
Ok, added those commands to my condor_config.local. I also noticed that when I installed the new version, the condor_config file and the condor directory didnât have full permissions, so I changed that to user condor.
 
Attached are the ShadowLog and StartLog after the reconfig.
 
steve
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 11:39 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
This is starting to look like something broken in your network stack that is outside of HTCondor.   The Sched starts the Shadow at 06:58:03
but the shadow doesnât actually begin logging until 06:58:33.  The shadow doesnât do much on startup before it begins logging, but one of the things it does is to read the HTCondor configuration files.  Are those files on a shared file system perhaps?
 
The shadow then contacts the startd at 06:58:33, but then times out 30 seconds later.  During this time we see this in the StarterLog
 
03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed
03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.
 
A timeout after 20 seconds.   This looks a lot like your network is occasionally stalling for periods of 30 seconds, and this is leading HTCondor to time out a lot.
 
Letâs turn on some more logging and maybe get more insight  the problem.
 
please add this to your configuration.
 
SHADOW_DEBUG = $(SHADOW_DEBUG) D_CAT D_FULLDEBUG D_NETWORK
STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_FULLDEBUG D_NETWORK
 
Then run condor_reconfig on both submit and execute machines and try again to run a job.
 
-tj
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 10:17 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
Hereâs a bit more time:
 
03/26/19 06:58:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>
03/26/19 06:58:03 (pid:51270) Using negotiation protocol: NEGOTIATE
03/26/19 06:58:03 (pid:51270) Negotiating for owner:condor@xxxxxxxxxxxxxxx
03/26/19 06:58:03 (pid:51270) Lost priority - 1 jobs matched
03/26/19 06:58:03 (pid:51270) Finished negotiating for condor in local pool: 1 matched, 0 rejected
03/26/19 06:58:03 (pid:51270) Starting add_shadow_birthdate(3.0)
03/26/19 06:58:03 (pid:51270) Started shadow for job 3.0 onslot1@xxxxxxxxxxxxxxxxxx <192.168.2.1:51092?addrs=192.168.2.1-51092> for condor, (shadow pid = 92037)
03/26/19 06:58:10 (pid:51270) Number of Active Workers 0
03/26/19 06:59:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>
03/26/19 06:59:03 (pid:51270) Using negotiation protocol: NEGOTIATE
03/26/19 06:59:03 (pid:51270) Negotiating for owner:condor@xxxxxxxxxxxxxxx
03/26/19 06:59:03 (pid:51270) Finished negotiating for condor in local pool: 0 matched, 0 rejected
03/26/19 06:59:03 (pid:51270) Shadow pid 92037 for job 3.0 exited with status 108
 
steve
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 11:05 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
What is happening in the SchedLog during this 20 second time period?
 
03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed
03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.
 
-tj
 
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 9:42 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
Follow-up: firewall on the Mac is turned off.
 
steve
 
From: Stephen Upton <scupton@xxxxxxx>
Date: Tuesday, March 26, 2019 at 10:38 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
Here you go:
 
03/26/19 06:58:03 slot1: Request accepted.
03/26/19 06:58:33 slot1: Remote owner is condor@xxxxxxxxxxxxxxx
03/26/19 06:58:33 slot1: State change: claiming protocol successful
03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed
03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.
03/26/19 06:58:53 IO: Failed to read packet header
03/26/19 06:58:53 Can't read ClaimId
03/26/19 06:59:03 slot1: Called deactivate_claim()
03/26/19 06:59:03 slot1: State change: received RELEASE_CLAIM command
03/26/19 06:59:03 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
03/26/19 06:59:03 slot1: State change: No preempting claim, returning to owner
03/26/19 06:59:03 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
03/26/19 06:59:03 slot1: State change: IS_OWNER is false
03/26/19 06:59:03 slot1: Changing state: Owner -> Unclaimed
 
I also have the condor directory shared (or at least with group rw). Iâm also running condor as user condor and not as root, if that helps.
 
steve
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 10:34 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
Yes. this looks a lot like a failure to authorize the connection, or possibly a firewall.
 
03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5
03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed
03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd
 
What does the StartLog and/or StarterLog show at 6:58/6:59 on reaper.ern.nps.edu?
 
-tj
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 9:09 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
TJ,
 
Thanks for responding!
 
Are you running 8.8.1 on all of the nodes? No, but Iâm only testing on my central manager now. I want to get that correct before updating my Windows nodes. The job(s) Iâm submitting are targeted to the central manager.
Does the output of  condor_status, show all of your execute nodes? It shows all the slots on my central manager.
Does condor_q show your jobs? Yes.
 
I did notice that it was taking exactly 30 seconds for condor to respond to a condor_q, condor_status, and condor_submit, so I think that helps confirm my suspicion that there is a setting somewhere I donât have correct.
 
Hereâs the (partial) output from the ShadowLog:
03/26/19 06:58:33 Using local config sources:
03/26/19 06:58:33    /usr/local/condor/local.reaper/condor_config.local
03/26/19 06:58:33 config Macros = 70, Sorted = 70, StringBytes = 2008, TablesBytes = 1168
03/26/19 06:58:33 CLASSAD_CACHING is OFF
03/26/19 06:58:33 Daemon Log is logging: D_ALWAYS D_ERROR
03/26/19 06:58:33 Daemoncore: Listening at <0.0.0.0:56067> on TCP (ReliSock).
03/26/19 06:58:33 DaemonCore: command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067&noUDP>
03/26/19 06:58:33 DaemonCore: private command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067>
03/26/19 06:58:33 Initializing a VANILLA shadow for job 3.0
03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5
03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed
03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd
03/26/19 06:59:03 (3.0) (92037): Job 3.0 is being evicted fromslot1@xxxxxxxxxxxxxxxxxx
03/26/19 06:59:03 (3.0) (92037): logEvictEvent with unknown reason (108), not logging.
03/26/19 06:59:03 (3.0) (92037): **** condor_shadow (condor_SHADOW) pid 92037 EXITING WITH STATUS 108
 
 
Here is a copy of my condor_config.local as well.
 
##  Where have you installed the bin, sbin and lib condor directories?
 
RELEASE_DIR = /usr/local/condor
ENABLE_IPV6 = False
 
USE_SHARED_PORT = False
 
##  Where is the local condor directory for each host?  This is where the local config file(s), logs and
##  spool/execute directories are located. this is the default for Linux and Unix systems.
##  this is the default on Windows sytems
NETWORK_HOSTNAME = reaper.ern.nps.edu
 
NETWORK_INTERFACE = 192.168.2.1
 
LOCAL_DIR = /usr/local/condor/local.reaper
 
UID_DOMAIN = reaper.localnet
 
JAVA = /usr/bin/java
 
 
MAIL = /usr/bin/mail
 
FILESYSTEM_DOMAIN = reaper.localnet
 
LOCK = /tmp/condor-lock.0.21512031190014
 
JAVA_MAXHEAP_ARGUMENT = -Xmx1024m
 
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
 
CONDOR_HOST = 192.168.2.1
 
CONDOR_IDS = 503.20
 
NUM_CPUS = 15
 
SEC_DEFAULT_NEGOTIATION = OPTIONAL
 
HOSTALLOW_WRITE = 192.168.2.*
 
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 9:40 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)
 
Ok.  We need a bit more information in order to figure out what is happening.  Lets start with the basics.
 
Are you running 8.8.1 on all of the nodes?
Does the output of  condor_status, show all of your execute nodes?
Does condor_q show your jobs?
 
if the jobs are getting matches, but failing to start, then the place to look is in the ShadowLog
on the submit machine.   run
 
condor_config_val shadow_log
 
on the submit node to find out where that is.  You should expect to see messages indicating that a condor_shadow
has started up, and then it will identify what job it is attempting to run.
 
-tj
 
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Monday, March 25, 2019 4:30 PM
To: htcondor-users@xxxxxxxxxxx
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: [HTCondor-users] Condor running slow (q,status,submit)
 
Hi all,
 
HTCondor is running really slow, and now itâs not accepting jobs, i.e., when I do a better-analyze, it matches, then subsequently rejects. I had 8.6.1 installed, removed that, and installed 8.8.1, hoping that was the problem. The central manager is on Mac OS10, with several Windows execute nodes. I also have the Mac as an execute and submit node. Iâm sure this is a configuration issue somewhere, but I canât figure out where. I do get a âinit_local_hostname_impl: ipv6_getaddrinfo() could not look upâ in my MasterLog, but I have ENABLE_IPV6 disabled.
 
Thanx
steve
 
Stephen C. Upton
Faculty Associate - Research
SEED (Simulation Experiments & Efficient Designs) Center for Data Farming
Operations Research Department
Naval Postgraduate School
Mobile: 804-994-4257
SEED Center website: https://harvest.nps.edu
 
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/