Re: [HTCondor-users] Condor running slow (q,status,submit)

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

- Jaime

On Mar 26, 2019, at 3:48 PM, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

There was a problem with the size of the logs you posted to this list, but I do have them now.

This is getting beyond my knowledge base.

I see a 30 second gap in the ShadowLog between 2 very small reads from the same socket here.

03/26/19 09:39:24 (D_NETWORK) (6.0) (874): condor_read(fd=6 <192.168.2.1:56841>,,size=5,timeout=0,flags=0,non_blocking=0)

03/26/19 09:39:54 (D_NETWORK) (6.0) (874): condor_read(fd=6 <192.168.2.1:56841>,,size=35,timeout=0,flags=0,non_blocking=0)

I donât think this is a connection to the STARTD, because I donât see port 56841 mentioned in the StartLog. Perhaps this is in the StarterLog? or StarterLog.slotN?

At the same time we see this in the StartLog. This is a gap of the same amount of time, but does not appear to be the other side of that conversation.

03/26/19 09:38:56 (D_ALWAYS:2) slot1: Sent update to 1 collector(s)

03/26/19 09:39:23 (D_NETWORK) condor_read(fd=15 <127.0.0.1:56837>,,size=5,timeout=10,flags=0,non_blocking=0)

03/26/19 09:39:23 (D_NETWORK) condor_read(fd=15 <127.0.0.1:56837>,,size=137,timeout=10,flags=0,non_blocking=0)

03/26/19 09:39:23 (D_ALWAYS:2) slot1[6.0]: Received job ClassAd update from starter.

03/26/19 09:39:54 (D_NETWORK) ACCEPT bound to <192.168.2.1:51092> fd=10 peer=<192.168.2.1:56843>

03/26/19 09:39:54 (D_SECURITY) DaemonCommandProtocol: Not enough bytes are ready for read.

03/26/19 09:39:54 (D_NETWORK) condor_read(fd=10 <192.168.2.1:56843>,,size=5,timeout=1,flags=2,non_blocking=0)

03/26/19 09:39:54 (D_NETWORK) condor_read(fd=10 <192.168.2.1:56843>,,size=5,timeout=0,flags=0,non_blocking=1)

03/26/19 09:39:54 (D_NETWORK) condor_read(fd=10 <192.168.2.1:56843>,,size=747,timeout=0,flags=0,non_blocking=1)

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller
Sent: Tuesday, March 26, 2019 11:23 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

Still seeing a timeout.   in the ShadowLog, a 30 second gap

03/26/19 08:55:37 (D_NETWORK) condor_read(fd=7 <192.168.2.1:56589>,,size=5,timeout=20,flags=0,non_blocking=0)

03/26/19 08:55:37 (D_NETWORK) condor_read(): fd=7

03/26/19 08:55:57 (D_NETWORK) condor_read(): select returned 0

03/26/19 08:55:57 (D_ALWAYS) condor_read(): timeout reading 5 bytes from <192.168.2.1:56589>.

And a 20 second gap in the StarterLog, followed by a timeout.

03/26/19 08:55:37 (D_NETWORK) condor_read(fd=7 <192.168.2.1:56589>,,size=5,timeout=20,flags=0,non_blocking=0)

03/26/19 08:55:37 (D_NETWORK) condor_read(): fd=7

03/26/19 08:55:57 (D_NETWORK) condor_read(): select returned 0

Lets try turning the logging up even further.

please change your config to this, reconfig and retry

SHADOW_DEBUG = $(SHADOW_DEBUG) D_CAT D_FULLDEBUG D_NETWORK D_HOSTNAME D_SECURITY:1

STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_FULLDEBUG D_NETWORK D_HOSTNAME D_SECURITY:1

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 11:08 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

TJ,

Ok, added those commands to my condor_config.local. I also noticed that when I installed the new version, the condor_config file and the condor directory didnât have full permissions, so I changed that to user condor.

Attached are the ShadowLog and StartLog after the reconfig.

steve

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 11:39 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

This is starting to look like something broken in your network stack that is outside of HTCondor. The Sched starts the Shadow at 06:58:03

but the shadow doesnât actually begin logging until 06:58:33. The shadow doesnât do much on startup before it begins logging, but one of the things it does is to read the HTCondor configuration files. Are those files on a shared file system perhaps?

The shadow then contacts the startd at 06:58:33, but then times out 30 seconds later. During this time we see this in the StarterLog

03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed

03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.

A timeout after 20 seconds.   This looks a lot like your network is occasionally stalling for periods of 30 seconds, and this is leading HTCondor to time out a lot.

Letâs turn on some more logging and maybe get more insight the problem.

please add this to your configuration.

SHADOW_DEBUG = $(SHADOW_DEBUG) D_CAT D_FULLDEBUG D_NETWORK

STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_FULLDEBUG D_NETWORK

Then run condor_reconfig on both submit and execute machines and try again to run a job.

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 10:17 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

Hereâs a bit more time:

03/26/19 06:58:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>

03/26/19 06:58:03 (pid:51270) Using negotiation protocol: NEGOTIATE

03/26/19 06:58:03 (pid:51270) Negotiating for owner:condor@xxxxxxxxxxxxxxx

03/26/19 06:58:03 (pid:51270) Lost priority - 1 jobs matched

03/26/19 06:58:03 (pid:51270) Finished negotiating for condor in local pool: 1 matched, 0 rejected

03/26/19 06:58:03 (pid:51270) Starting add_shadow_birthdate(3.0)

03/26/19 06:58:03 (pid:51270) Started shadow for job 3.0 onslot1@xxxxxxxxxxxxxxxxxx <192.168.2.1:51092?addrs=192.168.2.1-51092> for condor, (shadow pid = 92037)

03/26/19 06:58:10 (pid:51270) Number of Active Workers 0

03/26/19 06:59:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>

03/26/19 06:59:03 (pid:51270) Using negotiation protocol: NEGOTIATE

03/26/19 06:59:03 (pid:51270) Negotiating for owner:condor@xxxxxxxxxxxxxxx

03/26/19 06:59:03 (pid:51270) Finished negotiating for condor in local pool: 0 matched, 0 rejected

03/26/19 06:59:03 (pid:51270) Shadow pid 92037 for job 3.0 exited with status 108

steve

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 11:05 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

What is happening in the SchedLog during this 20 second time period?

03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed

03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 9:42 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

Follow-up: firewall on the Mac is turned off.

steve

From: Stephen Upton <scupton@xxxxxxx>
Date: Tuesday, March 26, 2019 at 10:38 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

Here you go:

03/26/19 06:58:03 slot1: Request accepted.

03/26/19 06:58:33 slot1: Remote owner is condor@xxxxxxxxxxxxxxx

03/26/19 06:58:33 slot1: State change: claiming protocol successful

03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed

03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.

03/26/19 06:58:53 IO: Failed to read packet header

03/26/19 06:58:53 Can't read ClaimId

03/26/19 06:59:03 slot1: Called deactivate_claim()

03/26/19 06:59:03 slot1: State change: received RELEASE_CLAIM command

03/26/19 06:59:03 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating

03/26/19 06:59:03 slot1: State change: No preempting claim, returning to owner

03/26/19 06:59:03 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle

03/26/19 06:59:03 slot1: State change: IS_OWNER is false

03/26/19 06:59:03 slot1: Changing state: Owner -> Unclaimed

I also have the condor directory shared (or at least with group rw). Iâm also running condor as user condor and not as root, if that helps.

steve

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 10:34 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

Yes. this looks a lot like a failure to authorize the connection, or possibly a firewall.

03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5

03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed

03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd

What does the StartLog and/or StarterLog show at 6:58/6:59 on reaper.ern.nps.edu?

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 9:09 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

TJ,

Thanks for responding!

Are you running 8.8.1 on all of the nodes? No, but Iâm only testing on my central manager now. I want to get that correct before updating my Windows nodes. The job(s) Iâm submitting are targeted to the central manager.

Does the output of condor_status, show all of your execute nodes? It shows all the slots on my central manager.

Does condor_q show your jobs? Yes.

I did notice that it was taking exactly 30 seconds for condor to respond to a condor_q, condor_status, and condor_submit, so I think that helps confirm my suspicion that there is a setting somewhere I donât have correct.

Hereâs the (partial) output from the ShadowLog:

03/26/19 06:58:33 Using local config sources:

03/26/19 06:58:33   /usr/local/condor/local.reaper/condor_config.local

03/26/19 06:58:33 config Macros = 70, Sorted = 70, StringBytes = 2008, TablesBytes = 1168

03/26/19 06:58:33 CLASSAD_CACHING is OFF

03/26/19 06:58:33 Daemon Log is logging: D_ALWAYS D_ERROR

03/26/19 06:58:33 Daemoncore: Listening at <0.0.0.0:56067> on TCP (ReliSock).

03/26/19 06:58:33 DaemonCore: command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067&noUDP>

03/26/19 06:58:33 DaemonCore: private command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067>

03/26/19 06:58:33 Initializing a VANILLA shadow for job 3.0

03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5

03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed

03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd

03/26/19 06:59:03 (3.0) (92037): Job 3.0 is being evicted fromslot1@xxxxxxxxxxxxxxxxxx

03/26/19 06:59:03 (3.0) (92037): logEvictEvent with unknown reason (108), not logging.

03/26/19 06:59:03 (3.0) (92037): **** condor_shadow (condor_SHADOW) pid 92037 EXITING WITH STATUS 108

Here is a copy of my condor_config.local as well.

##  Where have you installed the bin, sbin and lib condor directories?

RELEASE_DIR = /usr/local/condor

ENABLE_IPV6 = False

USE_SHARED_PORT = False

##  Where is the local condor directory for each host?  This is where the local config file(s), logs and

##  spool/execute directories are located. this is the default for Linux and Unix systems.

##  this is the default on Windows sytems

NETWORK_HOSTNAME = reaper.ern.nps.edu

NETWORK_INTERFACE = 192.168.2.1

LOCAL_DIR = /usr/local/condor/local.reaper

UID_DOMAIN = reaper.localnet

JAVA = /usr/bin/java

CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxxx

MAIL = /usr/bin/mail

FILESYSTEM_DOMAIN = reaper.localnet

LOCK = /tmp/condor-lock.0.21512031190014

JAVA_MAXHEAP_ARGUMENT = -Xmx1024m

DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD

CONDOR_HOST = 192.168.2.1

CONDOR_IDS = 503.20

NUM_CPUS = 15

SEC_DEFAULT_NEGOTIATION = OPTIONAL

HOSTALLOW_WRITE = 192.168.2.*

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 9:40 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

Ok. We need a bit more information in order to figure out what is happening. Lets start with the basics.

Are you running 8.8.1 on all of the nodes?

Does the output of condor_status, show all of your execute nodes?

Does condor_q show your jobs?

if the jobs are getting matches, but failing to start, then the place to look is in the ShadowLog

on the submit machine.   run

condor_config_val shadow_log

on the submit node to find out where that is. You should expect to see messages indicating that a condor_shadow

has started up, and then it will identify what job it is attempting to run.

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Monday, March 25, 2019 4:30 PM
To: htcondor-users@xxxxxxxxxxx
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: [HTCondor-users] Condor running slow (q,status,submit)

Hi all,

HTCondor is running really slow, and now itâs not accepting jobs, i.e., when I do a better-analyze, it matches, then subsequently rejects. I had 8.6.1 installed, removed that, and installed 8.8.1, hoping that was the problem. The central manager is on Mac OS10, with several Windows execute nodes. I also have the Mac as an execute and submit node. Iâm sure this is a configuration issue somewhere, but I canât figure out where. I do get a âinit_local_hostname_impl: ipv6_getaddrinfo() could not look upâ in my MasterLog, but I have ENABLE_IPV6 disabled.

Thanx

steve

Stephen C. Upton

Faculty Associate - Research

SEED (Simulation Experiments & Efficient Designs) Center for Data Farming

Operations Research Department

Naval Postgraduate School

Mobile: 804-994-4257

NIPR: scupton@xxxxxxx

SIPR: uptonsc@xxxxxxxxxxxxxxxxx

SEED Center website: https://harvest.nps.edu

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

Mailing List Archives

Public Access

Re: [HTCondor-users] Condor running slow (q,status,submit)