[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Condor running slow (q,status,submit)



Hi All,

 

Iâm hoping this wasnât a wild goose chase, but I was informed:

There was a problem with the service order for network connections (reaper has two connections, one internal and one external).

 

We were also having issues connecting to the outside on our central manager. After we switched the service order, things are now running fine, condor is very responsive. The ProcLog file permissions were also too constrained. I will now update and connect our Windows machines.

 

Thanks all!

steve

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Jaime Frey <jfrey@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 5:55 PM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

I just looked at the latest log files, and it looks like the current 30 second delay is in the condor_starter thatâs launched by the condor_startd. The starter inherits the network socket made by the shadow from the startd. So now we should look in the StarterLog.slot1.

 

 - Jaime



On Mar 26, 2019, at 3:48 PM, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

 

There was a problem with the size of the logs you posted to this list, but I do have them now. 

 

This is getting beyond my knowledge base.

 

I see a 30 second gap in the ShadowLog between 2 very small reads from the same socket here. 

 

03/26/19 09:39:24 (D_NETWORK) (6.0) (874): condor_read(fd=6 <192.168.2.1:56841>,,size=5,timeout=0,flags=0,non_blocking=0)

03/26/19 09:39:54 (D_NETWORK) (6.0) (874): condor_read(fd=6 <192.168.2.1:56841>,,size=35,timeout=0,flags=0,non_blocking=0)

 

I donât think this is a connection to the STARTD, because I donât see port 56841 mentioned in the StartLog. Perhaps this is in the StarterLog? or StarterLog.slotN?

 

At the same time we see this in the StartLog.  This is a gap of the same amount of time, but does not appear to be the other side of that conversation.

 

03/26/19 09:38:56 (D_ALWAYS:2) slot1: Sent update to 1 collector(s)

03/26/19 09:39:23 (D_NETWORK) condor_read(fd=15 <127.0.0.1:56837>,,size=5,timeout=10,flags=0,non_blocking=0)

03/26/19 09:39:23 (D_NETWORK) condor_read(fd=15 <127.0.0.1:56837>,,size=137,timeout=10,flags=0,non_blocking=0)

03/26/19 09:39:23 (D_ALWAYS:2) slot1[6.0]: Received job ClassAd update from starter.

03/26/19 09:39:54 (D_NETWORK) ACCEPT bound to <192.168.2.1:51092> fd=10 peer=<192.168.2.1:56843>

03/26/19 09:39:54 (D_SECURITY) DaemonCommandProtocol: Not enough bytes are ready for read.

03/26/19 09:39:54 (D_NETWORK) condor_read(fd=10 <192.168.2.1:56843>,,size=5,timeout=1,flags=2,non_blocking=0)

03/26/19 09:39:54 (D_NETWORK) condor_read(fd=10 <192.168.2.1:56843>,,size=5,timeout=0,flags=0,non_blocking=1)

03/26/19 09:39:54 (D_NETWORK) condor_read(fd=10 <192.168.2.1:56843>,,size=747,timeout=0,flags=0,non_blocking=1)

 

 

 

-tj

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller
Sent: Tuesday, March 26, 2019 11:23 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

Still seeing a timeout.   in the ShadowLog, a 30 second gap

 

03/26/19 08:55:37 (D_NETWORK) condor_read(fd=7 <192.168.2.1:56589>,,size=5,timeout=20,flags=0,non_blocking=0)

03/26/19 08:55:37 (D_NETWORK) condor_read(): fd=7

03/26/19 08:55:57 (D_NETWORK) condor_read(): select returned 0

03/26/19 08:55:57 (D_ALWAYS) condor_read(): timeout reading 5 bytes from <192.168.2.1:56589>.

 

And a 20 second gap in the StarterLog, followed by a timeout.

 

03/26/19 08:55:37 (D_NETWORK) condor_read(fd=7 <192.168.2.1:56589>,,size=5,timeout=20,flags=0,non_blocking=0)

03/26/19 08:55:37 (D_NETWORK) condor_read(): fd=7

03/26/19 08:55:57 (D_NETWORK) condor_read(): select returned 0

 

Lets try turning the logging up even further.

 

please change your config to this, reconfig and retry

 

SHADOW_DEBUG = $(SHADOW_DEBUG) D_CAT D_FULLDEBUG D_NETWORK D_HOSTNAME D_SECURITY:1

STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_FULLDEBUG D_NETWORK D_HOSTNAME D_SECURITY:1

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 11:08 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

TJ,

 

Ok, added those commands to my condor_config.local. I also noticed that when I installed the new version, the condor_config file and the condor directory didnât have full permissions, so I changed that to user condor.

 

Attached are the ShadowLog and StartLog after the reconfig.

 

steve

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 11:39 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

This is starting to look like something broken in your network stack that is outside of HTCondor.   The Sched starts the Shadow at 06:58:03

but the shadow doesnât actually begin logging until 06:58:33.  The shadow doesnât do much on startup before it begins logging, but one of the things it does is to read the HTCondor configuration files.  Are those files on a shared file system perhaps?

 

The shadow then contacts the startd at 06:58:33, but then times out 30 seconds later.  During this time we see this in the StarterLog

 

03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed

03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.

 

A timeout after 20 seconds.   This looks a lot like your network is occasionally stalling for periods of 30 seconds, and this is leading HTCondor to time out a lot.

 

Letâs turn on some more logging and maybe get more insight  the problem.

 

please add this to your configuration.

 

SHADOW_DEBUG = $(SHADOW_DEBUG) D_CAT D_FULLDEBUG D_NETWORK

STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_FULLDEBUG D_NETWORK

 

Then run condor_reconfig on both submit and execute machines and try again to run a job.

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 10:17 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

Hereâs a bit more time:

 

03/26/19 06:58:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>

03/26/19 06:58:03 (pid:51270) Using negotiation protocol: NEGOTIATE

03/26/19 06:58:03 (pid:51270) Negotiating for owner:condor@xxxxxxxxxxxxxxx

03/26/19 06:58:03 (pid:51270) Lost priority - 1 jobs matched

03/26/19 06:58:03 (pid:51270) Finished negotiating for condor in local pool: 1 matched, 0 rejected

03/26/19 06:58:03 (pid:51270) Starting add_shadow_birthdate(3.0)

03/26/19 06:58:03 (pid:51270) Started shadow for job 3.0 onslot1@xxxxxxxxxxxxxxxxxx <192.168.2.1:51092?addrs=192.168.2.1-51092> for condor, (shadow pid = 92037)

03/26/19 06:58:10 (pid:51270) Number of Active Workers 0

03/26/19 06:59:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>

03/26/19 06:59:03 (pid:51270) Using negotiation protocol: NEGOTIATE

03/26/19 06:59:03 (pid:51270) Negotiating for owner:condor@xxxxxxxxxxxxxxx

03/26/19 06:59:03 (pid:51270) Finished negotiating for condor in local pool: 0 matched, 0 rejected

03/26/19 06:59:03 (pid:51270) Shadow pid 92037 for job 3.0 exited with status 108

 

steve

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 11:05 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

What is happening in the SchedLog during this 20 second time period?

 

03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed

03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.

 

-tj

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 9:42 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

Follow-up: firewall on the Mac is turned off.

 

steve

 

From: Stephen Upton <scupton@xxxxxxx>
Date: Tuesday, March 26, 2019 at 10:38 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

Here you go:

 

03/26/19 06:58:03 slot1: Request accepted.

03/26/19 06:58:33 slot1: Remote owner is condor@xxxxxxxxxxxxxxx

03/26/19 06:58:33 slot1: State change: claiming protocol successful

03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed

03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.

03/26/19 06:58:53 IO: Failed to read packet header

03/26/19 06:58:53 Can't read ClaimId

03/26/19 06:59:03 slot1: Called deactivate_claim()

03/26/19 06:59:03 slot1: State change: received RELEASE_CLAIM command

03/26/19 06:59:03 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating

03/26/19 06:59:03 slot1: State change: No preempting claim, returning to owner

03/26/19 06:59:03 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle

03/26/19 06:59:03 slot1: State change: IS_OWNER is false

03/26/19 06:59:03 slot1: Changing state: Owner -> Unclaimed

 

I also have the condor directory shared (or at least with group rw). Iâm also running condor as user condor and not as root, if that helps.

 

steve

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 10:34 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

Yes. this looks a lot like a failure to authorize the connection, or possibly a firewall.

 

03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5

03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed

03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd

 

What does the StartLog and/or StarterLog show at 6:58/6:59 on reaper.ern.nps.edu?

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 9:09 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

TJ,

 

Thanks for responding!

 

Are you running 8.8.1 on all of the nodes? No, but Iâm only testing on my central manager now. I want to get that correct before updating my Windows nodes. The job(s) Iâm submitting are targeted to the central manager.

Does the output of  condor_status, show all of your execute nodes? It shows all the slots on my central manager.

Does condor_q show your jobs? Yes.

 

I did notice that it was taking exactly 30 seconds for condor to respond to a condor_q, condor_status, and condor_submit, so I think that helps confirm my suspicion that there is a setting somewhere I donât have correct.

 

Hereâs the (partial) output from the ShadowLog:

03/26/19 06:58:33 Using local config sources:

03/26/19 06:58:33    /usr/local/condor/local.reaper/condor_config.local

03/26/19 06:58:33 config Macros = 70, Sorted = 70, StringBytes = 2008, TablesBytes = 1168

03/26/19 06:58:33 CLASSAD_CACHING is OFF

03/26/19 06:58:33 Daemon Log is logging: D_ALWAYS D_ERROR

03/26/19 06:58:33 Daemoncore: Listening at <0.0.0.0:56067> on TCP (ReliSock).

03/26/19 06:58:33 DaemonCore: command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067&noUDP>

03/26/19 06:58:33 DaemonCore: private command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067>

03/26/19 06:58:33 Initializing a VANILLA shadow for job 3.0

03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5

03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed

03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd

03/26/19 06:59:03 (3.0) (92037): Job 3.0 is being evicted fromslot1@xxxxxxxxxxxxxxxxxx

03/26/19 06:59:03 (3.0) (92037): logEvictEvent with unknown reason (108), not logging.

03/26/19 06:59:03 (3.0) (92037): **** condor_shadow (condor_SHADOW) pid 92037 EXITING WITH STATUS 108

 

 

Here is a copy of my condor_config.local as well.

 

##  Where have you installed the bin, sbin and lib condor directories?

 

RELEASE_DIR = /usr/local/condor

ENABLE_IPV6 = False

 

USE_SHARED_PORT = False

 

##  Where is the local condor directory for each host?  This is where the local config file(s), logs and

##  spool/execute directories are located. this is the default for Linux and Unix systems.

##  this is the default on Windows sytems

NETWORK_HOSTNAME = reaper.ern.nps.edu

 

NETWORK_INTERFACE = 192.168.2.1

 

LOCAL_DIR = /usr/local/condor/local.reaper

 

UID_DOMAIN = reaper.localnet

 

JAVA = /usr/bin/java

 

CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxxx

 

MAIL = /usr/bin/mail

 

FILESYSTEM_DOMAIN = reaper.localnet

 

LOCK = /tmp/condor-lock.0.21512031190014

 

JAVA_MAXHEAP_ARGUMENT = -Xmx1024m

 

DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD

 

CONDOR_HOST = 192.168.2.1

 

CONDOR_IDS = 503.20

 

NUM_CPUS = 15

 

SEC_DEFAULT_NEGOTIATION = OPTIONAL

 

HOSTALLOW_WRITE = 192.168.2.*

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 9:40 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

Ok.  We need a bit more information in order to figure out what is happening.  Lets start with the basics.

 

Are you running 8.8.1 on all of the nodes?

Does the output of  condor_status, show all of your execute nodes?

Does condor_q show your jobs?

 

if the jobs are getting matches, but failing to start, then the place to look is in the ShadowLog

on the submit machine.   run

 

condor_config_val shadow_log

 

on the submit node to find out where that is.  You should expect to see messages indicating that a condor_shadow

has started up, and then it will identify what job it is attempting to run.

 

-tj

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Monday, March 25, 2019 4:30 PM
To: htcondor-users@xxxxxxxxxxx
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: [HTCondor-users] Condor running slow (q,status,submit)

 

Hi all,

 

HTCondor is running really slow, and now itâs not accepting jobs, i.e., when I do a better-analyze, it matches, then subsequently rejects. I had 8.6.1 installed, removed that, and installed 8.8.1, hoping that was the problem. The central manager is on Mac OS10, with several Windows execute nodes. I also have the Mac as an execute and submit node. Iâm sure this is a configuration issue somewhere, but I canât figure out where. I do get a âinit_local_hostname_impl: ipv6_getaddrinfo() could not look upâ in my MasterLog, but I have ENABLE_IPV6 disabled.

 

Thanx

steve

 

Stephen C. Upton

Faculty Associate - Research

SEED (Simulation Experiments & Efficient Designs) Center for Data Farming

Operations Research Department

Naval Postgraduate School

Mobile: 804-994-4257

SEED Center website: https://harvest.nps.edu

 

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/