[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Condor running slow (q,status,submit)



Still seeing a timeout.   in the ShadowLog, a 30 second gap

 

03/26/19 08:55:37 (D_NETWORK) condor_read(fd=7 <192.168.2.1:56589>,,size=5,timeout=20,flags=0,non_blocking=0)

03/26/19 08:55:37 (D_NETWORK) condor_read(): fd=7

03/26/19 08:55:57 (D_NETWORK) condor_read(): select returned 0

03/26/19 08:55:57 (D_ALWAYS) condor_read(): timeout reading 5 bytes from <192.168.2.1:56589>.

 

And a 20 second gap in the StarterLog, followed by a timeout.

 

03/26/19 08:55:37 (D_NETWORK) condor_read(fd=7 <192.168.2.1:56589>,,size=5,timeout=20,flags=0,non_blocking=0)

03/26/19 08:55:37 (D_NETWORK) condor_read(): fd=7

03/26/19 08:55:57 (D_NETWORK) condor_read(): select returned 0

 

Lets try turning the logging up even further.

 

please change your config to this, reconfig and retry

 

SHADOW_DEBUG = $(SHADOW_DEBUG) D_CAT D_FULLDEBUG D_NETWORK D_HOSTNAME D_SECURITY:1

STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_FULLDEBUG D_NETWORK D_HOSTNAME D_SECURITY:1

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 11:08 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

TJ,

 

Ok, added those commands to my condor_config.local. I also noticed that when I installed the new version, the condor_config file and the condor directory didnât have full permissions, so I changed that to user condor.

 

Attached are the ShadowLog and StartLog after the reconfig.

 

steve

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 11:39 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

This is starting to look like something broken in your network stack that is outside of HTCondor.   The Sched starts the Shadow at 06:58:03

but the shadow doesnât actually begin logging until 06:58:33.  The shadow doesnât do much on startup before it begins logging, but one of the things it does is to read the HTCondor configuration files.  Are those files on a shared file system perhaps?

 

The shadow then contacts the startd at 06:58:33, but then times out 30 seconds later.  During this time we see this in the StarterLog

 

03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed

03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.

 

A timeout after 20 seconds.   This looks a lot like your network is occasionally stalling for periods of 30 seconds, and this is leading HTCondor to time out a lot.

 

Letâs turn on some more logging and maybe get more insight  the problem.

 

please add this to your configuration.

 

SHADOW_DEBUG = $(SHADOW_DEBUG) D_CAT D_FULLDEBUG D_NETWORK

STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_FULLDEBUG D_NETWORK

 

Then run condor_reconfig on both submit and execute machines and try again to run a job.

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 10:17 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

Hereâs a bit more time:

 

03/26/19 06:58:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>

03/26/19 06:58:03 (pid:51270) Using negotiation protocol: NEGOTIATE

03/26/19 06:58:03 (pid:51270) Negotiating for owner: condor@xxxxxxxxxxxxxxx

03/26/19 06:58:03 (pid:51270) Lost priority - 1 jobs matched

03/26/19 06:58:03 (pid:51270) Finished negotiating for condor in local pool: 1 matched, 0 rejected

03/26/19 06:58:03 (pid:51270) Starting add_shadow_birthdate(3.0)

03/26/19 06:58:03 (pid:51270) Started shadow for job 3.0 on slot1@xxxxxxxxxxxxxxxxxx <192.168.2.1:51092?addrs=192.168.2.1-51092> for condor, (shadow pid = 92037)

03/26/19 06:58:10 (pid:51270) Number of Active Workers 0

03/26/19 06:59:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>

03/26/19 06:59:03 (pid:51270) Using negotiation protocol: NEGOTIATE

03/26/19 06:59:03 (pid:51270) Negotiating for owner: condor@xxxxxxxxxxxxxxx

03/26/19 06:59:03 (pid:51270) Finished negotiating for condor in local pool: 0 matched, 0 rejected

03/26/19 06:59:03 (pid:51270) Shadow pid 92037 for job 3.0 exited with status 108

 

steve

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 11:05 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

What is happening in the SchedLog during this 20 second time period?

 

03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed

03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.

 

-tj

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 9:42 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

Follow-up: firewall on the Mac is turned off.

 

steve

 

From: Stephen Upton <scupton@xxxxxxx>
Date: Tuesday, March 26, 2019 at 10:38 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

Here you go:

 

03/26/19 06:58:03 slot1: Request accepted.

03/26/19 06:58:33 slot1: Remote owner is condor@xxxxxxxxxxxxxxx

03/26/19 06:58:33 slot1: State change: claiming protocol successful

03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed

03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.

03/26/19 06:58:53 IO: Failed to read packet header

03/26/19 06:58:53 Can't read ClaimId

03/26/19 06:59:03 slot1: Called deactivate_claim()

03/26/19 06:59:03 slot1: State change: received RELEASE_CLAIM command

03/26/19 06:59:03 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating

03/26/19 06:59:03 slot1: State change: No preempting claim, returning to owner

03/26/19 06:59:03 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle

03/26/19 06:59:03 slot1: State change: IS_OWNER is false

03/26/19 06:59:03 slot1: Changing state: Owner -> Unclaimed

 

I also have the condor directory shared (or at least with group rw). Iâm also running condor as user condor and not as root, if that helps.

 

steve

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 10:34 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

Yes. this looks a lot like a failure to authorize the connection, or possibly a firewall.

 

03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5

03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed

03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd

 

What does the StartLog and/or StarterLog show at 6:58/6:59 on reaper.ern.nps.edu?

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 9:09 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

TJ,

 

Thanks for responding!

 

Are you running 8.8.1 on all of the nodes? No, but Iâm only testing on my central manager now. I want to get that correct before updating my Windows nodes. The job(s) Iâm submitting are targeted to the central manager.

Does the output of  condor_status, show all of your execute nodes? It shows all the slots on my central manager.

Does condor_q show your jobs? Yes.

 

I did notice that it was taking exactly 30 seconds for condor to respond to a condor_q, condor_status, and condor_submit, so I think that helps confirm my suspicion that there is a setting somewhere I donât have correct.

 

Hereâs the (partial) output from the ShadowLog:

03/26/19 06:58:33 Using local config sources:

03/26/19 06:58:33    /usr/local/condor/local.reaper/condor_config.local

03/26/19 06:58:33 config Macros = 70, Sorted = 70, StringBytes = 2008, TablesBytes = 1168

03/26/19 06:58:33 CLASSAD_CACHING is OFF

03/26/19 06:58:33 Daemon Log is logging: D_ALWAYS D_ERROR

03/26/19 06:58:33 Daemoncore: Listening at <0.0.0.0:56067> on TCP (ReliSock).

03/26/19 06:58:33 DaemonCore: command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067&noUDP>

03/26/19 06:58:33 DaemonCore: private command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067>

03/26/19 06:58:33 Initializing a VANILLA shadow for job 3.0

03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5

03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed

03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd

03/26/19 06:59:03 (3.0) (92037): Job 3.0 is being evicted from slot1@xxxxxxxxxxxxxxxxxx

03/26/19 06:59:03 (3.0) (92037): logEvictEvent with unknown reason (108), not logging.

03/26/19 06:59:03 (3.0) (92037): **** condor_shadow (condor_SHADOW) pid 92037 EXITING WITH STATUS 108

 

 

Here is a copy of my condor_config.local as well.

 

##  Where have you installed the bin, sbin and lib condor directories?

 

RELEASE_DIR = /usr/local/condor

ENABLE_IPV6 = False

 

USE_SHARED_PORT = False

 

##  Where is the local condor directory for each host?  This is where the local config file(s), logs and

##  spool/execute directories are located. this is the default for Linux and Unix systems.

##  this is the default on Windows sytems

NETWORK_HOSTNAME = reaper.ern.nps.edu

 

NETWORK_INTERFACE = 192.168.2.1

 

LOCAL_DIR = /usr/local/condor/local.reaper

 

UID_DOMAIN = reaper.localnet

 

JAVA = /usr/bin/java

 

CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxxx

 

MAIL = /usr/bin/mail

 

FILESYSTEM_DOMAIN = reaper.localnet

 

LOCK = /tmp/condor-lock.0.21512031190014

 

JAVA_MAXHEAP_ARGUMENT = -Xmx1024m

 

DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD

 

CONDOR_HOST = 192.168.2.1

 

CONDOR_IDS = 503.20

 

NUM_CPUS = 15

 

SEC_DEFAULT_NEGOTIATION = OPTIONAL

 

HOSTALLOW_WRITE = 192.168.2.*

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 9:40 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

Ok.  We need a bit more information in order to figure out what is happening.  Lets start with the basics.

 

Are you running 8.8.1 on all of the nodes?

Does the output of  condor_status, show all of your execute nodes?

Does condor_q show your jobs?

 

if the jobs are getting matches, but failing to start, then the place to look is in the ShadowLog

on the submit machine.   run

 

condor_config_val shadow_log

 

on the submit node to find out where that is.  You should expect to see messages indicating that a condor_shadow

has started up, and then it will identify what job it is attempting to run.

 

-tj

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Monday, March 25, 2019 4:30 PM
To: htcondor-users@xxxxxxxxxxx
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: [HTCondor-users] Condor running slow (q,status,submit)

 

Hi all,

 

HTCondor is running really slow, and now itâs not accepting jobs, i.e., when I do a better-analyze, it matches, then subsequently rejects. I had 8.6.1 installed, removed that, and installed 8.8.1, hoping that was the problem. The central manager is on Mac OS10, with several Windows execute nodes. I also have the Mac as an execute and submit node. Iâm sure this is a configuration issue somewhere, but I canât figure out where. I do get a âinit_local_hostname_impl: ipv6_getaddrinfo() could not look upâ in my MasterLog, but I have ENABLE_IPV6 disabled.

 

Thanx

steve

 

Stephen C. Upton

Faculty Associate - Research

SEED (Simulation Experiments & Efficient Designs) Center for Data Farming

Operations Research Department

Naval Postgraduate School

Mobile: 804-994-4257

NIPR: scupton@xxxxxxx

SIPR: uptonsc@xxxxxxxxxxxxxxxxx

SEED Center website: https://harvest.nps.edu