[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Condor running slow (q,status,submit)



Hi Jaime,

 

I added a mapping 192.168.2.1 to reaper.ern.nps.edu in my machineâs /etc/host file and jobs are at least getting submitted, although there is still a delay, but about 10 secs instead of the 30. Iâm looking at my CollectorLog, and I see these entries:

 

03/26/19 11:12:25 MasterAd     : Inserting ** "< grim-pro-04.reaper.localnet >"

03/26/19 11:12:25 MasterAd     : Inserting ** "< grim-pro-01.reaper.localnet >"

03/26/19 11:22:45 SubmittorAd  : Inserting ** "< condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx , 192.168.2.1 >"

03/26/19 11:23:15 AccountingAd  : Inserting ** "< condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx >"

 

 

Where/why would condor be appending âreaper.ern.nps.eduâ to the Ad, and might that interfere/delay a DNS lookup?

 

Thanks for getting me this far!! We at least have jobs running, albeit still slowly being submitted.

 

steve

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Stephen Upton <scupton@xxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 2:02 PM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

HI Jaime,

 

Iâm exposing my network ignorance here, but itâs the same machine I have condor running on and submitting jobs to, i.e., there is no âremote machineâ. I have some idea Iâm supposed to map name to ip address, e.g., /etc/hosts, but only guessing? I realize weâve probably gone off topic for this list and Iâll have to do some more googling.

 

steve

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Jaime Frey <jfrey@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 1:40 PM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

The 30 second delay in the shadow log happens right when it evaluates the security settings to decide if it should be talking to the remote machine. This can trigger a number of DNS queries to determine how any hostnames set in the configuration resolve to IP addresses. This is probably a DNS issue.

 

 - Jaime




On Mar 26, 2019, at 11:23 AM, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

 

Still seeing a timeout.   in the ShadowLog, a 30 second gap

 

03/26/19 08:55:37 (D_NETWORK) condor_read(fd=7 <192.168.2.1:56589>,,size=5,timeout=20,flags=0,non_blocking=0)

03/26/19 08:55:37 (D_NETWORK) condor_read(): fd=7

03/26/19 08:55:57 (D_NETWORK) condor_read(): select returned 0

03/26/19 08:55:57 (D_ALWAYS) condor_read(): timeout reading 5 bytes from <192.168.2.1:56589>.

 

And a 20 second gap in the StarterLog, followed by a timeout.

 

03/26/19 08:55:37 (D_NETWORK) condor_read(fd=7 <192.168.2.1:56589>,,size=5,timeout=20,flags=0,non_blocking=0)

03/26/19 08:55:37 (D_NETWORK) condor_read(): fd=7

03/26/19 08:55:57 (D_NETWORK) condor_read(): select returned 0

 

Lets try turning the logging up even further.

 

please change your config to this, reconfig and retry

 

SHADOW_DEBUG = $(SHADOW_DEBUG) D_CAT D_FULLDEBUG D_NETWORK D_HOSTNAME D_SECURITY:1

STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_FULLDEBUG D_NETWORK D_HOSTNAME D_SECURITY:1

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 11:08 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

TJ,

 

Ok, added those commands to my condor_config.local. I also noticed that when I installed the new version, the condor_config file and the condor directory didnât have full permissions, so I changed that to user condor.

 

Attached are the ShadowLog and StartLog after the reconfig.

 

steve

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 11:39 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

This is starting to look like something broken in your network stack that is outside of HTCondor.   The Sched starts the Shadow at 06:58:03

but the shadow doesnât actually begin logging until 06:58:33.  The shadow doesnât do much on startup before it begins logging, but one of the things it does is to read the HTCondor configuration files.  Are those files on a shared file system perhaps?

 

The shadow then contacts the startd at 06:58:33, but then times out 30 seconds later.  During this time we see this in the StarterLog

 

03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed

03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.

 

A timeout after 20 seconds.   This looks a lot like your network is occasionally stalling for periods of 30 seconds, and this is leading HTCondor to time out a lot.

 

Letâs turn on some more logging and maybe get more insight  the problem.

 

please add this to your configuration.

 

SHADOW_DEBUG = $(SHADOW_DEBUG) D_CAT D_FULLDEBUG D_NETWORK

STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_FULLDEBUG D_NETWORK

 

Then run condor_reconfig on both submit and execute machines and try again to run a job.

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 10:17 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

Hereâs a bit more time:

 

03/26/19 06:58:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>

03/26/19 06:58:03 (pid:51270) Using negotiation protocol: NEGOTIATE

03/26/19 06:58:03 (pid:51270) Negotiating for owner:condor@xxxxxxxxxxxxxxx

03/26/19 06:58:03 (pid:51270) Lost priority - 1 jobs matched

03/26/19 06:58:03 (pid:51270) Finished negotiating for condor in local pool: 1 matched, 0 rejected

03/26/19 06:58:03 (pid:51270) Starting add_shadow_birthdate(3.0)

03/26/19 06:58:03 (pid:51270) Started shadow for job 3.0 onslot1@xxxxxxxxxxxxxxxxxx <192.168.2.1:51092?addrs=192.168.2.1-51092> for condor, (shadow pid = 92037)

03/26/19 06:58:10 (pid:51270) Number of Active Workers 0

03/26/19 06:59:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>

03/26/19 06:59:03 (pid:51270) Using negotiation protocol: NEGOTIATE

03/26/19 06:59:03 (pid:51270) Negotiating for owner:condor@xxxxxxxxxxxxxxx

03/26/19 06:59:03 (pid:51270) Finished negotiating for condor in local pool: 0 matched, 0 rejected

03/26/19 06:59:03 (pid:51270) Shadow pid 92037 for job 3.0 exited with status 108

 

steve

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 11:05 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

What is happening in the SchedLog during this 20 second time period?

 

03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed

03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.

 

-tj

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 9:42 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

Follow-up: firewall on the Mac is turned off.

 

steve

 

From: Stephen Upton <scupton@xxxxxxx>
Date: Tuesday, March 26, 2019 at 10:38 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

Here you go:

 

03/26/19 06:58:03 slot1: Request accepted.

03/26/19 06:58:33 slot1: Remote owner is condor@xxxxxxxxxxxxxxx

03/26/19 06:58:33 slot1: State change: claiming protocol successful

03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed

03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.

03/26/19 06:58:53 IO: Failed to read packet header

03/26/19 06:58:53 Can't read ClaimId

03/26/19 06:59:03 slot1: Called deactivate_claim()

03/26/19 06:59:03 slot1: State change: received RELEASE_CLAIM command

03/26/19 06:59:03 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating

03/26/19 06:59:03 slot1: State change: No preempting claim, returning to owner

03/26/19 06:59:03 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle

03/26/19 06:59:03 slot1: State change: IS_OWNER is false

03/26/19 06:59:03 slot1: Changing state: Owner -> Unclaimed

 

I also have the condor directory shared (or at least with group rw). Iâm also running condor as user condor and not as root, if that helps.

 

steve

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 10:34 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

Yes. this looks a lot like a failure to authorize the connection, or possibly a firewall.

 

03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5

03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed

03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd

 

What does the StartLog and/or StarterLog show at 6:58/6:59 on reaper.ern.nps.edu?

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Tuesday, March 26, 2019 9:09 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

TJ,

 

Thanks for responding!

 

Are you running 8.8.1 on all of the nodes? No, but Iâm only testing on my central manager now. I want to get that correct before updating my Windows nodes. The job(s) Iâm submitting are targeted to the central manager.

Does the output of  condor_status, show all of your execute nodes? It shows all the slots on my central manager.

Does condor_q show your jobs? Yes.

 

I did notice that it was taking exactly 30 seconds for condor to respond to a condor_q, condor_status, and condor_submit, so I think that helps confirm my suspicion that there is a setting somewhere I donât have correct.

 

Hereâs the (partial) output from the ShadowLog:

03/26/19 06:58:33 Using local config sources:

03/26/19 06:58:33    /usr/local/condor/local.reaper/condor_config.local

03/26/19 06:58:33 config Macros = 70, Sorted = 70, StringBytes = 2008, TablesBytes = 1168

03/26/19 06:58:33 CLASSAD_CACHING is OFF

03/26/19 06:58:33 Daemon Log is logging: D_ALWAYS D_ERROR

03/26/19 06:58:33 Daemoncore: Listening at <0.0.0.0:56067> on TCP (ReliSock).

03/26/19 06:58:33 DaemonCore: command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067&noUDP>

03/26/19 06:58:33 DaemonCore: private command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067>

03/26/19 06:58:33 Initializing a VANILLA shadow for job 3.0

03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5

03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed

03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd

03/26/19 06:59:03 (3.0) (92037): Job 3.0 is being evicted fromslot1@xxxxxxxxxxxxxxxxxx

03/26/19 06:59:03 (3.0) (92037): logEvictEvent with unknown reason (108), not logging.

03/26/19 06:59:03 (3.0) (92037): **** condor_shadow (condor_SHADOW) pid 92037 EXITING WITH STATUS 108

 

 

Here is a copy of my condor_config.local as well.

 

##  Where have you installed the bin, sbin and lib condor directories?

 

RELEASE_DIR = /usr/local/condor

ENABLE_IPV6 = False

 

USE_SHARED_PORT = False

 

##  Where is the local condor directory for each host?  This is where the local config file(s), logs and

##  spool/execute directories are located. this is the default for Linux and Unix systems.

##  this is the default on Windows sytems

NETWORK_HOSTNAME = reaper.ern.nps.edu

 

NETWORK_INTERFACE = 192.168.2.1

 

LOCAL_DIR = /usr/local/condor/local.reaper

 

UID_DOMAIN = reaper.localnet

 

JAVA = /usr/bin/java

 

CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxxx

 

MAIL = /usr/bin/mail

 

FILESYSTEM_DOMAIN = reaper.localnet

 

LOCK = /tmp/condor-lock.0.21512031190014

 

JAVA_MAXHEAP_ARGUMENT = -Xmx1024m

 

DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD

 

CONDOR_HOST = 192.168.2.1

 

CONDOR_IDS = 503.20

 

NUM_CPUS = 15

 

SEC_DEFAULT_NEGOTIATION = OPTIONAL

 

HOSTALLOW_WRITE = 192.168.2.*

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 9:40 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

 

Ok.  We need a bit more information in order to figure out what is happening.  Lets start with the basics.

 

Are you running 8.8.1 on all of the nodes?

Does the output of  condor_status, show all of your execute nodes?

Does condor_q show your jobs?

 

if the jobs are getting matches, but failing to start, then the place to look is in the ShadowLog

on the submit machine.   run

 

condor_config_val shadow_log

 

on the submit node to find out where that is.  You should expect to see messages indicating that a condor_shadow

has started up, and then it will identify what job it is attempting to run.

 

-tj

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Monday, March 25, 2019 4:30 PM
To: htcondor-users@xxxxxxxxxxx
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: [HTCondor-users] Condor running slow (q,status,submit)

 

Hi all,

 

HTCondor is running really slow, and now itâs not accepting jobs, i.e., when I do a better-analyze, it matches, then subsequently rejects. I had 8.6.1 installed, removed that, and installed 8.8.1, hoping that was the problem. The central manager is on Mac OS10, with several Windows execute nodes. I also have the Mac as an execute and submit node. Iâm sure this is a configuration issue somewhere, but I canât figure out where. I do get a âinit_local_hostname_impl: ipv6_getaddrinfo() could not look upâ in my MasterLog, but I have ENABLE_IPV6 disabled.

 

Thanx

steve

 

Stephen C. Upton

Faculty Associate - Research

SEED (Simulation Experiments & Efficient Designs) Center for Data Farming

Operations Research Department

Naval Postgraduate School

Mobile: 804-994-4257

SEED Center website: https://harvest.nps.edu

 

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/