Re: [HTCondor-users] Condor running slow (q,status,submit)

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

TJ,

Thanks for responding!

Are you running 8.8.1 on all of the nodes? No, but I'm only testing on my central manager now. I want to get that correct before updating my Windows nodes. The job(s) I'm submitting are targeted to the central manager.

Does the output of condor_status show all of your execute nodes? It shows all the slots on my central manager.

Does condor_q show your jobs? Yes.

I did notice that it was taking exactly 30 seconds for condor to respond to a condor_q, condor_status, and condor_submit, so I think that helps confirm my suspicion that there is a setting somewhere I don't have correct.

Here's the (partial) output from the ShadowLog:

03/26/19 06:58:33 Using local config sources:

03/26/19 06:58:33 /usr/local/condor/local.reaper/condor_config.local

03/26/19 06:58:33 config Macros = 70, Sorted = 70, StringBytes = 2008, TablesBytes = 1168

03/26/19 06:58:33 CLASSAD_CACHING is OFF

03/26/19 06:58:33 Daemon Log is logging: D_ALWAYS D_ERROR

03/26/19 06:58:33 Daemoncore: Listening at <0.0.0.0:56067> on TCP (ReliSock).

03/26/19 06:58:33 DaemonCore: command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067&noUDP>

03/26/19 06:58:33 DaemonCore: private command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067>

03/26/19 06:58:33 Initializing a VANILLA shadow for job 3.0

03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5

03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed

03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd

03/26/19 06:59:03 (3.0) (92037): Job 3.0 is being evicted from slot1@xxxxxxxxxxxxxxxxxx

03/26/19 06:59:03 (3.0) (92037): logEvictEvent with unknown reason (108), not logging.

03/26/19 06:59:03 (3.0) (92037): **** condor_shadow (condor_SHADOW) pid 92037 EXITING WITH STATUS 108
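
The same 30-second delay is visible in the timestamps above. As a rough sketch (not an HTCondor tool; the helper names `parse_line` and `find_gaps` are made up for illustration), a few lines of Python can flag long pauses between consecutive log entries, assuming every entry starts with the `MM/DD/YY HH:MM:SS` stamp shown in this excerpt:

```python
from datetime import datetime

# Three of the ShadowLog lines quoted above (hostname left redacted as posted).
LOG = """\
03/26/19 06:58:33 Initializing a VANILLA shadow for job 3.0
03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5
03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed
"""


def parse_line(line):
    """Split a log line into (timestamp, message); return None if it has no timestamp."""
    try:
        # The stamp "MM/DD/YY HH:MM:SS" occupies the first 17 characters.
        stamp = datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")
    except ValueError:
        return None
    return stamp, line[18:].strip()


def find_gaps(text, threshold_seconds=5):
    """Return (gap_seconds, previous_message, next_message) for each pause at or over the threshold."""
    entries = [parsed for parsed in map(parse_line, text.splitlines()) if parsed]
    gaps = []
    for (t0, m0), (t1, m1) in zip(entries, entries[1:]):
        delta = (t1 - t0).total_seconds()
        if delta >= threshold_seconds:
            gaps.append((delta, m0, m1))
    return gaps


gaps = find_gaps(LOG)
for seconds, before, after in gaps:
    print(f"{seconds:.0f}s pause between:\n  {before}\n  {after}")
```

On this excerpt it reports a single 30-second pause, between the shadow initializing for job 3.0 and the failed write to the startd.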

Here is a copy of my condor_config.local as well.

## Where have you installed the bin, sbin and lib condor directories?

RELEASE_DIR = /usr/local/condor

ENABLE_IPV6 = False

USE_SHARED_PORT = False

## Where is the local condor directory for each host? This is where the local config file(s), logs and

## spool/execute directories are located. this is the default for Linux and Unix systems.

## this is the default on Windows systems

NETWORK_HOSTNAME = reaper.ern.nps.edu

NETWORK_INTERFACE = 192.168.2.1

LOCAL_DIR = /usr/local/condor/local.reaper

UID_DOMAIN = reaper.localnet

JAVA = /usr/bin/java

CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxxx

MAIL = /usr/bin/mail

FILESYSTEM_DOMAIN = reaper.localnet

LOCK = /tmp/condor-lock.0.21512031190014

JAVA_MAXHEAP_ARGUMENT = -Xmx1024m

DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD

CONDOR_HOST = 192.168.2.1

CONDOR_IDS = 503.20

NUM_CPUS = 15

SEC_DEFAULT_NEGOTIATION = OPTIONAL

HOSTALLOW_WRITE = 192.168.2.*

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 9:40 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)

Ok. We need a bit more information in order to figure out what is happening. Let's start with the basics.

Are you running 8.8.1 on all of the nodes?

Does the output of condor_status show all of your execute nodes?

Does condor_q show your jobs?

If the jobs are getting matches but failing to start, then the place to look is in the ShadowLog on the submit machine. Run

condor_config_val shadow_log

on the submit node to find out where that is. You should expect to see messages indicating that a condor_shadow has started up, and then it will identify what job it is attempting to run.
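
For anyone who prefers to script that lookup, here is a minimal sketch in Python. It only wraps the `condor_config_val shadow_log` command given above; the function names `shadow_log_path` and `tail` are hypothetical helpers, and it assumes `condor_config_val` is on the PATH of the submit node.

```python
import subprocess
from collections import deque


def shadow_log_path(run=subprocess.run):
    """Ask condor_config_val where the ShadowLog lives (same command as above)."""
    result = run(
        ["condor_config_val", "shadow_log"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def tail(path, count=40):
    """Return the last `count` lines of a text file."""
    with open(path, errors="replace") as handle:
        return list(deque(handle, maxlen=count))
```

On the submit node, `print("".join(tail(shadow_log_path())))` would then show the most recent shadow messages. The `run` parameter is there only so the lookup can be tried out without HTCondor installed.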

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Monday, March 25, 2019 4:30 PM
To: htcondor-users@xxxxxxxxxxx
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: [HTCondor-users] Condor running slow (q,status,submit)

Hi all,

HTCondor is running really slow, and now it's not accepting jobs, i.e., when I do a better-analyze, it matches, then subsequently rejects. I had 8.6.1 installed, removed that, and installed 8.8.1, hoping that was the problem. The central manager is on Mac OS10, with several Windows execute nodes. I also have the Mac as an execute and submit node. I'm sure this is a configuration issue somewhere, but I can't figure out where. I do get a "init_local_hostname_impl: ipv6_getaddrinfo() could not look up" in my MasterLog, but I have ENABLE_IPV6 disabled.

Thanx

steve

Stephen C. Upton

Faculty Associate - Research

SEED (Simulation Experiments & Efficient Designs) Center for Data Farming

Operations Research Department

Naval Postgraduate School

Mobile: 804-994-4257

NIPR: scupton@xxxxxxx

SIPR: uptonsc@xxxxxxxxxxxxxxxxx

SEED Center website: https://harvest.nps.edu
