
Re: [HTCondor-users] Condor running slow (q,status,submit)



Thanks for responding!


Are you running 8.8.1 on all of the nodes? No, but I'm only testing on my central manager now. I want to get that correct before updating my Windows nodes. The job(s) I'm submitting are targeted to the central manager.

Does the output of condor_status show all of your execute nodes? It shows all the slots on my central manager.

Does condor_q show your jobs? Yes.


I did notice that it was taking exactly 30 seconds for Condor to respond to condor_q, condor_status, and condor_submit, so I think that helps confirm my suspicion that there is a setting somewhere I don't have correct.
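An exact, repeatable 30-second pause often points at a name-lookup (frequently reverse-DNS) timeout rather than a busy daemon. A quick sanity check, sketched here for a typical Unix shell (the IP address below is illustrative; substitute the central manager's real address), is to time the tools and confirm the host name resolves both ways:

```shell
# Time the client tools; a consistent ~30 s wait on each suggests a
# lookup timeout rather than a scheduling problem.
time condor_status
time condor_q

# Confirm forward and reverse resolution of the central manager's name.
host reaper.ern.nps.edu   # forward lookup
host 192.168.2.10         # reverse lookup (placeholder IP; use the real one)
```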


Here's the (partial) output from the ShadowLog:

03/26/19 06:58:33 Using local config sources:

03/26/19 06:58:33    /usr/local/condor/local.reaper/condor_config.local

03/26/19 06:58:33 config Macros = 70, Sorted = 70, StringBytes = 2008, TablesBytes = 1168

03/26/19 06:58:33 CLASSAD_CACHING is OFF

03/26/19 06:58:33 Daemon Log is logging: D_ALWAYS D_ERROR

03/26/19 06:58:33 Daemoncore: Listening at <> on TCP (ReliSock).

03/26/19 06:58:33 DaemonCore: command socket at <>

03/26/19 06:58:33 DaemonCore: private command socket at <>

03/26/19 06:58:33 Initializing a VANILLA shadow for job 3.0

03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5

03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed

03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd

03/26/19 06:59:03 (3.0) (92037): Job 3.0 is being evicted from slot1@xxxxxxxxxxxxxxxxxx

03/26/19 06:59:03 (3.0) (92037): logEvictEvent with unknown reason (108), not logging.

03/26/19 06:59:03 (3.0) (92037): **** condor_shadow (condor_SHADOW) pid 92037 EXITING WITH STATUS 108
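The "Socket closed when trying to write ... to startd" line above suggests the shadow reached the startd but the connection was dropped before the claim could be activated. One way to probe connectivity and authorization from the submit machine to a startd is condor_ping; the slot name below is a placeholder (the real one is masked in the log), so substitute what condor_status reports:

```shell
# Check that this machine can reach, authenticate to, and is authorized
# at the WRITE level on the startd. Slot name is a placeholder.
condor_ping -verbose -name slot1@reaper.ern.nps.edu -type startd WRITE
```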



Here is a copy of my condor_config.local as well.


##  Where have you installed the bin, sbin and lib condor directories?


RELEASE_DIR = /usr/local/condor





##  Where is the local condor directory for each host?  This is where the local config file(s), logs and

##  spool/execute directories are located. This is the default for Linux and Unix systems;

##  a different default applies on Windows systems.

NETWORK_HOSTNAME = reaper.ern.nps.edu




LOCAL_DIR = /usr/local/condor/local.reaper


UID_DOMAIN = reaper.localnet


JAVA = /usr/bin/java


CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxxx


MAIL = /usr/bin/mail


FILESYSTEM_DOMAIN = reaper.localnet


LOCK = /tmp/condor-lock.0.21512031190014








CONDOR_IDS = 503.20






HOSTALLOW_WRITE = 192.168.2.*



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, March 26, 2019 at 9:40 AM
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Cc: Mary McDonald <mlmcdona@xxxxxxx>
Subject: Re: [HTCondor-users] Condor running slow (q,status,submit)


OK. We need a bit more information in order to figure out what is happening. Let's start with the basics.


Are you running 8.8.1 on all of the nodes?

Does the output of condor_status show all of your execute nodes?

Does condor_q show your jobs?


If the jobs are getting matched but failing to start, then the place to look is the ShadowLog

on the submit machine. Run


condor_config_val shadow_log


on the submit node to find out where that is. You should expect to see messages indicating that a condor_shadow

has started up, and then it will identify which job it is attempting to run.
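Concretely, the two steps above can be run together (the log path varies per installation):

```shell
# Ask Condor where the shadow log lives on this submit machine.
condor_config_val shadow_log

# Watch it while re-submitting the job; each condor_shadow announces
# the job ID it is attempting to run.
tail -f "$(condor_config_val shadow_log)"
```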





From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Upton, Stephen (Steve) (CIV)
Sent: Monday, March 25, 2019 4:30 PM
To: htcondor-users@xxxxxxxxxxx
Cc: McDonald, Mary (CIV) <mlmcdona@xxxxxxx>
Subject: [HTCondor-users] Condor running slow (q,status,submit)


Hi all,


HTCondor is running really slow, and now it's not accepting jobs, i.e., when I do a better-analyze, it matches, then subsequently rejects. I had 8.6.1 installed, removed that, and installed 8.8.1, hoping that would fix the problem. The central manager is on Mac OS 10, with several Windows execute nodes. I also have the Mac as an execute and submit node. I'm sure this is a configuration issue somewhere, but I can't figure out where. I do get an "init_local_hostname_impl: ipv6_getaddrinfo() could not look up" message in my MasterLog, but I have ENABLE_IPV6 disabled.
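One way to confirm which value of ENABLE_IPV6 the daemons are actually seeing, and which config file sets it, is condor_config_val's verbose mode (sketched for a typical install):

```shell
# Show the effective value plus the file and line that defined it.
condor_config_val -verbose ENABLE_IPV6

# For comparison, check the IPv4 setting the same way.
condor_config_val -verbose ENABLE_IPV4
```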





Stephen C. Upton

Faculty Associate - Research

SEED (Simulation Experiments & Efficient Designs) Center for Data Farming

Operations Research Department

Naval Postgraduate School

Mobile: 804-994-4257

NIPR: scupton@xxxxxxx

SIPR: uptonsc@xxxxxxxxxxxxxxxxx

SEED Center website: https://harvest.nps.edu