
Re: [HTCondor-users] Failed to locate startd - Can't find address for startd



On 9/9/2019 10:03 AM, Justin Fisher wrote:
> Hi.
> 
> I'm wondering if anyone can help me, please? My condor_status comes up 
> empty. Here is some relevant information:
> 

Hi Justin,

It would help if you told us what you are trying to do...  

Assuming you are trying to set up a two-server pool where one machine is a worker node and the other machine is the central manager plus submit node, here are some pithy suggestions:

0. Install via the RPMs as shown here:

   https://research.cs.wisc.edu/htcondor/instructions/el/7/stable/
  
1. Do not make any edits to /etc/condor/condor_config

2. Have /etc/hosts (or DNS) properly set up for both forward and reverse address lookups if you want to use host names instead of addresses. For instance, include the following in the /etc/hosts file on each node:

   192.168.1.206 manager.jfisher.ingenazure.com manager
   192.168.1.207 worker1.jfisher.ingenazure.com worker1

3. On your worker node(s), in a file in /etc/condor/config.d, have something like 

   DAEMON_LIST = MASTER STARTD
   CONDOR_HOST = manager.jfisher.ingenazure.com  
   ALLOW_WRITE = 192.168.1.*

4. On your central manager plus submit node, in a file in /etc/condor/config.d, have something like 

   DAEMON_LIST = MASTER COLLECTOR NEGOTIATOR SCHEDD
   CONDOR_HOST = manager.jfisher.ingenazure.com  
   ALLOW_WRITE = 192.168.1.*
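
To sanity-check the /etc/hosts entries from the name-lookup step above, something like the following Python sketch may help. It only inspects the text of a hosts file (it does not query the resolver or DNS), it treats the first name on each line as the canonical one, and the function names parse_hosts and check_round_trip are made up for illustration:

```python
def parse_hosts(text):
    """Parse hosts-file text into (forward, reverse) lookup tables.

    forward maps every name and alias to its address;
    reverse maps each address to the first (canonical) name on its line.
    """
    forward, reverse = {}, {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        if not names:
            continue
        reverse.setdefault(address, names[0])
        for name in names:
            forward.setdefault(name, address)
    return forward, reverse


def check_round_trip(text, hostnames):
    """Return a list of problems; an empty list means every name round-trips."""
    forward, reverse = parse_hosts(text)
    problems = []
    for name in hostnames:
        address = forward.get(name)
        if address is None:
            problems.append(f"{name}: no forward entry")
        elif reverse[address] != name:
            problems.append(f"{name}: {address} maps back to {reverse[address]}")
    return problems


# The two example entries from the hosts step above.
HOSTS = """\
192.168.1.206 manager.jfisher.ingenazure.com manager
192.168.1.207 worker1.jfisher.ingenazure.com worker1
"""

print(check_round_trip(HOSTS, ["manager.jfisher.ingenazure.com",
                               "worker1.jfisher.ingenazure.com"]))
```

On a real node you could pass open("/etc/hosts").read() as the text. Checking the fully qualified name used for CONDOR_HOST is the important case, since that is the name the daemons will look up.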

You probably want to tighten the ALLOW_WRITE setting above if you do not trust all hosts on the 192.168.1.* 
network. The above is just something to help you get going, not a production config!
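
To get a feel for what tightening means here, the rough Python sketch below compares the wildcard with an explicit two-address list. It uses shell-style matching from Python's fnmatch module as a stand-in, which is only an approximation and not HTCondor's own matching code. The TIGHT value is a hypothetical example built from the two addresses in the /etc/hosts sample, and the helper name allowed is made up:

```python
from fnmatch import fnmatchcase


def allowed(address, allow_list):
    """True if address matches any comma-separated pattern in allow_list."""
    patterns = [p.strip() for p in allow_list.split(",") if p.strip()]
    return any(fnmatchcase(address, pattern) for pattern in patterns)


LOOSE = "192.168.1.*"                   # the starter setting from the config above
TIGHT = "192.168.1.206, 192.168.1.207"  # hypothetical: only the two pool nodes

# Print each address with its result under the loose and the tight list.
for address in ("192.168.1.206", "192.168.1.207", "192.168.1.50", "10.0.0.5"):
    print(address, allowed(address, LOOSE), allowed(address, TIGHT))
```

Under the wildcard, any host on the 192.168.1 network (such as 192.168.1.50 in this sketch) gets write access, while the explicit list admits only the manager and the worker.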

Besides the Manual, this (old) blog post may help you a bit:
  https://spinningmatt.wordpress.com/2011/06/12/getting-started-creating-a-multiple-node-condor-pool/

Also, I see "Azure" in your host name... if you are using Azure, it may be of interest that the Azure CycleCloud 
tool supports spinning up an HTCondor pool in Azure with a few points-and-clicks.  See

  https://docs.microsoft.com/en-us/azure/cyclecloud/overview

Hope the above helps
Todd
  



> --
> Kind regards,
> 
> Justin Fisher
> 
> CentOS 7 - all machines
> SELinux is disabled on all machines
> 
> condor_version
> $CondorVersion: 8.8.5 Sep 04 2019 BuildID: 480168 PackageID: 8.8.5-1 $
> $CondorPlatform: x86_64_RedHat7 $
> 
> 
> Worker machine:
> 
> DAEMON_LIST = MASTER, COLLECTOR, STARTD, SCHEDD, SHARED_PORT
> 
> sudo systemctl status condor.service
> ● condor.service - Condor Distributed High-Throughput-Computing
>    Loaded: loaded (/usr/lib/systemd/system/condor.service; enabled; 
> vendor preset: disabled)
>    Active: active (running) since Mon 2019-09-09 13:23:42 CEST; 1h 
> 19min ago
>  Main PID: 5976 (condor_master)
>    Status: "All daemons are responding"
>    Memory: 26.1M
>    CGroup: /system.slice/condor.service
>            ├─5976 /usr/sbin/condor_master -f
>            ├─6206 condor_procd -A /var/run/condor/procd_pipe -L 
> /var/log/condor/ProcLog -R 1000000 -S 60 -C 983
>            ├─6207 condor_shared_port -f -p 9618
>            ├─6216 condor_collector -f
>            ├─6217 condor_startd -f
>            └─6218 condor_schedd -f
> 
> 
> ps -ef | grep condor
> condor   15912     1  0 15:30 ?        00:00:00 
> /usr/sbin/condor_master -f
> root     16032 15912  0 15:30 ?        00:00:00 condor_procd -A 
> /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 
> -C 983
> condor   16033 15912  0 15:30 ?        00:00:00 condor_shared_port 
> -f -p 9618
> condor   16034 15912  0 15:30 ?        00:00:00 condor_collector -f
> condor   16035 15912  0 15:30 ?        00:00:00 condor_startd -f
> condor   16036 15912  0 15:30 ?        00:00:00 condor_schedd -f
> jfisher  19755  8348  0 16:34 pts/0    00:00:00 grep --color=auto condor
> 
> 
> condor_status
> Error: communication error
> CONDOR_STATUS:1:Unable to resolve COLLECTOR_HOST (:9618).
> 
> condor_status -direct 192.168.1.206
> Error: Failed to locate startd 192.168.1.206
> 
> 
> 
> 
> 
> Master Machine
> 
> DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, SHARED_PORT
> 
> condor_status -direct 192.168.1.206
> Error: Failed to locate startd 192.168.1.206
> Can't find address for startd 192.168.1.206
> 
> 
> ps -ef | grep condor
> condor   42828     1  0 16:19 ?        00:00:00 
> /usr/sbin/condor_master -f
> root     42881 42828  0 16:19 ?        00:00:00 condor_procd -A 
> /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 
> -C 979
> condor   42882 42828  0 16:19 ?        00:00:00 condor_shared_port 
> -f -p 9618
> condor   42885 42828  0 16:19 ?        00:00:00 condor_negotiator -f
> condor   42886 42828  0 16:19 ?        00:00:00 condor_schedd -f
> condor   43387 42828  0 16:27 ?        00:00:00 condor_collector -f
> jfisher  43637 43334  0 16:30 pts/0    00:00:00 grep --color=auto condor
> 
> CollectorLog
> 
> 09/09/19 16:44:11 Now in new log file /var/log/condor/CollectorLog
> 09/09/19 16:44:11 Enabling CCB Server.
> 09/09/19 16:44:11 m_reconnect_fname = 
> /var/lib/condor/spool/192.168.1.206-9618.ccb_reconnect
> 09/09/19 16:44:11 Configuration: SAMPLING_INTERVAL=60, 
> MAX_STORAGE=10000000, MaxFileSize=333333, POOL_HISTORY_DIR=/var/ViewHist
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist0.0.old , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist0.0.new , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist0.1.old , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist0.1.new , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist0.2.old , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist0.2.new , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist1.0.old , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist1.0.new , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist1.1.old , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist1.1.new , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist1.2.old , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist1.2.new , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist2.0.old , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist2.0.new , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist2.1.old , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist2.1.new , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist2.2.old , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist2.2.new , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist3.0.old , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist3.0.new , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist3.1.old , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist3.1.new , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist3.2.old , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist3.2.new , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist4.0.old , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist4.0.new , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist4.1.old , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist4.1.new , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist4.2.old , StartTime=-1
> 09/09/19 16:44:11 FileName=/var/ViewHist/viewhist4.2.new , StartTime=-1
> 09/09/19 16:44:11 DC_AUTHENTICATE: attempt to open invalid session 
> jfisher:43847:1568040061:3, failing; this session was requested by 
> <192.168.1.206:11115> with return address 
> <192.168.1.206:9618?addrs=192.168.1.206-9618&noUDP&sock=42828_0631>
> 09/09/19 16:44:12 CollectorAd  : Inserting ** "< My Pool - 
> jfisher.ingenazure.com@xxxxxxxxxxxxxxxxxxxxxx >"
> 09/09/19 16:44:42 DC_AUTHENTICATE: attempt to open invalid session 
> jfisher:43847:1568039862:1, failing; this session was requested by 
> <192.168.1.206:7312> with return address 
> <192.168.1.206:9618?addrs=192.168.1.206-9618&noUDP&sock=42828_0631_3>
> 09/09/19 16:44:43 Got QUERY_STARTD_PVT_ADS
> 09/09/19 16:44:43 QueryWorker: forked new high priority worker with id 
> 44616 ( max 4 active 1 pending 0 )
> 09/09/19 16:44:43 (Sending 0 ads in response to query)
> 09/09/19 16:44:43 Query info: matched=0; skipped=0; query_time=0.000164; 
> send_time=0.000062; type=MachinePrivate; requirements={true}; locate=0; 
> limit=0; from=NEGOTIATOR; peer=<192.168.1.206:27041>; projection={}
> 09/09/19 16:44:43 QueryWorker: forked new high priority worker with id 
> 44617 ( max 4 active 1 pending 0 )
> 09/09/19 16:44:43 (Sending 0 ads in response to query)
> 09/09/19 16:44:43 Query info: matched=0; skipped=1; query_time=0.000231; 
> send_time=0.000055; type=Any; requirements={(((MyType == "Scheduler") || 
> (MyType == "Submitter")) || ((MyType == "Machine")))}; locate=0; 
> limit=0; from=NEGOTIATOR; peer=<192.168.1.206:12622>; projection={}
> 09/09/19 16:44:43 AccountingAd  : Inserting ** "< 
> <none>jfisher.ingenazure.com >"
> 09/09/19 16:44:51 ScheddAd   : Inserting ** "< jfisher.ingenazure.com 
> , 192.168.1.206 >"
> 09/09/19 16:45:43 Got QUERY_STARTD_PVT_ADS
> 09/09/19 16:45:43 QueryWorker: forked new high priority worker with id 
> 44656 ( max 4 active 1 pending 0 )
> 09/09/19 16:45:43 (Sending 0 ads in response to query)
> 09/09/19 16:45:43 Query info: matched=0; skipped=0; query_time=0.000159; 
> send_time=0.000055; type=MachinePrivate; requirements={true}; locate=0; 
> limit=0; from=NEGOTIATOR; peer=<192.168.1.206:17677>; projection={}
> 09/09/19 16:45:43 QueryWorker: forked new high priority worker with id 
> 44657 ( max 4 active 1 pending 0 )
> 09/09/19 16:45:43 (Sending 1 ads in response to query)
> 09/09/19 16:45:43 Query info: matched=1; skipped=2; query_time=0.000274; 
> send_time=0.000407; type=Any; requirements={(((MyType == "Scheduler") || 
> (MyType == "Submitter")) || ((MyType == "Machine")))}; locate=0; 
> limit=0; from=NEGOTIATOR; peer=<192.168.1.206:31536>; projection={}
> 09/09/19 16:46:43 Got QUERY_STARTD_PVT_ADS
> 09/09/19 16:46:43 QueryWorker: forked new high priority worker with id 
> 44675 ( max 4 active 1 pending 0 )
> 09/09/19 16:46:43 (Sending 0 ads in response to query)
> 09/09/19 16:46:43 Query info: matched=0; skipped=0; query_time=0.000190; 
> send_time=0.000059; type=MachinePrivate; requirements={true}; locate=0; 
> limit=0; from=NEGOTIATOR; peer=<192.168.1.206:1767>; projection={}
> 09/09/19 16:46:43 QueryWorker: forked new high priority worker with id 
> 44676 ( max 4 active 1 pending 0 )
> 09/09/19 16:46:43 (Sending 1 ads in response to query)
> 09/09/19 16:46:43 Query info: matched=1; skipped=2; query_time=0.000244; 
> send_time=0.000417; type=Any; requirements={(((MyType == "Scheduler") || 
> (MyType == "Submitter")) || ((MyType == "Machine")))}; locate=0; 
> limit=0; from=NEGOTIATOR; peer=<192.168.1.206:31281>; projection={}
> 09/09/19 16:47:43 Got QUERY_STARTD_PVT_ADS
> 09/09/19 16:47:43 QueryWorker: forked new high priority worker with id 
> 44696 ( max 4 active 1 pending 0 )
> 09/09/19 16:47:43 (Sending 0 ads in response to query)
> 09/09/19 16:47:43 Query info: matched=0; skipped=0; query_time=0.000212; 
> send_time=0.000055; type=MachinePrivate; requirements={true}; locate=0; 
> limit=0; from=NEGOTIATOR; peer=<192.168.1.206:27982>; projection={}
> 09/09/19 16:47:43 QueryWorker: forked new high priority worker with id 
> 44697 ( max 4 active 1 pending 0 )
> 09/09/19 16:47:43 (Sending 1 ads in response to query)
> 09/09/19 16:47:43 Query info: matched=1; skipped=2; query_time=0.000464; 
> send_time=0.000845; type=Any; requirements={(((MyType == "Scheduler") || 
> (MyType == "Submitter")) || ((MyType == "Machine")))}; locate=0; 
> limit=0; from=NEGOTIATOR; peer=<192.168.1.206:9918>; projection={}
> 09/09/19 16:48:43 Got QUERY_STARTD_PVT_ADS
> 09/09/19 16:48:43 QueryWorker: forked new high priority worker with id 
> 44724 ( max 4 active 1 pending 0 )
> 09/09/19 16:48:43 (Sending 0 ads in response to query)
> 09/09/19 16:48:43 Query info: matched=0; skipped=0; query_time=0.000227; 
> send_time=0.000070; type=MachinePrivate; requirements={true}; locate=0; 
> limit=0; from=NEGOTIATOR; peer=<192.168.1.206:24285>; projection={}
> 09/09/19 16:48:43 QueryWorker: forked new high priority worker with id 
> 44725 ( max 4 active 1 pending 0 )
> 09/09/19 16:48:43 (Sending 1 ads in response to query)
> 09/09/19 16:48:43 Query info: matched=1; skipped=2; query_time=0.000307; 
> send_time=0.000718; type=Any; requirements={(((MyType == "Scheduler") || 
> (MyType == "Submitter")) || ((MyType == "Machine")))}; locate=0; 
> limit=0; from=NEGOTIATOR; peer=<192.168.1.206:22064>; projection={}
> 
> 
> MasterLog
> 09/09/19 16:17:55 ******************************************************
> 09/09/19 16:17:55 ** condor_master (CONDOR_MASTER) STARTING UP
> 09/09/19 16:17:55 ** /usr/sbin/condor_master
> 09/09/19 16:17:55 ** SubsystemInfo: name=MASTER type=MASTER(2) 
> class=DAEMON(1)
> 09/09/19 16:17:55 ** Configuration: subsystem:MASTER local:<NONE> 
> class:DAEMON
> 09/09/19 16:17:55 ** $CondorVersion: 8.8.5 Sep 04 2019 BuildID: 480168 
> PackageID: 8.8.5-1 $
> 09/09/19 16:17:55 ** $CondorPlatform: x86_64_RedHat7 $
> 09/09/19 16:17:55 ** PID = 42671
> 09/09/19 16:17:55 ** Log last touched 9/9 16:17:50
> 09/09/19 16:17:55 ******************************************************
> 09/09/19 16:17:55 Using config source: /etc/condor/condor_config
> 09/09/19 16:17:55 Using local config sources:
> 09/09/19 16:17:55   /etc/condor/config.d/00master.config
> 09/09/19 16:17:55   /etc/condor/condor_config.local
> 09/09/19 16:17:55 config Macros = 77, Sorted = 77, StringBytes = 1998, 
> TablesBytes = 2820
> 09/09/19 16:17:55 CLASSAD_CACHING is OFF
> 09/09/19 16:17:55 Daemon Log is logging: D_ALWAYS D_ERROR
> 09/09/19 16:17:56 Removed /var/lock/condor/shared_port_ad (assuming it 
> is left over from previous run)
> 09/09/19 16:17:56 SharedPortEndpoint: waiting for connections to named 
> socket 42671_55d3
> 09/09/19 16:17:56 SharedPortEndpoint: failed to open 
> /var/lock/condor/shared_port_ad: No such file or directory
> 09/09/19 16:17:56 SharedPortEndpoint: did not successfully find 
> SharedPortServer address. Will retry in 60s.
> 09/09/19 16:17:56 DaemonCore: private command socket at 
> <192.168.1.206:0?sock=42671_55d3>
> 09/09/19 16:17:56 Master restart (GRACEFUL) is watching 
> /usr/sbin/condor_master (mtime:1567657410)
> 09/09/19 16:17:56 Started DaemonCore process 
> "/usr/libexec/condor/condor_shared_port", pid and pgroup = 42721
> 09/09/19 16:17:56 Waiting for /var/lock/condor/shared_port_ad to appear.
> 09/09/19 16:17:57 Found /var/lock/condor/shared_port_ad.
> 09/09/19 16:17:57 Started DaemonCore process 
> "/usr/sbin/condor_collector", pid and pgroup = 42722
> 09/09/19 16:17:57 Waiting for /var/log/condor/.collector_address to appear.
> 09/09/19 16:17:58 Found /var/log/condor/.collector_address.
> 09/09/19 16:17:58 Started DaemonCore process 
> "/usr/sbin/condor_negotiator", pid and pgroup = 42723
> 09/09/19 16:17:58 Started DaemonCore process "/usr/sbin/condor_schedd", 
> pid and pgroup = 42724
> 09/09/19 16:19:39 Got SIGQUIT. Performing fast shutdown.
> 09/09/19 16:19:39 Sent SIGQUIT to COLLECTOR (pid 42722)
> 09/09/19 16:19:39 Sent SIGQUIT to NEGOTIATOR (pid 42723)
> 09/09/19 16:19:39 Sent SIGQUIT to SCHEDD (pid 42724)
> 09/09/19 16:19:39 AllReaper unexpectedly called on pid 42723, status 0.
> 09/09/19 16:19:39 The NEGOTIATOR (pid 42723) exited with status 0
> 09/09/19 16:19:39 AllReaper unexpectedly called on pid 42722, status 0.
> 09/09/19 16:19:39 The COLLECTOR (pid 42722) exited with status 0
> 09/09/19 16:19:39 AllReaper unexpectedly called on pid 42724, status 0.
> 09/09/19 16:19:39 The SCHEDD (pid 42724) exited with status 0
> 09/09/19 16:19:39 Sent SIGTERM to SHARED_PORT (pid 42721)
> 09/09/19 16:19:39 AllReaper unexpectedly called on pid 42721, status 0.
> 09/09/19 16:19:39 The SHARED_PORT (pid 42721) exited with status 0
> 09/09/19 16:19:39 About to tell the ProcD to exit
> 09/09/19 16:19:39 procd (pid = 42720) exited with status 0
> 09/09/19 16:19:39 All daemons are gone. Exiting.
> 09/09/19 16:19:39 **** condor_master (condor_MASTER) pid 42671 EXITING 
> WITH STATUS 0
> 09/09/19 16:19:39 ******************************************************
> 09/09/19 16:19:39 ** condor_master (CONDOR_MASTER) STARTING UP
> 09/09/19 16:19:39 ** /usr/sbin/condor_master
> 09/09/19 16:19:39 ** SubsystemInfo: name=MASTER type=MASTER(2) 
> class=DAEMON(1)
> 09/09/19 16:19:39 ** Configuration: subsystem:MASTER local:<NONE> 
> class:DAEMON
> 09/09/19 16:19:39 ** $CondorVersion: 8.8.5 Sep 04 2019 BuildID: 480168 
> PackageID: 8.8.5-1 $
> 09/09/19 16:19:39 ** $CondorPlatform: x86_64_RedHat7 $
> 09/09/19 16:19:39 ** PID = 42828
> 09/09/19 16:19:39 ** Log last touched 9/9 16:19:39
> 09/09/19 16:19:39 ******************************************************
> 09/09/19 16:19:39 Using config source: /etc/condor/condor_config
> 09/09/19 16:19:39 Using local config sources:
> 09/09/19 16:19:39   /etc/condor/config.d/00master.config
> 09/09/19 16:19:39   /etc/condor/condor_config.local
> 09/09/19 16:19:39 config Macros = 77, Sorted = 77, StringBytes = 2000, 
> TablesBytes = 2820
> 09/09/19 16:19:39 CLASSAD_CACHING is OFF
> 09/09/19 16:19:39 Daemon Log is logging: D_ALWAYS D_ERROR
> 09/09/19 16:19:40 SharedPortEndpoint: waiting for connections to named 
> socket 42828_0631
> 09/09/19 16:19:40 SharedPortEndpoint: failed to open 
> /var/lock/condor/shared_port_ad: No such file or directory
> 09/09/19 16:19:40 SharedPortEndpoint: did not successfully find 
> SharedPortServer address. Will retry in 60s.
> 09/09/19 16:19:40 DaemonCore: private command socket at 
> <192.168.1.206:0?sock=42828_0631>
> 09/09/19 16:19:40 Master restart (GRACEFUL) is watching 
> /usr/sbin/condor_master (mtime:1567657410)
> 09/09/19 16:19:40 Started DaemonCore process 
> "/usr/libexec/condor/condor_shared_port", pid and pgroup = 42882
> 09/09/19 16:19:40 Waiting for /var/lock/condor/shared_port_ad to appear.
> 09/09/19 16:19:41 Found /var/lock/condor/shared_port_ad.
> 09/09/19 16:19:41 Started DaemonCore process 
> "/usr/sbin/condor_collector", pid and pgroup = 42883
> 09/09/19 16:19:41 Waiting for /var/log/condor/.collector_address to appear.
> 09/09/19 16:19:42 Found /var/log/condor/.collector_address.
> 09/09/19 16:19:42 Started DaemonCore process 
> "/usr/sbin/condor_negotiator", pid and pgroup = 42885
> 09/09/19 16:19:42 Started DaemonCore process "/usr/sbin/condor_schedd", 
> pid and pgroup = 42886
> 09/09/19 16:27:41 DefaultReaper unexpectedly called on pid 42883, status 
> 1024.
> 09/09/19 16:27:41 The COLLECTOR (pid 42883) exited with status 4
> 09/09/19 16:27:41 Sending obituary for "/usr/sbin/condor_collector"
> 09/09/19 16:27:41 restarting /usr/sbin/condor_collector in 10 seconds
> 09/09/19 16:27:41 condor_write(): Socket closed when trying to write 
> 1439 bytes to collector jfisher.ingenazure.com:9618, fd is 10
> 09/09/19 16:27:41 Buf::write(): condor_write() failed
> 09/09/19 16:27:51 Started DaemonCore process 
> "/usr/sbin/condor_collector", pid and pgroup = 43387
> 09/09/19 16:27:51 condor_write(): Socket closed when trying to write 
> 1448 bytes to collector jfisher.ingenazure.com:9618, fd is 10, 
> errno=104 Connection reset by peer
> 09/09/19 16:27:51 Buf::write(): condor_write() failed
> 09/09/19 16:32:51 condor_write(): Socket closed when trying to write 
> 1465 bytes to collector jfisher.ingenazure.com:9618, fd is 10, 
> errno=104 Connection reset by peer
> 09/09/19 16:32:51 Buf::write(): condor_write() failed
> 09/09/19 16:35:51 DefaultReaper unexpectedly called on pid 43387, status 
> 1024.
> 09/09/19 16:35:51 The COLLECTOR (pid 43387) exited with status 4
> 09/09/19 16:35:51 Sending obituary for "/usr/sbin/condor_collector"
> 09/09/19 16:35:51 restarting /usr/sbin/condor_collector in 10 seconds
> 09/09/19 16:35:51 condor_write(): Socket closed when trying to write 
> 1439 bytes to collector jfisher.ingenazure.com:9618, fd is 10
> 09/09/19 16:35:51 Buf::write(): condor_write() failed
> 09/09/19 16:36:01 Started DaemonCore process 
> "/usr/sbin/condor_collector", pid and pgroup = 43847
> 09/09/19 16:36:01 condor_write(): Socket closed when trying to write 
> 1448 bytes to collector jfisher.ingenazure.com:9618, fd is 10, 
> errno=104 Connection reset by peer
> 09/09/19 16:36:01 Buf::write(): condor_write() failed
> 09/09/19 16:41:01 condor_write(): Socket closed when trying to write 
> 1465 bytes to collector jfisher.ingenazure.com:9618, fd is 10, 
> errno=104 Connection reset by peer
> 09/09/19 16:41:01 Buf::write(): condor_write() failed
> 09/09/19 16:44:01 DefaultReaper unexpectedly called on pid 43847, status 
> 1024.
> 09/09/19 16:44:01 The COLLECTOR (pid 43847) exited with status 4
> 09/09/19 16:44:01 Sending obituary for "/usr/sbin/condor_collector"
> 09/09/19 16:44:01 restarting /usr/sbin/condor_collector in 10 seconds
> 09/09/19 16:44:01 condor_write(): Socket closed when trying to write 
> 1456 bytes to collector jfisher.ingenazure.com:9618, fd is 10
> 09/09/19 16:44:01 Buf::write(): condor_write() failed
> 09/09/19 16:44:11 Started DaemonCore process 
> "/usr/sbin/condor_collector", pid and pgroup = 44606
> 09/09/19 16:44:11 condor_write(): Socket closed when trying to write 
> 1466 bytes to collector jfisher.ingenazure.com:9618, fd is 10, 
> errno=104 Connection reset by peer
> 09/09/19 16:44:11 Buf::write(): condor_write() failed
> 09/09/19 16:49:11 condor_write(): Socket closed when trying to write 
> 1467 bytes to collector jfisher.ingenazure.com:9618, fd is 10, 
> errno=104 Connection reset by peer
> 09/09/19 16:49:11 Buf::write(): condor_write() failed
> 09/09/19 16:52:11 DefaultReaper unexpectedly called on pid 44606, status 
> 1024.
> 09/09/19 16:52:11 The COLLECTOR (pid 44606) exited with status 4
> 09/09/19 16:52:11 Sending obituary for "/usr/sbin/condor_collector"
> 09/09/19 16:52:11 restarting /usr/sbin/condor_collector in 10 seconds
> 09/09/19 16:52:11 condor_write(): Socket closed when trying to write 
> 1458 bytes to collector jfisher.ingenazure.com:9618, fd is 10
> 09/09/19 16:52:11 Buf::write(): condor_write() failed
> 
> NegotiatorLog
> 09/09/19 16:23:42 ---------- Started Negotiation Cycle ----------
> 09/09/19 16:23:42 Phase 1:  Obtaining ads from collector ...
> 09/09/19 16:23:42   Getting startd private ads ...
> 09/09/19 16:23:42   Getting Scheduler, Submitter and Machine ads ...
> 09/09/19 16:23:42   Sorting 1 ads ...
> 09/09/19 16:23:42 Got ads: 1 public and 0 private
> 09/09/19 16:23:42 Public ads include 0 submitter, 0 startd
> 09/09/19 16:23:42 Phase 2:  Performing accounting ...
> 09/09/19 16:23:42 Phase 3:  Sorting submitter ads by priority ...
> 09/09/19 16:23:42 Starting prefetch round; 0 potential prefetches to do.
> 09/09/19 16:23:42 Prefetch summary: 0 attempted, 0 successful.
> 09/09/19 16:23:42 Phase 4.1:  Negotiating with schedds ...
> 09/09/19 16:23:42  negotiateWithGroup resources used submitterAds length 0
> 09/09/19 16:23:42 ---------- Finished Negotiation Cycle ----------
> 
> ProcLog
> 09/09/19 16:22:40 : taking a snapshot...
> 09/09/19 16:22:40 : ProcAPI: new boottime = 1568017676; old_boottime = 
> 1568017676; /proc/stat boottime = 1568017676; /proc/uptime boottime = 
> 1568017676
> 09/09/19 16:22:40 : process 42958 (not in monitored family) has exited
> 09/09/19 16:22:40 : process 42809 (not in monitored family) has exited
> 09/09/19 16:22:40 : process 42808 (not in monitored family) has exited
> 09/09/19 16:22:40 : process 40467 (not in monitored family) has exited
> 09/09/19 16:22:40 : no methods have determined process 42997 to be in a 
> monitored family
> 09/09/19 16:22:40 : no methods have determined process 43007 to be in a 
> monitored family
> 09/09/19 16:22:40 : ...snapshot complete
> 09/09/19 16:23:40 : taking a snapshot...
> 09/09/19 16:23:40 : ProcAPI: new boottime = 1568017676; old_boottime = 
> 1568017676; /proc/stat boottime = 1568017676; /proc/uptime boottime = 
> 1568017676
> 09/09/19 16:23:40 : process 43007 (not in monitored family) has exited
> 09/09/19 16:23:40 : no methods have determined process 43055 to be in a 
> monitored family
> 09/09/19 16:23:40 : ...snapshot complete
> 
> SchedLog
> 09/09/19 16:19:42 (pid:42886) 
> ******************************************************
> 09/09/19 16:19:42 (pid:42886) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
> 09/09/19 16:19:42 (pid:42886) ** /usr/sbin/condor_schedd
> 09/09/19 16:19:42 (pid:42886) ** SubsystemInfo: name=SCHEDD 
> type=SCHEDD(5) class=DAEMON(1)
> 09/09/19 16:19:42 (pid:42886) ** Configuration: subsystem:SCHEDD 
> local:<NONE> class:DAEMON
> 09/09/19 16:19:42 (pid:42886) ** $CondorVersion: 8.8.5 Sep 04 2019 
> BuildID: 480168 PackageID: 8.8.5-1 $
> 09/09/19 16:19:42 (pid:42886) ** $CondorPlatform: x86_64_RedHat7 $
> 09/09/19 16:19:42 (pid:42886) ** PID = 42886
> 09/09/19 16:19:42 (pid:42886) ** Log last touched 9/9 16:19:39
> 09/09/19 16:19:42 (pid:42886) 
> ******************************************************
> 09/09/19 16:19:42 (pid:42886) Using config source: /etc/condor/condor_config
> 09/09/19 16:19:42 (pid:42886) Using local config sources:
> 09/09/19 16:19:42 (pid:42886)   /etc/condor/config.d/00master.config
> 09/09/19 16:19:42 (pid:42886)   /etc/condor/condor_config.local
> 09/09/19 16:19:42 (pid:42886) config Macros = 78, Sorted = 78, 
> StringBytes = 2046, TablesBytes = 2856
> 09/09/19 16:19:42 (pid:42886) CLASSAD_CACHING is ENABLED
> 09/09/19 16:19:42 (pid:42886) Daemon Log is logging: D_ALWAYS D_ERROR
> 09/09/19 16:19:42 (pid:42886) SharedPortEndpoint: waiting for 
> connections to named socket 42828_0631_4
> 09/09/19 16:19:42 (pid:42886) DaemonCore: command socket at 
> <192.168.1.206:9618?addrs=192.168.1.206-9618&noUDP&sock=42828_0631_4>
> 09/09/19 16:19:42 (pid:42886) DaemonCore: private command socket at 
> <192.168.1.206:9618?addrs=192.168.1.206-9618&noUDP&sock=42828_0631_4>
> 09/09/19 16:19:42 (pid:42886) History file rotation is enabled.
> 09/09/19 16:19:42 (pid:42886)   Maximum history file size is: 20971520 bytes
> 09/09/19 16:19:42 (pid:42886)   Number of rotated history files is: 2
> 09/09/19 16:19:42 (pid:42886) Reloading job factories
> 09/09/19 16:19:42 (pid:42886) Loaded 0 job factories, 0 were paused, 0 
> failed to load
> 09/09/19 16:19:48 (pid:42886) TransferQueueManager stats: active 
> up=0/100 down=0/100; waiting up=0 down=0; wait time up=0s down=0s
> 09/09/19 16:19:48 (pid:42886) TransferQueueManager upload 1m I/O load: 0 
> bytes/s  0.000 disk load  0.000 net load
> 09/09/19 16:19:48 (pid:42886) TransferQueueManager download 1m I/O load: 
> 0 bytes/s  0.000 disk load  0.000 net load
> 09/09/19 16:29:50 (pid:42886) condor_write(): Socket closed when trying 
> to write 4096 bytes to collector jfisher.ingenazure.com:9618, fd is 14
> 09/09/19 16:29:50 (pid:42886) Buf::write(): condor_write() failed
> 09/09/19 16:34:50 (pid:42886) condor_write(): Socket closed when trying 
> to write 4096 bytes to collector jfisher.ingenazure.com:9618, fd is 14, 
> errno=104 Connection reset by peer
> 09/09/19 16:34:50 (pid:42886) Buf::write(): condor_write() failed
> 09/09/19 16:39:50 (pid:42886) condor_write(): Socket closed when trying 
> to write 4096 bytes to collector jfisher.ingenazure.com:9618, fd is 14
> 09/09/19 16:39:50 (pid:42886) Buf::write(): condor_write() failed
> 09/09/19 16:44:51 (pid:42886) condor_write(): Socket closed when trying 
> to write 4096 bytes to collector jfisher.ingenazure.com:9618, fd is 14, 
> errno=104 Connection reset by peer
> 09/09/19 16:44:51 (pid:42886) Buf::write(): condor_write() failed
> 09/09/19 16:54:52 (pid:42886) condor_write(): Socket closed when trying 
> to write 4096 bytes to collector jfisher.ingenazure.com:9618, fd is 14
> 09/09/19 16:54:52 (pid:42886) Buf::write(): condor_write() failed
> 
> SharedPortLog
> 09/09/19 16:44:01 SharedPortServer: server was busy, failed to connect 
> collector as requested by <192.168.1.206:10396>: primary 
> (f18b97e1577f56b8a325498acd2dde3025439e453b9c2b7f81246ac7f1a621bf/collector): 
> Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): 
> Connection refused (111)
> 09/09/19 16:44:40 About to update statistics in shared_port daemon ad 
> file at /var/lock/condor/shared_port_ad :
> ForkedChildrenPeak = 0
> RequestsBlocked = 4
> ForkedChildrenCurrent = 0
> RequestsSucceeded = 80
> RequestsPendingPeak = 2
> RequestsPendingCurrent = 0
> RequestsFailed = 4
> SharedPortCommandSinfuls = "<192.168.1.206:9618>"
> MyAddress = "<192.168.1.206:9618?addrs=192.168.1.206-9618&noUDP>"
> 


-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685