
[HTCondor-users] Intermittent problems in a new pool installation spanning two subnets



Hi,
I have installed HTCondor 23.2.0 in a new pool spanning two subnets,
but there are intermittent communication problems. The logs are
included below.
The ep1ext host is on a different subnet from the central manager.
I have reviewed the firewall rules to allow this traffic in and out,
but the intermittent errors persist.


VB


# tail -30 /var/log/condor/MasterLog 
12/04/23 08:19:09 ERROR: SECMAN:2007:Failed to read resume session response classad from server.
12/04/23 08:19:09 Failed to start non-blocking update to <10.10.0.30:9618>.
12/04/23 08:19:16 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 14959
12/04/23 08:19:16 condor_read(): Socket closed abnormally when trying to read 5 bytes from collector htcondor.sel in non-blocking mode, errno=104 Connection reset by peer
12/04/23 08:19:16 SECMAN: Failed to read resume session response classad from server.
12/04/23 08:19:16 ERROR: SECMAN:2007:Failed to read resume session response classad from server.
12/04/23 08:19:16 Failed to start non-blocking update to <10.10.0.30:9618>.
12/04/23 08:19:19 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 14960
12/04/23 08:27:16 The VIEW_SERVER (pid 14959) exited with status 4
12/04/23 08:27:16 Sending obituary for "/usr/sbin/condor_collector"
12/04/23 08:27:16 my_popenv: Failed to exec /usr/bin/mail, errno=2 (No such file or directory)
12/04/23 08:27:16 Failed to launch mailer process: /usr/bin/mail
12/04/23 08:27:16 restarting /usr/sbin/condor_collector in 10 seconds
12/04/23 08:27:19 The COLLECTOR (pid 14960) exited with status 4
12/04/23 08:27:19 Sending obituary for "/usr/sbin/condor_collector"
12/04/23 08:27:19 my_popenv: Failed to exec /usr/bin/mail, errno=2 (No such file or directory)
12/04/23 08:27:19 Failed to launch mailer process: /usr/bin/mail
12/04/23 08:27:19 restarting /usr/sbin/condor_collector in 10 seconds
12/04/23 08:27:19 condor_write(): Socket closed when trying to write 2417 bytes to collector htcondor.sel, fd is 10
12/04/23 08:27:19 Buf::write(): condor_write() failed
12/04/23 08:27:19 condor_read(): Socket closed abnormally when trying to read 5 bytes from collector htcondor.sel in non-blocking mode, errno=104 Connection reset by peer
12/04/23 08:27:19 SECMAN: Failed to read resume session response classad from server.
12/04/23 08:27:19 ERROR: SECMAN:2007:Failed to read resume session response classad from server.
12/04/23 08:27:19 Failed to start non-blocking update to <10.10.0.30:9618>.
12/04/23 08:27:26 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 14990
12/04/23 08:27:26 condor_read(): Socket closed abnormally when trying to read 5 bytes from collector htcondor.sel in non-blocking mode, errno=104 Connection reset by peer
12/04/23 08:27:26 SECMAN: Failed to read resume session response classad from server.
12/04/23 08:27:26 ERROR: SECMAN:2007:Failed to read resume session response classad from server.
12/04/23 08:27:26 Failed to start non-blocking update to <10.10.0.30:9618>.
12/04/23 08:27:29 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 14991

# tail -30 MasterLog 
12/04/23 07:41:00 ERROR: SECMAN:2004:Server rejected our session id
12/04/23 07:41:00 Failed to start non-blocking update to <10.10.0.30:9618>.
12/04/23 07:51:00 condor_write(): Socket closed when trying to write 2178 bytes to collector htcondor.sel, fd is 10
12/04/23 07:51:00 Buf::write(): condor_write() failed
12/04/23 07:51:00 SECMAN: Server rejected our session id
12/04/23 07:51:00 SECMAN: Invalidating negotiated session rejected by peer
12/04/23 07:51:00 ERROR: SECMAN:2004:Server rejected our session id
12/04/23 07:51:00 Failed to start non-blocking update to <10.10.0.30:9618>.
12/04/23 08:06:00 condor_write(): Socket closed when trying to write 2160 bytes to collector htcondor.sel, fd is 10
12/04/23 08:06:00 Buf::write(): condor_write() failed
12/04/23 08:06:00 SECMAN: Server rejected our session id
12/04/23 08:06:00 SECMAN: Invalidating negotiated session rejected by peer
12/04/23 08:06:00 ERROR: SECMAN:2004:Server rejected our session id
12/04/23 08:06:00 Failed to start non-blocking update to <10.10.0.30:9618>.
12/04/23 08:11:00 condor_read(): Socket closed abnormally when trying to read 5 bytes from collector htcondor.sel in non-blocking mode, errno=104 Connection reset by peer
12/04/23 08:11:00 SECMAN: no classad from server, failing
12/04/23 08:11:00 ERROR: SECMAN:2007:Failed to end classad message.
12/04/23 08:11:00 Failed to start non-blocking update to <10.10.0.30:9618>.
12/04/23 08:21:00 condor_write(): Socket closed when trying to write 2178 bytes to collector htcondor.sel, fd is 10
12/04/23 08:21:00 Buf::write(): condor_write() failed
12/04/23 08:21:00 SECMAN: Server rejected our session id
12/04/23 08:21:00 SECMAN: Invalidating negotiated session rejected by peer
12/04/23 08:21:00 ERROR: SECMAN:2004:Server rejected our session id
12/04/23 08:21:00 Failed to start non-blocking update to <10.10.0.30:9618>.
12/04/23 08:31:00 condor_write(): Socket closed when trying to write 2178 bytes to collector htcondor.sel, fd is 10
12/04/23 08:31:00 Buf::write(): condor_write() failed
12/04/23 08:31:00 SECMAN: Server rejected our session id
12/04/23 08:31:00 SECMAN: Invalidating negotiated session rejected by peer
12/04/23 08:31:00 ERROR: SECMAN:2004:Server rejected our session id
12/04/23 08:31:00 Failed to start non-blocking update to <10.10.0.30:9618>.

# condor_status
Name             OpSys      Arch   State     Activity LoadAv Mem    ActvtyTime

slot1@xxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 32130  0+11:09:39

               Total Owner Claimed Unclaimed Matched Preempting  Drain Backfill BkIdle

  X86_64/LINUX     1     0       0         1       0          0      0        0      0

         Total     1     0       0         1       0          0      0        0      0

# condor_status -master
Name         Version        Cpus   Memory      Uptime    

ep1ext.sel   23.2.0.Package   12    31.4 GB    0+11:10:06
htcondor.sel 23.2.0.Package   24    31.4 GB    0+11:09:47
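
To make the pattern in the MasterLog excerpts above easier to see, here is a minimal sketch that tallies the "ERROR: SECMAN:<code>:<message>" lines by code and message. It is not part of HTCondor; the function names and the regular expression are assumptions based only on the line format shown in the logs above, so other log formats (for example, different timestamp settings) would need a different pattern.

```python
import re
from collections import Counter

# Matches lines such as:
#   12/04/23 08:31:00 ERROR: SECMAN:2004:Server rejected our session id
# The pattern is an assumption derived from the excerpts in this post.
SECMAN_ERROR = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ERROR: SECMAN:(\d+):(.*)$"
)


def summarize_secman_errors(lines):
    """Count SECMAN error lines, keyed by (code, message)."""
    counts = Counter()
    for line in lines:
        match = SECMAN_ERROR.match(line.strip())
        if match:
            _timestamp, code, message = match.groups()
            counts[(code, message.strip())] += 1
    return counts


def summarize_file(path):
    """Read a log file and return the same counts as summarize_secman_errors."""
    # errors="replace" keeps the scan going if the log has odd bytes.
    with open(path, errors="replace") as log:
        return summarize_secman_errors(log)
```

For example, calling summarize_file("/var/log/condor/MasterLog") on each host and printing the result of .most_common() shows which SECMAN codes dominate on that host.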