
[HTCondor-users] HTCondor stopping on windows execute nodes - DC_AUTHENTICATE errors



Hi All

Our pool(s) of Windows 10 execute nodes are running HTCondor version 8.6.12 (32-bit). Our Windows Server 2016 submit nodes
are running HTCondor version 8.6.13 (32-bit). Our Ubuntu 16.04 central managers are running HTCondor version 8.6.10 (64-bit).

While jobs are running, the HTCondor daemons are shutting down on many of the execute nodes.

Below is an excerpt from the MasterLog of one of them. I have redacted the IP address, but FYI it is the address of the execute node itself,
i.e. the node seems to be having trouble talking to itself?

Any help/suggestions appreciated.

Cheers

Greg


04/01/20 14:12:02 DC_AUTHENTICATE: attempt to open invalid session 5aa8a4fbe57292c8f531046e5b9e181aa7425f2482f934c3, failing; this session was requested by <xxx.xxx.xxx.222:53729> with return address <xxx.xxx.xxx.222:9309?addrs=xxx.xxx.xxx.222-9309>
04/01/20 14:12:02 DCMessenger::startCommand(DC_INVALIDATE_KEY,...) making non-blocking connection to <xxx.xxx.xxx.222:9309?addrs=xxx.xxx.xxx.222-9309>
04/01/20 14:12:02 Calling Handler <SecManStartCommand::WaitForSocketCallback DC_INVALIDATE_KEY> (2)
04/01/20 14:12:02 Return from Handler <SecManStartCommand::WaitForSocketCallback DC_INVALIDATE_KEY> 0.000492s
04/01/20 14:12:02 Calling Handler <DaemonCommandProtocol::WaitForSocketData> (2)
04/01/20 14:12:02 Return from Handler <DaemonCommandProtocol::WaitForSocketData> 0.000873s
04/01/20 14:12:02 Calling Handler <DaemonCommandProtocol::WaitForSocketData> (2)
04/01/20 14:12:02 Return from Handler <DaemonCommandProtocol::WaitForSocketData> 0.002681s
04/01/20 14:12:02 Calling HandleReq <HandleSigCommand()> (0) for command 60000 (DC_RAISESIGNAL) from SYSTEM <xxx.xxx.xxx.222:58267>
04/01/20 14:12:02 Return from HandleReq <HandleSigCommand()> (handler: 0.000010s, sec: 0.000s, payload: 0.000s)
04/01/20 14:12:02 Got SIGTERM. Performing graceful shutdown.
04/01/20 14:12:02 Daemon::startCommand(DC_RAISESIGNAL,...) making connection to <xxx.xxx.xxx.222:9309>
04/01/20 14:12:02 Sent signal 15 to STARTD (pid 12248)
04/01/20 14:12:02 Calling HandleReq <HandleSigCommand()> (0) for command 60000 (DC_RAISESIGNAL) from SYSTEM <xxx.xxx.xxx.222:58275>
04/01/20 14:12:02 Return from HandleReq <HandleSigCommand()> (handler: 0.000002s, sec: 0.000s, payload: 0.000s)
04/01/20 14:12:06 Calling HandleReq <HandleSigCommand()> (0) for command 60000 (DC_RAISESIGNAL) from SYSTEM <xxx.xxx.xxx.222:49594>
04/01/20 14:12:06 Return from HandleReq <HandleSigCommand()> (handler: 0.000003s, sec: 0.000s, payload: 0.000s)
04/01/20 14:12:06 Got SIGQUIT.  Performing fast shutdown.
04/01/20 14:12:06 Timeout for graceful shutdown has expired for STARTD.
04/01/20 14:12:06 Daemon::startCommand(DC_RAISESIGNAL,...) making connection to <xxx.xxx.xxx.222:9309>
04/01/20 14:12:06 Sent signal 3 to STARTD (pid 12248)
04/01/20 14:12:32 DaemonCore: pid 12248 exited with status 0, invoking reaper 1 <Daemons::AllReaper()>
04/01/20 14:12:32 AllReaper unexpectedly called on pid 12248, status 0.
04/01/20 14:12:32 The STARTD (pid 12248) exited with status 0
04/01/20 14:12:32 All daemons are gone.  Exiting.
04/01/20 14:12:32 **** Condor (condor_MASTER) pid 16260 EXITING WITH STATUS 0
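
To see which peer sent each shutdown signal, and whether an invalid-session error preceded it, an excerpt like the one above can be reduced to its key events. The following is only a rough sketch: the function name summarize_master_log and the event labels are made up for illustration, and the patterns are inferred solely from the lines in this excerpt, so they may not match other HTCondor versions or debug levels.

```python
import re

# Timestamp prefix as it appears in the excerpt, e.g. "04/01/20 14:12:02 ".
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
# Incoming DC_RAISESIGNAL command: captures the authenticated user and peer address.
SIGNAL_CMD_RE = re.compile(r"\(DC_RAISESIGNAL\) from (\S+) <([^>]+)>")


def summarize_master_log(lines):
    """Return (timestamp, kind, detail) tuples for the events of interest.

    The kinds are hypothetical labels chosen for this sketch:
    "invalid_session", "signal_command" and "shutdown".
    """
    events = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            continue  # skip lines without the expected timestamp prefix
        stamp, msg = match.groups()
        if msg.startswith("DC_AUTHENTICATE: attempt to open invalid session"):
            events.append((stamp, "invalid_session", ""))
            continue
        sig = SIGNAL_CMD_RE.search(msg)
        if sig:
            # Record who asked the master to raise a signal.
            events.append((stamp, "signal_command", f"{sig.group(1)} {sig.group(2)}"))
        elif msg.startswith("Got SIG"):
            # The master's own note about which shutdown it is performing.
            events.append((stamp, "shutdown", msg))
    return events
```

Passing the lines of a MasterLog file (for example, an open file object) to this function yields a short timeline in which each "shutdown" entry can be matched against the "signal_command" entry just before it.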