
[HTCondor-users] Cannot register HTCondor node by condor_restart



Hi,

Occasionally, a node or two drops out of our cluster's pool, and I am unable to reconnect the node to the pool with condor_restart. This is what I see in the CollectorLog on my condor host when I issue a condor_restart to one of the dropped nodes (192.168.56.104, or srv03.hpc-dev.spookfish.com):

10/20/16 08:02:35 condor_read(): Socket closed when trying to read 5 bytes from <192.168.56.104:51449>
10/20/16 08:02:35 condor_read(): Socket closed when trying to read 5 bytes from <192.168.56.104:51449> in non-blocking mode
10/20/16 08:02:35 IO: EOF reading packet header
10/20/16 08:02:35 DaemonCore: Can't receive command request from 192.168.56.104 (perhaps a timeout?)
10/20/16 08:02:35 Got INVALIDATE_SCHEDD_ADS
10/20/16 08:02:35 **** Removed(1) ad(s): "< srv03.hpc-dev.spookfish.com , 192.168.56.104 >"
10/20/16 08:02:35 (Invalidated 1 ads)
10/20/16 08:02:35 In OfflineCollectorPlugin::update ( 14 )
10/20/16 08:02:35 condor_read(): Socket closed when trying to read 5 bytes from <192.168.56.104:37344>
10/20/16 08:02:35 condor_read(): Socket closed when trying to read 5 bytes from <192.168.56.104:37344> in non-blocking mode
10/20/16 08:02:35 IO: EOF reading packet header
10/20/16 08:02:35 DaemonCore: Can't receive command request from 192.168.56.104 (perhaps a timeout?)
10/20/16 08:02:35 condor_read(): Socket closed when trying to read 5 bytes from <192.168.56.104:59649>
10/20/16 08:02:35 condor_read(): Socket closed when trying to read 5 bytes from <192.168.56.104:59649> in non-blocking mode
10/20/16 08:02:35 IO: EOF reading packet header
10/20/16 08:02:35 DaemonCore: Can't receive command request from 192.168.56.104 (perhaps a timeout?)
10/20/16 08:02:39 StartdAd     : Inserting ** "< slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx , 192.168.56.104 >"
10/20/16 08:02:39 StartdPvtAd  : Inserting ** "< slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx , 192.168.56.104 >"
10/20/16 08:02:39 In OfflineCollectorPlugin::update ( 0 )
10/20/16 08:02:39 Registered TCP socket from <192.168.56.104:34800> for updates.
10/20/16 08:02:40 MasterAd     : Updating ... "< srv03.hpc-dev.spookfish.com >"
10/20/16 08:02:40 In OfflineCollectorPlugin::update ( 2 )
10/20/16 08:02:40 Registered TCP socket from <192.168.56.104:52630> for updates.
10/20/16 08:03:00 Got QUERY_STARTD_PVT_ADS
10/20/16 08:03:00 ForkWorker::Fork: New child of 14255 = 14459
10/20/16 08:03:00 Number of Active Workers 0
10/20/16 08:03:00 (Sending 4 ads in response to query)
10/20/16 08:03:00 Query info: matched=4; skipped=0; query_time=0.000969; send_time=0.000492; type=MachinePrivate; requirements={true}; peer=<192.168.56.100:51532>; projection={}
10/20/16 08:03:00 ForkWork: Child 14459 done, status 0
10/20/16 08:03:00 DaemonCore: No more children processes to reap.
10/20/16 08:03:00 Got QUERY_ANY_ADS
10/20/16 08:03:00 ForkWorker::Fork: New child of 14255 = 14460
10/20/16 08:03:00 Number of Active Workers 0
10/20/16 08:03:00 (Sending 7 ads in response to query)
10/20/16 08:03:00 Query info: matched=7; skipped=7; query_time=0.001015; send_time=0.002891; type=Any; requirements={( ( ( MyType == "Scheduler" ) || ( MyType == "Submitter" ) ) || ( ( MyType == "Machine" ) ) )}; peer=<192.168.56.100:42004>; projection={}
10/20/16 08:03:00 ForkWork: Child 14460 done, status 0
10/20/16 08:03:00 DaemonCore: No more children processes to reap.
10/20/16 08:03:02 StartdAd     : Updating ... "< slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx , 192.168.56.104 >"
10/20/16 08:03:02 StartdPvtAd  : Updating ... "< slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx , 192.168.56.104 >"
10/20/16 08:03:02 In OfflineCollectorPlugin::update ( 0 )
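To make the sequence for the affected node easier to follow, the ad events can be pulled out of the log with a small script. This is only a rough sketch: the regular expressions are based solely on the line layout in the excerpt above, the names `ad_events`, `AD_EVENT`, `REMOVED` and `SAMPLE` are made up for illustration, and other HTCondor versions or debug levels may format these lines differently.

```python
import re

# Timestamp prefix as it appears in the excerpt, e.g. "10/20/16 08:02:39 ".
STAMP = r'^(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) '

# e.g. StartdAd     : Inserting ** "< slot1@host , 192.168.56.104 >"
AD_EVENT = re.compile(
    STAMP + r'(?P<ad_type>\w+)\s*: (?P<action>Inserting|Updating)\b.*"<(?P<name>[^>]*)>"'
)

# e.g. **** Removed(1) ad(s): "< srv03.hpc-dev.spookfish.com , 192.168.56.104 >"
REMOVED = re.compile(STAMP + r'\*+ Removed\(\d+\) ad\(s\): "<(?P<name>[^>]*)>"')


def ad_events(lines, needle):
    """Return (timestamp, ad type, action) for ad events whose name contains needle."""
    events = []
    for line in lines:
        m = AD_EVENT.match(line)
        if m and needle in m.group("name"):
            events.append((m.group("stamp"), m.group("ad_type"), m.group("action")))
            continue
        m = REMOVED.match(line)
        if m and needle in m.group("name"):
            # The Removed line itself does not name the ad type, so use a placeholder.
            events.append((m.group("stamp"), "ad", "Removed"))
    return events


# A few lines copied from the excerpt above.
SAMPLE = [
    '10/20/16 08:02:35 Got INVALIDATE_SCHEDD_ADS',
    '10/20/16 08:02:35 **** Removed(1) ad(s): "< srv03.hpc-dev.spookfish.com , 192.168.56.104 >"',
    '10/20/16 08:02:39 StartdAd     : Inserting ** "< slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx , 192.168.56.104 >"',
    '10/20/16 08:03:02 StartdAd     : Updating ... "< slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx , 192.168.56.104 >"',
]

if __name__ == "__main__":
    for stamp, ad_type, action in ad_events(SAMPLE, "192.168.56.104"):
        print(stamp, ad_type, action)
```

To run it against the real log, pass the lines of the CollectorLog file instead of `SAMPLE`, for example `ad_events(open(path), "192.168.56.104")`, where `path` is wherever the CollectorLog lives on the condor host.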

Sometimes, I can make the node reconnect by killing all the condor processes, then restarting condor_master on that node.

What's going on here?

Many thanks in advance for any help.

Kind Regards
Jason
