
[Condor-users] Windows executor nodes lose communication to Linux CM after its restart




Hi all,

When I restart the Condor Central Manager (CM) machine (Linux), I lose communication with the Windows executor nodes that are on another subnet. Executor nodes on the same subnet re-establish their communication with the CM.
Logging in to the executor machine and stopping/starting the Condor service restores communication, but this should not be the normal procedure.
No local firewall is enabled (Windows Firewall is off, and the firewall on the CM machine is also off).
See MasterLog file attached.

Is there any guideline to avoid losing the connection?

Thanks, Klaus

5/16 13:28:49 UnsetEnv(NET_REMAP_ENABLE): SetEnvironmentVariable failed, errno=203
5/16 13:28:49 ******************************************************
5/16 13:28:49 ** Condor (CONDOR_MASTER) STARTING UP
5/16 13:28:49 ** C:\Condor\bin\condor_master.exe
5/16 13:28:49 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
5/16 13:28:49 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
5/16 13:28:49 ** $CondorVersion: 7.2.1 Feb 19 2009 BuildID: 133382 $
5/16 13:28:49 ** $CondorPlatform: INTEL-WINNT50 $
5/16 13:28:49 ** PID = 324
5/16 13:28:49 ** Log last touched 5/16 13:00:18
5/16 13:28:49 ******************************************************
5/16 13:28:49 Using config source: C:\Condor\condor_config
5/16 13:28:49 Using local config sources: 
5/16 13:28:49    \\smbsjk01\grid_env\CONDOR\condor_config.1
5/16 13:28:49    \\smbsjk01\grid_env\CONDOR\1-start\condor_config.master.PC263439
5/16 13:28:49    \\smbsjk01\grid_env\CONDOR\2-main\condor_config.INTEL.WINNT51
5/16 13:28:49    \\smbsjk01\grid_env\CONDOR\2-main\condor_config.common
5/16 13:28:49    \\smbsjk01\grid_env\CONDOR\3-pool\pc222771\condor_config.pool.pc222771
5/16 13:28:49    \\smbsjk01\grid_env\CONDOR\3-pool\pc222771\PC263439\condor_config.local
5/16 13:28:50 DaemonCore: Command Socket at <10.20.12.1:1028>
5/16 13:28:50 Started DaemonCore process "C:\Condor/bin/condor_schedd.exe", pid and pgroup = 920
5/16 13:28:50 Started DaemonCore process "C:\Condor/bin/condor_startd.exe", pid and pgroup = 536
5/16 14:28:50 Preen pid is 3528
5/16 14:28:51 Child 3528 died, but not a daemon -- Ignored
5/17 14:28:50 Preen pid is 2052
5/17 14:28:51 Child 2052 died, but not a daemon -- Ignored
5/18 14:28:50 Preen pid is 3788
5/18 14:28:51 Child 3788 died, but not a daemon -- Ignored
5/19 14:28:50 Preen pid is 488
5/19 14:28:51 Child 488 died, but not a daemon -- Ignored
5/20 14:28:50 Preen pid is 2716
5/20 14:28:51 Child 2716 died, but not a daemon -- Ignored
5/21 10:54:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:54873>.
5/21 10:54:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:54873>.
5/21 10:54:16 IO: Failed to read packet header
5/21 10:54:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 10:59:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:56031>.
5/21 10:59:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:56031>.
5/21 10:59:16 IO: Failed to read packet header
5/21 10:59:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 11:04:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:41212>.
5/21 11:04:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:41212>.
5/21 11:04:16 IO: Failed to read packet header
5/21 11:04:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 11:09:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:58919>.
5/21 11:09:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:58919>.
5/21 11:09:16 IO: Failed to read packet header
5/21 11:09:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 11:14:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:54807>.
5/21 11:14:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:54807>.
5/21 11:14:16 IO: Failed to read packet header
5/21 11:14:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 11:19:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:51189>.
5/21 11:19:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:51189>.
5/21 11:19:16 IO: Failed to read packet header
5/21 11:19:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 11:24:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:52571>.
5/21 11:24:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:52571>.
5/21 11:24:16 IO: Failed to read packet header
5/21 11:24:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 11:29:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:41606>.
5/21 11:29:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:41606>.
5/21 11:29:16 IO: Failed to read packet header
5/21 11:29:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 11:34:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:52965>.
5/21 11:34:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:52965>.
5/21 11:34:16 IO: Failed to read packet header
5/21 11:34:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 11:39:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:37451>.
5/21 11:39:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:37451>.
5/21 11:39:16 IO: Failed to read packet header
5/21 11:39:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 11:44:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:42761>.
5/21 11:44:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:42761>.
5/21 11:44:16 IO: Failed to read packet header
5/21 11:44:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 11:49:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:53223>.
5/21 11:49:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:53223>.
5/21 11:49:16 IO: Failed to read packet header
5/21 11:49:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 11:54:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:59908>.
5/21 11:54:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:59908>.
5/21 11:54:16 IO: Failed to read packet header
5/21 11:54:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 11:59:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:36863>.
5/21 11:59:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:36863>.
5/21 11:59:16 IO: Failed to read packet header
5/21 11:59:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 12:04:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:41069>.
5/21 12:04:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:41069>.
5/21 12:04:16 IO: Failed to read packet header
5/21 12:04:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 12:09:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:60121>.
5/21 12:09:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:60121>.
5/21 12:09:16 IO: Failed to read packet header
5/21 12:09:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 12:14:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:52260>.
5/21 12:14:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:52260>.
5/21 12:14:16 IO: Failed to read packet header
5/21 12:14:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
5/21 12:19:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:52639>.
5/21 12:19:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <10.3.29.209:52639>.
5/21 12:19:16 IO: Failed to read packet header
5/21 12:19:16 DaemonCore: Can't receive command request from 10.3.29.209 (perhaps a timeout?)
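
For anyone reading a MasterLog like the one above, a small script can make the pattern easier to see. The sketch below is not part of Condor; the function name `summarize_recv_errors` and the regular expression are illustrative assumptions based only on the line format shown in this log. It counts the `condor_read(): recv()` failures per peer address and errno and lists the distinct timestamps. In this log the errno is 10054, which is the Winsock code WSAECONNRESET (connection reset by the peer), and the failures recur every five minutes from 10.3.29.209.

```python
import re
from collections import Counter

# Assumed line format, taken from the log above, e.g.:
# 5/21 10:54:16 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <10.3.29.209:54873>.
RECV_ERROR = re.compile(
    r"^(?P<date>\d+/\d+) (?P<time>\d+:\d+:\d+) condor_read\(\): "
    r"recv\(\) returned -1, errno = (?P<errno>\d+), "
    r".* from <(?P<host>[\d.]+):(?P<port>\d+)>"
)


def summarize_recv_errors(lines):
    """Count recv() failures per (peer host, errno) and collect distinct timestamps."""
    counts = Counter()
    stamps = {}
    for line in lines:
        match = RECV_ERROR.match(line.strip())
        if match is None:
            continue  # not a recv() failure line
        key = (match["host"], int(match["errno"]))
        counts[key] += 1
        stamp = f'{match["date"]} {match["time"]}'
        seen = stamps.setdefault(key, [])
        if stamp not in seen:
            seen.append(stamp)
    return counts, stamps


if __name__ == "__main__":
    # The file name is a placeholder; point it at a local copy of the log.
    with open("MasterLog", encoding="utf-8", errors="replace") as handle:
        counts, stamps = summarize_recv_errors(handle)
    for (host, errno), total in counts.items():
        times = stamps[(host, errno)]
        print(f"{host} errno={errno}: {total} failures, "
              f"first {times[0]}, last {times[-1]}")
```

A regular spacing between the timestamps, as seen here, suggests a periodic connection attempt that is being reset rather than random network noise.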