
[HTCondor-users] Schedds unresponsive when 1 of 2 HA central managers is down



Hi,

In our HTCondor pool we use an HA central manager setup with two machines. I've noticed that if one of the two machines is down, the schedds (on other machines) become less responsive than usual. condor_q sometimes takes a long time or fails with an error like this:

-- Failed to fetch ads from: <*.*.*.*:45291> : <schedd hostname>
SECMAN:2007:Failed to end classad message.

There are, as expected, lots of errors in /var/log/condor/SchedLog about not being able to connect to the collector which is down, e.g. [1], [2] (IP address changed for this email).

The config file for each machine contains:

COLLECTOR_HOST = condor01.xxx, condor02.xxx

If I remove the machine that is down from the schedd's config file, the problem disappears. The problem also doesn't happen, of course, when both central managers are up.
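
For reference, here is a minimal sketch of how one might check from a schedd host which of the listed central managers is reachable. It only attempts a plain TCP connection to each entry in the COLLECTOR_HOST value; it does not speak the HTCondor protocol or use any HTCondor API. The function names parse_collector_hosts and is_reachable are hypothetical helpers made up for this illustration, and the default port 9618 is taken from the log excerpts below.

```python
import socket


def parse_collector_hosts(value, default_port=9618):
    """Split a COLLECTOR_HOST-style value into (host, port) pairs.

    Entries are comma-separated; an entry without an explicit ":port"
    gets default_port (9618 here, as seen in the SchedLog excerpts).
    """
    pairs = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.partition(":")
        pairs.append((host, int(port) if sep else default_port))
    return pairs


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This is only a connectivity probe (an assumption for illustration);
    a successful connect does not prove the collector itself is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, "No route to host",
        # and name-resolution failures.
        return False
```

Calling parse_collector_hosts("condor01.xxx, condor02.xxx") and passing each resulting pair to is_reachable would show which central manager the schedd host cannot reach at the TCP level.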

Is this behaviour expected, or could it be caused by my configuration in some way? I'm using HTCondor 8.0.0.

Thanks,
Andrew.

[1]
07/11/13 07:21:23 Calling Handler <SecManStartCommand::WaitForSocketCallback UPDATE_SCHEDD_AD> (5)
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Failed to create security session to <*.*.*.*:9618> with TCP.|SECMAN:2003:TCP connection to <*.*.*.*:9618> failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 Return from Handler <SecManStartCommand::WaitForSocketCallback UPDATE_SCHEDD_AD> 0.0003s
07/11/13 07:21:23 Calling Handler <SecManStartCommand::WaitForSocketCallback UPDATE_SUBMITTOR_AD> (6)
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Failed to create security session to <*.*.*.*:9618> with TCP.|SECMAN:2003:TCP connection to <*.*.*.*:9618> failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 Return from Handler <SecManStartCommand::WaitForSocketCallback UPDATE_SUBMITTOR_AD> 0.0005s

[2]
07/11/13 07:27:05 (pid:27146) attempt to connect to <*.*.*.*:9618> failed: No route to host (connect errno = 113).  Will keep trying for 20 total seconds (18 to go).
