
Re: [HTCondor-users] Schedds unresponsive when 1 of 2 HA central managers is down



Hi Andrew,

It would be helpful to see your full schedd log. Perhaps you could post that to condor-admin?

I'd be looking for cases where the schedd blocked for many seconds while trying to communicate with the dead collector. There are protections against this sort of thing, but perhaps a case has slipped through the cracks.
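As a rough way to spot such cases, here is a minimal Python sketch that scans log lines for handler timings. It assumes the log contains "Return from Handler <...> N.NNNNs" lines in the same format as the excerpt quoted below; the function name `slow_handlers` and the 1-second default threshold are illustrative choices, not anything HTCondor provides.

```python
import re

# Matches lines in the format seen in the quoted SchedLog excerpt, e.g.
#   07/11/13 07:21:23 Return from Handler <Name args> 0.0003s
# An optional prefix such as "(pid:27146) " after the timestamp is tolerated.
HANDLER_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) .*?"
    r"Return from Handler <(.+)> (\d+(?:\.\d+)?)s\s*$"
)


def slow_handlers(lines, threshold_s=1.0):
    """Return (timestamp, handler, seconds) for handlers that took
    at least threshold_s seconds. threshold_s is an arbitrary default."""
    hits = []
    for line in lines:
        m = HANDLER_RE.match(line)
        if m is None:
            continue
        seconds = float(m.group(3))
        if seconds >= threshold_s:
            hits.append((m.group(1), m.group(2), seconds))
    return hits
```

To use it, pass an open log file, for example `slow_handlers(open("/var/log/condor/SchedLog"))`. Handlers that report many seconds would be the ones worth a closer look.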

--Dan

On 7/11/13 1:55 AM, andrew.lahiff@xxxxxxxxxx wrote:
Hi,

In our HTCondor pool we run an HA central manager setup on 2 machines. I've noticed that if 1 of these 2 machines is down, the schedds (on other machines) become less responsive than usual. condor_q sometimes takes a long time or fails with an error like this:

-- Failed to fetch ads from: <*.*.*.*:45291> : <schedd hostname>
SECMAN:2007:Failed to end classad message.

There are, as expected, lots of errors in /var/log/condor/SchedLog about not being able to connect to the collector which is down, e.g. [1], [2] (IP address changed for this email).

The config file for each machine contains:

COLLECTOR_HOST = condor01.xxx, condor02.xxx

If I remove the machine which is down from the schedd's config file the problem disappears. Also, the problem of course doesn't happen when both central managers are up.
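For reference, a minimal Python sketch of checking which of the configured collectors accept TCP connections. It assumes COLLECTOR_HOST is a plain comma-separated list of hostnames, optionally with ":port", and defaults to port 9618 as seen in the log excerpts; the helper names `parse_collector_hosts` and `is_reachable` are hypothetical and not part of HTCondor.

```python
import socket


def parse_collector_hosts(value, default_port=9618):
    """Split a COLLECTOR_HOST-style value into (host, port) pairs.
    Assumes simple "host" or "host:port" entries separated by commas."""
    hosts = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.partition(":")
        hosts.append((host, int(port) if sep else default_port))
    return hosts


def is_reachable(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds within
    timeout_s seconds, False on any connection error or timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

Calling `is_reachable(host, port)` on each pair from `parse_collector_hosts("condor01.xxx, condor02.xxx")` shows which central manager is currently down. This only tests the TCP connection, not whether the collector itself is healthy.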

Is this behaviour expected, or could it be caused by my configuration in some way? I'm using HTCondor 8.0.0.

Thanks,
Andrew.

[1]
07/11/13 07:21:23 Calling Handler <SecManStartCommand::WaitForSocketCallback UPDATE_SCHEDD_AD> (5)
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Failed to create security session to <*.*.*.*:9618> with TCP.|SECMAN:2003:TCP connection to <*.*.*.*:9618> failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 Return from Handler <SecManStartCommand::WaitForSocketCallback UPDATE_SCHEDD_AD> 0.0003s
07/11/13 07:21:23 Calling Handler <SecManStartCommand::WaitForSocketCallback UPDATE_SUBMITTOR_AD> (6)
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <*.*.*.*:9618>, but it failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 ERROR: SECMAN:2004:Failed to create security session to <*.*.*.*:9618> with TCP.|SECMAN:2003:TCP connection to <*.*.*.*:9618> failed.
07/11/13 07:21:23 Failed to start non-blocking update to <*.*.*.*:9618>.
07/11/13 07:21:23 Return from Handler <SecManStartCommand::WaitForSocketCallback UPDATE_SUBMITTOR_AD> 0.0005s

[2]
07/11/13 07:27:05 (pid:27146) attempt to connect to <*.*.*.*:9618> failed: No route to host (connect errno = 113).  Will keep trying for 20 total seconds (18 to go).