
Re: [HTCondor-users] SharedPortServer: server was busy

On 2/15/2016 8:45 AM, Vladimir Brik wrote:

SharedPortLog file on our central manager has a lot of entries like:

SharedPortServer: server was busy, failed to connect to collector as
requested by <>: Resource temporarily unavailable

Sometimes, I see hundreds of such messages generated per second every
few minutes.

Is the problem that the collector doesn't respond quickly enough, or
that shared_port can't handle the volume of connections, or something else?

It is the first case you mention. The shared_port daemon tried to forward the connection to the collector, but the collector's listen queue was full because the collector was not accepting connections quickly enough.

Are there any configuration tweaks I could try to alleviate this?

What version of HTCondor are you running? (It is always a good idea to let us know.)

A while back we fixed a bug where the collector would periodically pause when it was configured to use shared_port. I think this was ultimately fixed in v8.4.4+ of the stable series and v8.5.2+ of the development series. If this is the problem, then simply upgrading should fix it; if you cannot upgrade for some reason, you can turn off shared port by setting USE_SHARED_PORT=False. This would be my first guess, especially if your collector seemed to be doing just fine before you started using it in conjunction with shared_port.
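To check whether the messages arrive in regular bursts, which would fit the periodic-pause bug better than steady overload, a rough sketch like the following could bucket them by minute. It assumes each log line begins with a timestamp of the form MM/DD/YY HH:MM:SS (adjust the pattern if your log format differs), and the function name is made up for illustration.

```python
import re
from collections import Counter

# Assumption: each log line starts with "MM/DD/YY HH:MM:SS ".
# The first group captures the timestamp truncated to the minute.
BUSY_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}):\d{2} .*SharedPortServer: server was busy"
)


def busy_counts_per_minute(lines):
    """Count 'server was busy' messages per minute (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = BUSY_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing the open SharedPortLog file to busy_counts_per_minute() and printing the result should show whether the spikes recur at a fixed interval or are spread evenly over time.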

But another possibility is that your collector is simply overloaded. Some possible problems with pithy solutions:

Q: Do you use strong authentication (SSL, GSI, etc.) to your collector, especially if you have execute nodes spread out over wide-area connections (i.e. high-latency networks)? A: Consider horizontally scaling the collector as described here:

Q: Do you have a lot (thousands) of slots behind private networks and thus need to use CCB? A: Consider running additional instances of the condor_collector just to handle CCB requests, separate from your central manager collector.

Q: Do you have a lot of users or monitoring scripts constantly running condor_status? A: Consider increasing the COLLECTOR_QUERY_WORKERS setting in your central manager's condor_config to gain collector query performance at the cost of greater memory usage.

Hope the above helps,