HTCondor Project List Archives




[Condor-devel] RFC: GCB optimization for local network communication



a condor pool that is GCB enabled will currently always communicate via the GCB broker, even between 2 machines on the same private network. this creates needless performance bottlenecks and potential failure points.

to fix it, we need daemons to advertise not only their GCB-provided public IP/port, but also a) their real, local IP/port and b) some unique network identifier. lots of machines could be at IP 192.168.2.*, even if they're in totally different networks and have no way to contact each other directly, so just knowing the real IPs ("i'm 192.168.2.3, and i'm trying to talk to 192.168.2.4") doesn't tell you if you need GCB or not.


------------------
proposal part 1:
------------------

instead of jumping through lots of hoops to try to uniquely identify machines or networks in the code, we just punt to the admins. just like they have to specify a unique UID_DOMAIN for that stuff to work, if they're setting up GCB, they have to set up a unique NETWORK_NAME (exact name TBD) and we just use whatever they say. if 2 machines are in the same NETWORK_NAME, they can assume direct communication and avoid GCB (provided they have the real local IP/port, not just the public IP/port in the canonical sinful string).
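
to make the rule concrete, here's a minimal sketch in python (not condor code). the Endpoint type, the choose_addr name, and all the addresses are made up for illustration; the only thing taken from the proposal is the rule itself: same NETWORK_NAME plus a known local IP/port means go direct, anything else goes through GCB.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Endpoint:
    """Hypothetical view of what a daemon would advertise."""
    public_addr: str             # GCB-provided "ip:port"
    network_name: Optional[str]  # admin-configured NETWORK_NAME, if any
    local_addr: Optional[str]    # real local "ip:port", if advertised


def choose_addr(me: Endpoint, peer: Endpoint) -> str:
    """Return the address to use when 'me' wants to contact 'peer'."""
    same_net = (
        me.network_name is not None
        and me.network_name == peer.network_name
    )
    # Go direct only if both sides claim the same network AND we
    # actually know the peer's real local address.
    if same_net and peer.local_addr is not None:
        return peer.local_addr
    # Otherwise fall back to the GCB broker address.
    return peer.public_addr


a = Endpoint("128.105.1.1:9000", "cs-private", "192.168.2.3:9618")
b = Endpoint("128.105.1.1:9001", "cs-private", "192.168.2.4:9618")
# Same local IP as b, but on an unrelated private network.
c = Endpoint("128.105.7.7:9000", "physics-private", "192.168.2.4:9618")
```

note that b and c have the identical local IP, which is exactly why the IP alone can't decide anything; only the admin-supplied name tells them apart.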


------------------
proposal part 2:
------------------

how will daemons know the net_name + real ip/port? one avenue i've been investigating is to modify the format of sinful strings, and include all this additional info. something like:

StartdIpAddr = "<public_ip:port><network_name:local_ip:port>"
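
for reference, here's roughly what parsing that extended format could look like, as a python sketch. the Sinful type and parse_sinful are hypothetical names (this is not the existing sin_to_string/string_to_port code), it only handles IPv4-style "ip:port", and the second <...> group is treated as optional so an old-style string still parses.

```python
import re
from typing import NamedTuple, Optional


class Sinful(NamedTuple):
    """Hypothetical parsed form of the proposed extended sinful string."""
    public_ip: str
    public_port: int
    network_name: Optional[str]
    local_ip: Optional[str]
    local_port: Optional[int]


# "<public_ip:port>" followed by an optional "<network_name:local_ip:port>"
_PATTERN = re.compile(
    r"^<(?P<pub_ip>[^:<>]+):(?P<pub_port>\d+)>"
    r"(?:<(?P<net>[^:<>]+):(?P<loc_ip>[^:<>]+):(?P<loc_port>\d+)>)?$"
)


def parse_sinful(s: str) -> Sinful:
    m = _PATTERN.match(s)
    if m is None:
        raise ValueError(f"not a sinful string: {s!r}")
    if m.group("net") is None:
        # Old-style string: public address only.
        return Sinful(m.group("pub_ip"), int(m.group("pub_port")),
                      None, None, None)
    return Sinful(m.group("pub_ip"), int(m.group("pub_port")),
                  m.group("net"), m.group("loc_ip"),
                  int(m.group("loc_port")))
```

the parsing itself is the easy part; the trouble is every existing piece of code that assumes the old single-group format and its maximum length.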

sadly, there are 698 call sites that reference "sinful" in our source, and an additional 137 that use one of the sinful-string related helper functions (sin_to_string, string_to_port, etc, etc). after spending quite a bit of time looking at this code, it's clear we're basically doomed if we change the format of the strings like this. old daemons *will* seg fault (static buffers in DaemonCore, among many other places) if the size of the sinful string more than doubles. a lot of code will do utterly wrong things. :(

so, we have 2 real options:

a) have our "network incompatibility flag day", declare that 6.9.x is utterly incompatible with everything before it, and change the format of the sinful string however we want. while we're at it, we'd probably change the names of the classad attributes, so we just use "MyIpAddr" everywhere, instead of "StartdIpAddr" vs. "ScheddIpAddr", etc, etc. we could also rip out at least 1000 lines of code, maybe more, of cruft/bloat from our varied attempts to maintain backwards compatibility.

b) forget about changing the existing sinful-string related attributes and functions, and handle this GCB optimization with a brand new classad attr, something like:

RealNetworkId = "<network_name:ip:port>"

this would be the admin-specified network_name, and the real local IP/port. then, we'd just have to incrementally change parts of the Condor code to make use of this new attribute and do the optimization. it seems like with relatively small changes (mostly to DaemonCore and DaemonClient) we could handle a major portion of the network communications. we might miss some outlying cases in the first pass, but we could fix those in stages. everything would continue to be compatible, and would work... it's just a question of whether a given connection could use this optimization to skip talking to the GCB broker or not.
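
a rough python sketch of how a client could use the new attribute under option (b). classads are modeled as plain dicts here, and parse_real_network_id / contact_addr are hypothetical helper names, not anything in DaemonCore. the point is the fallback: if either ad lacks RealNetworkId (e.g. an old daemon), or the names differ, we just use the existing sinful string and go through GCB as before.

```python
from typing import Mapping, Optional, Tuple


def parse_real_network_id(value: str) -> Optional[Tuple[str, str]]:
    """Split "<network_name:ip:port>" into (network_name, "ip:port").

    Returns None for anything that doesn't look like that format.
    """
    if not (value.startswith("<") and value.endswith(">")):
        return None
    parts = value[1:-1].split(":")
    if len(parts) != 3 or not all(parts) or not parts[2].isdigit():
        return None
    name, ip, port = parts
    return name, f"{ip}:{port}"


def contact_addr(my_ad: Mapping[str, str],
                 peer_ad: Mapping[str, str],
                 addr_attr: str = "StartdIpAddr") -> str:
    """Pick the sinful string to use when contacting the peer."""
    mine = parse_real_network_id(my_ad.get("RealNetworkId", ""))
    theirs = parse_real_network_id(peer_ad.get("RealNetworkId", ""))
    if mine and theirs and mine[0] == theirs[0]:
        # Same admin-declared network: talk to the real local address.
        return f"<{theirs[1]}>"
    # Old daemon, missing attribute, or different network: use GCB.
    return peer_ad[addr_attr]


me = {"StartdIpAddr": "<128.105.1.1:9000>",
      "RealNetworkId": "<cs-private:192.168.2.3:9618>"}
peer = {"StartdIpAddr": "<128.105.1.1:9001>",
        "RealNetworkId": "<cs-private:192.168.2.4:9618>"}
old_peer = {"StartdIpAddr": "<128.105.1.1:9002>"}  # no new attribute
```

since the value handed back is still an ordinary single-group sinful string, none of the existing sinful-string code has to see a new format.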


option (a) certainly has a lot of appeal, but it's a rather huge change for what is ultimately a pretty small subset of our users. i'd still *love* to purge as much compatibility cruft and bloat as possible, but this might not be the best time/reason to do so.

given all the facts, i'm voting for b. i'd like a decision ASAP so we can try to get as much of this done this week while i'm in town.

thanks,
-derek