8. Additional Options for Design

8.1. Third-Party Route Injection

BGP allows for a "third-party", i.e., a directly attached BGP speaker, to inject routes anywhere in the network topology, meeting REQ5. This can be achieved by peering via a multi-hop BGP session with some or even all devices in the topology. Furthermore, BGP diverse path distribution [RFC6774] or, if supported by the implementation, the BGP ADD-PATH capability [RFC7911] could be used to inject multiple BGP next hops for the same prefix to facilitate load balancing. Unfortunately, in many implementations, ADD-PATH has been found to only support IBGP properly in the use cases for which it was originally optimized; this limits the "third-party" peering to IBGP only.

To implement route injection in the proposed design, a third-party BGP speaker may peer with Tier 3 and Tier 1 switches, injecting the same prefix, but using a special set of BGP next hops for Tier 1 devices. Those next hops are assumed to resolve recursively via BGP, and could be, for example, IP addresses on Tier 3 devices. The resulting forwarding table programming could provide desired traffic proportion distribution among different clusters.
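The traffic proportion that results from a given set of injected next hops can be estimated with a short sketch. It assumes, purely for illustration, that a Tier 1 device hashes flows evenly across the injected next hops; the function name, the addresses, and the cluster labels are hypothetical and not part of the design.

```python
from collections import Counter
from fractions import Fraction


def cluster_shares(injected_next_hops, cluster_of):
    """Estimate the fraction of traffic each cluster receives.

    injected_next_hops: next-hop addresses attached to the injected prefix
                        (assumed to be addresses on Tier 3 devices).
    cluster_of:         mapping from next-hop address to a cluster label.
    Assumption: the Tier 1 device spreads flows evenly over the next hops.
    """
    counts = Counter(cluster_of[nh] for nh in injected_next_hops)
    total = sum(counts.values())
    return {cluster: Fraction(n, total) for cluster, n in counts.items()}


# Hypothetical example: two next hops in one cluster, one in another.
cluster_of = {
    "192.0.2.1": "cluster1",
    "192.0.2.2": "cluster1",
    "198.51.100.1": "cluster2",
}
shares = cluster_shares(list(cluster_of), cluster_of)
```

Under that assumption, listing more next hops from one cluster shifts a proportionally larger share of traffic toward it.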

8.2. Route Summarization within Clos Topology

As mentioned previously, route summarization is not possible within the proposed Clos topology since it makes the network susceptible to route black-holing under single link failures. The main problem is the limited number of redundant paths between network elements, e.g., there is only a single path between any pair of Tier 1 and Tier 3 devices. However, some operators may find route aggregation desirable to improve control-plane stability.

If any technique to summarize within the topology is planned, modeling of the routing behavior and potential for black-holing should be done not only for single or multiple link failures, but also for fiber pathway failures or optical domain failures when the topology extends beyond a physical location. Simple modeling can be done by checking the reachability on devices doing summarization under the condition of a link or pathway failure between a set of devices in every tier as well as to the WAN routers when external connectivity is present.
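The simple modeling described above can be sketched as an exhaustive check over link failures. The toy topology, the device names, and the function names below are hypothetical, and the check only asks whether a summarizing device still has a downward path to each destination its summary covers; it is a rough illustration, not a complete routing model.

```python
from itertools import combinations

# Hypothetical toy topology: tier number per device and undirected links.
TIER = {"T1a": 1, "T1b": 1, "T2a": 2, "T2b": 2, "T3a": 3, "T3b": 3}
LINKS = {
    ("T1a", "T2a"), ("T1b", "T2b"),
    ("T2a", "T3a"), ("T2a", "T3b"),
    ("T2b", "T3a"), ("T2b", "T3b"),
}


def reachable_down(src, dst, links):
    """True if dst is reachable from src using only links toward lower tiers
    (i.e., toward larger tier numbers, closer to the servers)."""
    frontier, seen = [src], {src}
    while frontier:
        node = frontier.pop()
        if node == dst:
            return True
        for a, b in links:
            for u, v in ((a, b), (b, a)):
                if u == node and TIER[v] > TIER[u] and v not in seen:
                    seen.add(v)
                    frontier.append(v)
    return False


def black_holes(summarizers, covered, links, max_failures=1):
    """Return (failed_links, summarizer, destination) triples for which the
    summarizer would still attract traffic but could no longer deliver it."""
    findings = []
    for k in range(1, max_failures + 1):
        for failed in combinations(sorted(links), k):
            remaining = links - set(failed)
            for s in summarizers:
                for d in covered[s]:
                    if not reachable_down(s, d, remaining):
                        findings.append((failed, s, d))
    return findings


covered = {"T2a": ["T3a", "T3b"], "T2b": ["T3a", "T3b"]}
single = black_holes(["T2a", "T2b"], covered, LINKS)
```

A fiber pathway or optical domain failure could be approximated by failing a group of links together; raising `max_failures` is only a coarse stand-in for that.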

Route summarization would be possible with a small modification to the network topology, though the tradeoff would be reduction of the total size of the network as well as network congestion under specific failures. This approach is very similar to the technique described above, which allows Border Routers to summarize the entire data center address space.

8.2.1. Collapsing Tier 1 Devices Layer

In order to add more paths between Tier 1 and Tier 3 devices, group Tier 2 devices into pairs, and then connect the pairs to the same group of Tier 1 devices. This is logically equivalent to "collapsing" Tier 1 devices into a group of half the size, merging the links on the "collapsed" devices. The result is illustrated in Figure 6. For example, in this topology DEV C and DEV D connect to the same set of Tier 1 devices (DEV 1 and DEV 2), whereas before they were connecting to different groups of Tier 1 devices.

                      Tier 2       Tier 1       Tier 2
                     +-----+      +-----+      +-----+
       +-------------| DEV |------| DEV |------|     |-------------+
       |       +-----|  C  |--++--|  1  |--++--|     |-----+       |
       |       |     +-----+  ||  +-----+  ||  +-----+     |       |
       |       |              ||           ||              |       |
       |       |     +-----+  ||  +-----+  ||  +-----+     |       |
       | +-----+-----| DEV |--++--| DEV |--++--|     |-----+-----+ |
       | |     | +---|  D  |------|  2  |------|     |---+ |     | |
       | |     | |   +-----+      +-----+      +-----+   | |     | |
       | |     | |                                       | |     | |
     +-----+ +-----+                                   +-----+ +-----+
     | DEV | | DEV |                                   |     | |     |
     |  A  | |  B  | Tier 3                     Tier 3 |     | |     |
     +-----+ +-----+                                   +-----+ +-----+
       | |     | |                                       | |     | |
       O O     O O             <- Servers ->             O O     O O

Figure 6: 5-Stage Clos Topology

Having this design in place, Tier 2 devices may be configured to advertise only a default route down to Tier 3 devices. If a link between Tier 2 and Tier 3 fails, the traffic will be re-routed via the second available path known to a Tier 2 switch. It is still not possible to advertise a summary route covering prefixes for a single cluster from Tier 2 devices since each of them has only a single path down to this prefix. It would require dual-homed servers to accomplish that. Also note that this design is only resilient to single link failures. It is possible for a double link failure to isolate a Tier 2 device from all paths toward a specific Tier 3 device, thus causing a routing black-hole.

A result of the proposed topology modification would be a reduction of the port capacity of Tier 1 devices. This limits the maximum number of attached Tier 2 devices, and therefore will limit the maximum DC network size. A larger network would require different Tier 1 devices that have higher port density to implement this change.

Another problem is traffic rebalancing under link failures. Since there are two paths from Tier 1 to Tier 3, a failure of the link between a Tier 1 and a Tier 2 switch would result in all traffic that was taking the failed link switching to the remaining path. This will double the utilization of the remaining link.
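The doubling follows from simple arithmetic, shown here as a minimal sketch. It assumes traffic is spread evenly over the available equal-cost paths; the function name and the load values are illustrative only.

```python
def load_per_path(total_load, paths_up):
    """Load carried by each path when total_load is split evenly
    across paths_up equal-cost paths (an idealized ECMP assumption)."""
    if paths_up <= 0:
        raise ValueError("no path left: traffic is black-holed")
    return total_load / paths_up


before = load_per_path(1.0, 2)  # both Tier 1 to Tier 3 paths available
after = load_per_path(1.0, 1)   # one Tier 1 to Tier 2 link has failed
```

With only two paths, the surviving path carries the full load, so links would need to run at or below half of their capacity in steady state to absorb such a failure without congestion.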

8.2.2. Simple Virtual Aggregation

A completely different approach to route summarization is possible, provided that the main goal is to reduce the FIB size, while allowing the control plane to disseminate full routing information. Firstly, note that in many cases multiple prefixes, some of which are less specific, share the same set of next hops (the same ECMP group). For example, from the perspective of Tier 3 devices, all routes learned from upstream Tier 2 devices, including the default route, will share the same set of BGP next hops, provided that there are no failures in the network. This makes it possible to use a technique similar to that described in [RFC6769] and install only the least specific route in the FIB, ignoring more specific routes if they share the same next-hop set. For example, under normal network conditions, only the default route needs to be programmed into the FIB.

Furthermore, if the Tier 2 devices are configured with summary prefixes covering all of their attached Tier 3 devices' prefixes, the same logic could be applied in Tier 1 devices as well and, by induction, to Tier 2/Tier 3 switches in different clusters. These summary routes should still allow for more specific prefixes to leak to Tier 1 devices, to enable detection of mismatches in the next-hop sets if a particular link fails, thus changing the next-hop set for a specific prefix.

Restating once again, this technique does not reduce the amount of control-plane state (i.e., BGP UPDATEs, BGP Loc-RIB size), but only allows for more efficient FIB utilization, by detecting more specific prefixes that share their next-hop set with a subsuming less specific prefix.
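The FIB suppression idea can be illustrated with a minimal sketch. It is not the procedure specified in [RFC6769]; it only shows the comparison of a route's next-hop set against that of its closest installed covering route. The function name, prefixes, and next-hop labels are hypothetical.

```python
import ipaddress


def fib_after_suppression(rib):
    """Select the routes to install in the FIB.

    rib: mapping of prefix string -> iterable of next hops (full Loc-RIB view).
    A route is skipped when the longest already-installed covering prefix
    has exactly the same next-hop set, since lookups would resolve the same.
    """
    routes = {ipaddress.ip_network(p): frozenset(nh) for p, nh in rib.items()}
    installed = {}
    # Least specific first, so every covering route is decided before
    # the more specific routes beneath it are examined.
    for net in sorted(routes, key=lambda n: n.prefixlen):
        covering = [
            c for c in installed
            if c.version == net.version and net.subnet_of(c)
        ]
        if covering:
            parent = max(covering, key=lambda c: c.prefixlen)
            if installed[parent] == routes[net]:
                continue  # same next-hop set as the covering route
        installed[net] = routes[net]
    return {str(net): nh for net, nh in installed.items()}


# Hypothetical Tier 3 view: one prefix lost a next hop after a link failure.
rib = {
    "0.0.0.0/0": {"T2a", "T2b"},
    "10.0.1.0/24": {"T2a", "T2b"},
    "10.0.2.0/24": {"T2a"},
}
fib = fib_after_suppression(rib)
```

In this example only the default route and the prefix whose next-hop set differs are installed; the control plane still holds all three routes.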

8.3. ICMP Unreachable Message Masquerading

This section discusses some operational aspects of not advertising point-to-point link subnets into BGP, as previously identified as an option in Section 5.2.3. The operational impact of this decision could be seen when using the well-known "traceroute" tool. Specifically, IP addresses displayed by the tool will be the link's point-to-point addresses, and hence will be unreachable for management connectivity. This makes some troubleshooting more complicated.

One way to overcome this limitation is by using the DNS subsystem to create the "reverse" entries for these point-to-point IP addresses pointing to the same name as the loopback address. The connectivity then can be made by resolving this name to the "primary" IP address of the device, e.g., its Loopback interface, which is always advertised into BGP. However, this creates a dependency on the DNS subsystem, which may be unavailable during an outage.
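As an illustration of the DNS approach, the following sketch generates "reverse" (PTR) records that map point-to-point link addresses to the device name. The function name, device name, and addresses are hypothetical; the record syntax follows the usual zone-file form.

```python
import ipaddress


def ptr_records(device_fqdn, link_addresses):
    """Build PTR records pointing each link address at the device's name,
    i.e., the same name that resolves to its Loopback address."""
    return [
        f"{ipaddress.ip_address(addr).reverse_pointer}. IN PTR {device_fqdn}."
        for addr in link_addresses
    ]


records = ptr_records("t2a.dc.example", ["192.0.2.1", "192.0.2.5"])
```

An operator would then resolve the name shown by "traceroute" forward to the Loopback address to reach the device.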

Another option is to make the network device perform IP address masquerading, that is, rewriting the source IP addresses of the appropriate ICMP messages sent by the device with the "primary" IP address of the device. Specifically, the ICMP Destination Unreachable Message (type 3) code 3 (port unreachable) and ICMP Time Exceeded (type 11) code 0 are required for correct operation of the "traceroute" tool. With this modification, the "traceroute" probes sent to the devices will always be sent back with the "primary" IP address as the source, allowing the operator to discover the "reachable" IP address of the box. This has the downside of hiding the address of the "entry point" into the device. If the devices support [RFC5837], this may allow the best of both worlds by providing the information about the incoming interface even if the return address is the "primary" IP address.
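The source-address selection described above can be summarized in a small sketch. It models only the decision, not an actual device implementation; the function and parameter names are hypothetical, and the default source address is assumed to be whatever interface address the device would otherwise use.

```python
# ICMP (type, code) pairs that "traceroute" relies on, per the text above.
MASQUERADED = {
    (3, 3),   # Destination Unreachable, port unreachable
    (11, 0),  # Time Exceeded, code 0
}


def icmp_source(icmp_type, icmp_code, default_source_ip, primary_ip):
    """Return the source address to place in an outgoing ICMP message.

    Messages needed by traceroute are rewritten to the device's "primary"
    (e.g., Loopback) address; all others keep the default source address.
    """
    if (icmp_type, icmp_code) in MASQUERADED:
        return primary_ip
    return default_source_ip


ttl_expired_src = icmp_source(11, 0, "192.0.2.1", "198.51.100.10")
echo_reply_src = icmp_source(0, 0, "192.0.2.1", "198.51.100.10")
```

Limiting the rewrite to these two message types leaves other ICMP traffic, such as echo replies, unchanged.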