
7. Routing Convergence Properties

This section reviews routing convergence properties in the proposed design. It argues that sub-second convergence is achievable if the implementation supports fast EBGP peering session deactivation and timely RIB and FIB updates upon failure of the associated link.

7.1. Fault Detection Timing

BGP typically relies on an IGP to route around link/node failures inside an AS, and implements either a polling-based or an event-driven mechanism to obtain updates on IGP state changes. The proposed routing design does not use an IGP, so the remaining mechanisms that could be used for fault detection are BGP keep-alive time-out (or any other type of keep-alive mechanism) and link-failure triggers.

Relying solely on BGP keep-alive packets may result in high convergence delays, on the order of multiple seconds (on many BGP implementations the minimum configurable BGP hold timer value is three seconds). However, many BGP implementations can shut down local EBGP peering sessions in response to the "link down" event for the outgoing interface used for BGP peering. This feature is sometimes called "fast fallover". Since links in modern data centers are predominantly point-to-point fiber connections, a physical interface failure is often detected in milliseconds and subsequently triggers a BGP reconvergence.

Ethernet links may support failure signaling or detection standards such as Connectivity Fault Management (CFM) as described in [IEEE8021Q]; this may make failure detection more robust. Alternatively, some platforms may support Bidirectional Forwarding Detection (BFD) [RFC5880] to allow for sub-second failure detection and fault signaling to the BGP process. However, the use of either of these presents additional requirements to vendor software and possibly hardware, and may contradict REQ1. Until [RFC7130] was introduced, BFD also did not allow detection of a single member link failure on a LAG, which limited its usefulness in some designs.
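As a rough illustration of why BFD can detect failures well below the three-second hold timer floor mentioned above, the sketch below computes the asynchronous-mode detection time defined in [RFC5880]: the Detect Mult received from the remote system multiplied by the agreed transmit interval. The function name and the 300 ms / multiplier 3 values are illustrative assumptions, not recommendations from this document.

```python
def bfd_detection_time_ms(remote_detect_mult: int,
                          local_required_min_rx_ms: int,
                          remote_desired_min_tx_ms: int) -> int:
    """Asynchronous-mode BFD detection time as seen by the local system.

    The agreed transmit interval of the remote system is the greater of the
    local Required Min RX Interval and the remote Desired Min TX Interval.
    """
    agreed_tx_interval_ms = max(local_required_min_rx_ms,
                                remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_tx_interval_ms


# Minimum configurable hold timer on many BGP implementations, per the text.
BGP_MIN_HOLD_TIME_MS = 3_000

# Hypothetical timer values chosen only for illustration.
bfd_ms = bfd_detection_time_ms(remote_detect_mult=3,
                               local_required_min_rx_ms=300,
                               remote_desired_min_tx_ms=300)
print(bfd_ms)                          # 900
print(bfd_ms < BGP_MIN_HOLD_TIME_MS)   # True
```

Link-down triggers on point-to-point links need no such timer at all, which is why the design prefers them where available.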

7.2. Event Propagation Timing

In the proposed design, the impact of the BGP MinRouteAdvertisementIntervalTimer (MRAI timer), as specified in Section 9.2.1.1 of [RFC4271], should be considered. The standard requires BGP implementations to space out consecutive BGP UPDATE messages by at least MRAI seconds, where MRAI is often a configurable value. The initial BGP UPDATE messages that carry withdrawn routes after an event are commonly not affected by this timer. The MRAI timer may present significant convergence delays when a BGP speaker "waits" for the new path to be learned from its peers and has no local backup path information.

In a Clos topology, each EBGP speaker typically has either one path (Tier 2 devices don't accept paths from other Tier 2 devices in the same cluster because they share the same ASN) or N paths for the same prefix, where N is a large number, e.g., N=32 (the ECMP fan-out to the next tier). Therefore, if a link to a device from which a path is received fails, there is either no backup path at all (e.g., from the perspective of a Tier 2 switch losing the link to a Tier 3 device), or the backup is readily available in the BGP Loc-RIB (e.g., from the perspective of a Tier 2 device losing the link to a Tier 1 switch). In the former case, the BGP withdrawal announcement will propagate without delay and trigger reconvergence on affected devices. In the latter case, the best path will be re-evaluated, and the local ECMP group will be updated to match the new next-hop set. If the failed BGP path was the previously selected best path, an "implicit withdraw" will be sent via a BGP UPDATE message, as described in Option b in Section 3.1 of [RFC4271], because the BGP AS_PATH attribute changes.
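The two cases described above (no backup path versus a backup already present in the Loc-RIB) can be sketched as a small decision function. This is a simplified model, not an implementation of BGP best-path selection: the `LocRibEntry` class, the `on_link_failure` function, the device names, and the prefixes are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class LocRibEntry:
    """Simplified Loc-RIB entry: a prefix and the ECMP next hops it uses."""
    prefix: str
    next_hops: Set[str] = field(default_factory=set)


def on_link_failure(entry: LocRibEntry, failed_next_hop: str) -> str:
    """Classify how a speaker reacts when the link to one neighbor fails.

    Returns "unaffected", "withdraw" (no backup path left), or
    "update_ecmp" (backup paths remain in the Loc-RIB).
    """
    if failed_next_hop not in entry.next_hops:
        return "unaffected"
    entry.next_hops.discard(failed_next_hop)
    if not entry.next_hops:
        return "withdraw"      # propagate a withdrawal without delay
    return "update_ecmp"       # shrink the local ECMP group


# Tier 2 device losing its only path, learned from a Tier 3 device.
downstream = LocRibEntry("10.1.5.0/24", {"tier3-a"})
print(on_link_failure(downstream, "tier3-a"))   # withdraw

# Tier 2 device losing one of N=32 paths learned from Tier 1 devices.
upstream = LocRibEntry("10.2.0.0/16", {f"tier1-{i}" for i in range(32)})
print(on_link_failure(upstream, "tier1-0"))     # update_ecmp
print(len(upstream.next_hops))                  # 31
```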

7.3. Impact of Clos Topology Fan-Outs

A Clos topology has large fan-outs, which may impact the "Up->Down" convergence in some cases, as described in this section. When a link between a Tier 3 and a Tier 2 device fails, the Tier 2 device will send BGP UPDATE messages to all upstream Tier 1 devices, withdrawing the affected prefixes. The Tier 1 devices, in turn, will relay these messages to all downstream Tier 2 devices (except for the originator). Tier 2 devices other than the one originating the UPDATE then have to wait for ALL upstream Tier 1 devices to send an UPDATE message before removing the affected prefixes and sending the corresponding UPDATE downstream to connected Tier 3 devices. If the original Tier 2 device or the relaying Tier 1 devices introduce some delay into their UPDATE message announcements, the result could be UPDATE message "dispersion" that could be as long as multiple seconds. In order to avoid such behavior, BGP implementations must support "update groups". An "update group" is defined as a collection of neighbors sharing the same outbound policy -- the local speaker will send BGP updates to the members of the group synchronously.
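A minimal sketch of the "dispersion" effect, under the assumption that a Tier 2 device can remove a prefix only after the last upstream Tier 1 device has withdrawn it: the removal time is the maximum of the withdrawal arrival times, so a single slow relay delays the whole step. The function names and the arrival times are hypothetical.

```python
from typing import Dict


def removal_time_ms(withdraw_arrivals_ms: Dict[str, int]) -> int:
    """Time at which the prefix is removed: when the LAST withdrawal arrives."""
    return max(withdraw_arrivals_ms.values())


def dispersion_ms(withdraw_arrivals_ms: Dict[str, int]) -> int:
    """Spread between the first and the last withdrawal for the same prefix."""
    times = withdraw_arrivals_ms.values()
    return max(times) - min(times)


# Hypothetical arrival times (ms after the failure) at one Tier 2 device.
# Three Tier 1 devices relay promptly; one delays its UPDATE by two seconds.
arrivals = {"tier1-1": 10, "tier1-2": 12, "tier1-3": 11, "tier1-4": 2010}

print(removal_time_ms(arrivals))   # 2010
print(dispersion_ms(arrivals))     # 2000
```

With synchronous delivery to all members of an update group, the arrival times cluster together and the removal time approaches that of the fastest relay.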

The impact of such "dispersion" grows with the size of the topology fan-out and could also grow under network convergence churn. Some operators may be tempted to enable the "route flap dampening" type features that vendors include to reduce the control-plane impact of rapidly flapping prefixes. However, because these implementations are prone to false positives, especially under such "dispersion" events, enabling this feature is not recommended in this design. Further background on "route flap dampening", its issues, and possible implementation changes that could affect them is given in [RFC7196].

7.4. Failure Impact Scope

A network is declared to have converged in response to a failure once all devices within the failure impact scope have been notified of the event, have recalculated their RIBs, and have consequently updated their FIBs. A larger failure impact scope typically means slower convergence, since more devices have to be notified, and results in a less stable network. In this section, we describe BGP's advantages over link-state routing protocols in reducing failure impact scope for a Clos topology.

BGP behaves like a distance-vector protocol in the sense that only the best path from the point of view of the local router is sent to neighbors. As such, some failures are masked if the local node can immediately find a backup path and does not have to send any updates further. Notice that in the worst case, all devices in a data center topology have to either withdraw a prefix completely or update the ECMP groups in their FIBs. However, many failures will not result in such a wide impact. There are two main failure types where impact scope is reduced:

  • Failure of a link between Tier 2 and Tier 1 devices: In this case, a Tier 2 device will update the affected ECMP groups, removing the failed link. There is no need to send new information to downstream Tier 3 devices, unless the path was selected as best by the BGP process, in which case only an "implicit withdraw" needs to be sent and this should not affect forwarding. The affected Tier 1 device will lose the only path available to reach a particular cluster and will have to withdraw the associated prefixes. Such a prefix withdrawal process will only affect Tier 2 devices directly connected to the affected Tier 1 device. The Tier 2 devices receiving the BGP UPDATE messages withdrawing prefixes will simply have to update their ECMP groups. The Tier 3 devices are not involved in the reconvergence process.

  • Failure of a Tier 1 device: In this case, all Tier 2 devices directly attached to the failed node will have to update their ECMP groups for all IP prefixes from a non-local cluster. The Tier 3 devices are once again not involved in the reconvergence process, but may receive "implicit withdraws" as described above.

Even in the case of such failures where multiple IP prefixes will have to be reprogrammed in the FIB, it is worth noting that all of these prefixes share a single ECMP group on a Tier 2 device. Therefore, in the case of implementations with a hierarchical FIB, only a single change has to be made to the FIB. "Hierarchical FIB" here means FIB structure where the next-hop forwarding information is stored separately from the prefix lookup table, and the latter only stores pointers to the respective forwarding information. See [BGP-PIC] for discussion of FIB hierarchies and fast convergence.
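The "single change" property of a hierarchical FIB can be illustrated with a toy model in which prefixes store only a pointer (here, a group ID) to shared next-hop information. This is a sketch under simplifying assumptions: the `HierarchicalFib` class and its methods are hypothetical, lookups are exact-match rather than longest-prefix match, and the write counter merely stands in for hardware programming operations.

```python
from typing import Dict, Iterable, Set


class HierarchicalFib:
    """Toy FIB: the prefix table holds pointers to shared ECMP groups."""

    def __init__(self) -> None:
        self.prefix_to_group: Dict[str, str] = {}   # prefix -> group ID
        self.groups: Dict[str, Set[str]] = {}       # group ID -> next hops
        self.group_writes = 0                       # updates to group objects

    def add_group(self, group_id: str, next_hops: Iterable[str]) -> None:
        self.groups[group_id] = set(next_hops)

    def add_prefix(self, prefix: str, group_id: str) -> None:
        self.prefix_to_group[prefix] = group_id

    def remove_next_hop(self, group_id: str, next_hop: str) -> None:
        """One write to the shared group; no prefix entry is touched."""
        self.groups[group_id].discard(next_hop)
        self.group_writes += 1

    def lookup(self, prefix: str) -> Set[str]:
        return self.groups[self.prefix_to_group[prefix]]


fib = HierarchicalFib()
fib.add_group("to-tier1", [f"tier1-{i}" for i in range(4)])
for i in range(1000):                               # 1000 non-local prefixes
    fib.add_prefix(f"10.{i // 256}.{i % 256}.0/24", "to-tier1")

fib.remove_next_hop("to-tier1", "tier1-0")          # Tier 1 device fails
print(fib.group_writes)                             # 1
print(sorted(fib.lookup("10.0.7.0/24")))            # ['tier1-1', 'tier1-2', 'tier1-3']
```

In a flat FIB, by contrast, each of the 1000 prefix entries would need to be rewritten individually.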

Even though BGP offers reduced failure scope for some cases, further reduction of the fault domain using summarization is not always possible with the proposed design, since using this technique may create routing black-holes as mentioned previously. Therefore, the worst failure impact scope on the control plane is the network as a whole -- for instance, in the case of a link failure between Tier 2 and Tier 3 devices. The number of impacted prefixes in this case would be much smaller than in the case of a failure in the upper layers of a Clos network topology. The property of having such a large failure scope is not a result of choosing EBGP in the design but rather a result of using the Clos topology.

7.5. Routing Micro-Loops

When a downstream device, e.g., a Tier 2 device, loses all paths for a prefix, it normally has the default route pointing toward the upstream device -- in this case, the Tier 1 device. As a result, a situation can arise where a Tier 2 switch loses a prefix, but a Tier 1 switch still has the path pointing to the Tier 2 device; this results in a transient micro-loop, since the Tier 1 switch will keep forwarding packets destined for the affected prefix back to the Tier 2 device, and the Tier 2 device will bounce them back again using the default route. This micro-loop will last for the time it takes the upstream device to fully update its forwarding tables.

To minimize the impact of such micro-loops, Tier 2 and Tier 1 switches can be configured with static "discard" or "null" routes that are more specific than the default route and cover prefixes that are missing during network convergence. For Tier 2 switches, the discard route should be a summary route, covering all server subnets of the underlying Tier 3 devices. For Tier 1 devices, the discard route should be a summary covering the server IP address subnets allocated for the whole data center. Those discard routes will only take precedence for the duration of network convergence, until the device learns a more specific prefix via a new path.
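The effect of a discard route follows from longest-prefix matching, which the sketch below models with Python's standard `ipaddress` module. The prefixes, next-hop labels, and the `longest_prefix_match` helper are hypothetical and stand in for a Tier 2 switch's forwarding table; a real FIB would not scan routes linearly.

```python
import ipaddress
from typing import Dict, Optional


def longest_prefix_match(routes: Dict[str, str],
                         destination: str) -> Optional[str]:
    """Return the next hop of the most specific route covering destination."""
    addr = ipaddress.ip_address(destination)
    matches = []
    for prefix, next_hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network.prefixlen, next_hop))
    if not matches:
        return None
    return max(matches)[1]   # each prefix length matches at most once here


# Hypothetical Tier 2 table: default route, summary discard, one server subnet.
tier2_routes = {
    "0.0.0.0/0": "tier1-uplink",    # default route toward Tier 1
    "10.1.0.0/16": "discard",       # static summary of this cluster's subnets
    "10.1.5.0/24": "tier3-a",       # specific prefix learned via BGP
}

print(longest_prefix_match(tier2_routes, "10.1.5.7"))   # tier3-a

# The link to the Tier 3 device fails and the specific prefix is removed.
del tier2_routes["10.1.5.0/24"]
print(longest_prefix_match(tier2_routes, "10.1.5.7"))   # discard

# Without the discard route, traffic would follow the default back to Tier 1.
del tier2_routes["10.1.0.0/16"]
print(longest_prefix_match(tier2_routes, "10.1.5.7"))   # tier1-uplink
```

The last lookup is the micro-loop case: the packet returns to the Tier 1 switch that sent it. With the summary discard route in place, the packet is dropped instead until a more specific prefix is relearned.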