6. ECMP Considerations

This section covers Equal Cost Multipath (ECMP) functionality for a Clos topology and discusses a few special requirements.

6.1. Basic ECMP

ECMP is the fundamental load-sharing mechanism used by a Clos topology. Effectively, every lower-tier device will use all of its directly attached upper-tier devices to load-share traffic destined to the same IP prefix. The number of ECMP paths between any two Tier 3 devices in a Clos topology is equal to the number of devices in the middle stage (Tier 1). For example, Figure 5 illustrates a topology where Tier 3 device A has four paths to reach servers X and Y, via Tier 2 devices B and C and then Tier 1 devices 1, 2, 3, and 4, respectively.

                                Tier 1
                               +-----+
                               | DEV |
                            +->|  1  |--+
                            |  +-----+  |
                    Tier 2  |           |   Tier 2
                   +-----+  |  +-----+  |  +-----+
      +----------->| DEV |--+->| DEV |--+--|     |-------------+
      |       +----|  B  |--+  |  2  |  +--|     |-----+       |
      |       |    +-----+     +-----+     +-----+     |       |
      |       |                                        |       |
      |       |    +-----+     +-----+     +-----+     |       |
      | +-----+--->| DEV |--+  | DEV |  +--|     |-----+-----+ |
      | |     | +--|  C  |--+->|  3  |--+--|     |---+ |     | |
      | |     | |  +-----+  |  +-----+  |  +-----+   | |     | |
      | |     | |           |           |            | |     | |
    +-----+ +-----+         |  +-----+  |          +-----+ +-----+
    | DEV | |     |   Tier 3+->| DEV |--+   Tier 3 |     | |     |
    |  A  | |     |            |  4  |             |     | |     |
    +-----+ +-----+            +-----+             +-----+ +-----+
      | |     | |                                    | |     | |
      O O     O O           <- Servers ->            X Y     O O

Figure 5: ECMP Fan-Out Tree from A to X and Y
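
The path count in Figure 5 can be checked with a short enumeration. The following is a minimal sketch, not part of the design itself: the labels T2_E, T2_F, and T3_XY are hypothetical names for devices the figure leaves unnamed, and the adjacency is read off the figure (B attaches to Tier 1 devices 1 and 2, C to 3 and 4).

```python
# Adjacency read off Figure 5, upstream links first, then downstream links.
# T2_E, T2_F and T3_XY are hypothetical labels for unnamed devices.
UP = {
    "A": ["B", "C"],      # Tier 3 -> Tier 2
    "B": ["1", "2"],      # Tier 2 -> Tier 1
    "C": ["3", "4"],
}
DOWN = {
    "1": ["T2_E"], "2": ["T2_E"],   # Tier 1 -> remote Tier 2
    "3": ["T2_F"], "4": ["T2_F"],
    "T2_E": ["T3_XY"],              # remote Tier 2 -> Tier 3 device serving X and Y
    "T2_F": ["T3_XY"],
}


def enumerate_paths(src, dst):
    """Return every path from src to dst that goes up and then down the fabric."""
    links = {**UP, **DOWN}
    paths = []
    stack = [[src]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == dst:
            paths.append(path)
            continue
        for nxt in links.get(node, []):
            stack.append(path + [nxt])
    return sorted(paths)


paths = enumerate_paths("A", "T3_XY")
tier1_used = {p[2] for p in paths}   # index 2 is the Tier 1 hop
for p in paths:
    print(" -> ".join(p))
```

Under these assumptions the enumeration yields one path per Tier 1 device, which matches the statement that the number of ECMP paths equals the number of devices in the middle stage.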

The ECMP requirement implies that the BGP implementation must support multipath fan-out for up to the maximum number of devices directly attached at any point in the topology in the upstream or downstream direction. Normally, this number does not exceed half of the ports found on a device in the topology. For example, an ECMP fan-out of 32 would be required when building a Clos network using 64-port devices. The Border Routers may need to have wider fan-out to be able to connect to a multitude of Tier 1 devices if route summarization at Border Router level is implemented as described in Section 5.2.5. If a device's hardware does not support wider ECMP, logical link-grouping (link-aggregation at Layer 2) could be used to provide "hierarchical" ECMP (Layer 3 ECMP coupled with Layer 2 ECMP) to compensate for fan-out limitations. However, this approach increases the risk of flow polarization, as less entropy will be available at the second stage of ECMP.

Most BGP implementations declare paths to be equal from an ECMP perspective if they match up to and including step (e) in Section 9.1.2.2 of [RFC4271]. In the proposed network design there is no underlying IGP, so all IGP costs are assumed to be zero or otherwise the same value across all paths. Policies may be applied as necessary to equalize BGP attributes that vary in vendor defaults, such as the MULTI_EXIT_DISC (MED) attribute and origin code. For historical reasons, it is also useful to not use 0 as the equalized MED value; this and some other useful BGP information is available in [RFC4277]. Routing loops are unlikely due to the BGP best-path selection process (which prefers shorter AS_PATH length), and longer paths through the Tier 1 devices (which don't allow their own ASN in the path) are not possible.
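
The equality test described above can be sketched as a comparison key. This is a simplified illustration under stated assumptions, not any vendor's implementation: the Path type and its field names are hypothetical, LOCAL_PREF is assumed to be already equal, and MED is compared across all paths, whereas [RFC4271] compares it only among paths learned from the same neighboring AS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """Hypothetical container for the attributes used in tie-breaking."""
    next_hop: str
    as_path: tuple   # sequence of ASNs
    origin: int      # 0 = IGP, 1 = EGP, 2 = INCOMPLETE
    med: int
    is_ebgp: bool
    igp_cost: int    # identical for all paths here, since there is no IGP


def tie_break_key(p):
    """Key covering steps (a) through (e); a lower key is preferred."""
    return (len(p.as_path), p.origin, p.med, not p.is_ebgp, p.igp_cost)


def ecmp_set(paths):
    """Return the paths that tie for best through step (e)."""
    ranked = sorted(paths, key=tie_break_key)
    best = tie_break_key(ranked[0])
    return [p for p in ranked if tie_break_key(p) == best]


candidates = [
    Path("10.0.0.1", (65001, 65101), 0, 100, True, 0),
    Path("10.0.0.2", (65002, 65101), 0, 100, True, 0),
    Path("10.0.0.3", (65003, 65201, 65101), 0, 100, True, 0),  # longer AS_PATH
]
print([p.next_hop for p in ecmp_set(candidates)])
```

The third path is excluded only because its AS_PATH is longer; equalizing MED and origin by policy, as the text suggests, keeps those fields from splitting the set.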

6.2. BGP ECMP over Multiple ASNs

For application load-balancing purposes, it is desirable to have the same prefix advertised from multiple Tier 3 devices. From the perspective of other devices, such a prefix would have BGP paths with different AS_PATH attribute values, while having the same AS_PATH attribute lengths. Therefore, BGP implementations must support load-sharing over the above-mentioned paths. This feature is sometimes known as "multipath relax" or "multipath multiple-AS" and effectively allows for ECMP to be done across different neighboring ASNs if all other attributes are equal as already described in the previous section.
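
The difference between strict and relaxed multipath can be illustrated as follows. This is a sketch only: the function name and the relax flag are hypothetical, and the strict rule shown (identical AS_PATH contents) is an assumption, since default behavior differs between implementations.

```python
def multipath_candidates(as_paths, relax=False):
    """Filter otherwise-equal paths against the first one.

    relax=False: require identical AS_PATH contents (assumed strict default).
    relax=True:  require only equal AS_PATH length ("multipath relax").
    """
    reference = as_paths[0]
    if relax:
        return [p for p in as_paths if len(p) == len(reference)]
    return [p for p in as_paths if p == reference]


# The same prefix advertised from two Tier 3 devices in different ASNs,
# as seen from a Tier 1 device (ASN values are illustrative).
seen = [(65001, 65101), (65002, 65102)]
strict = multipath_candidates(seen)
relaxed = multipath_candidates(seen, relax=True)
print(len(strict), len(relaxed))
```

With the relaxed rule both advertisements are usable for load-sharing, which is the behavior this section requires.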

6.3. Weighted ECMP

It may be desirable for the network devices to implement "weighted" ECMP, to be able to send more traffic over some paths in ECMP fan-out. This could be helpful to compensate for failures in the network and send more traffic over paths that have more capacity. The prefixes that require weighted ECMP would have to be injected using a remote BGP speaker (central agent) over a multi-hop session as described further in Section 8.1. If support in implementations is available, weight distribution for multiple BGP paths could be signaled using the technique described in [LINK].
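
One common way to realize weighted ECMP in a forwarding table is to replicate next hops in proportion to their weights. The sketch below assumes integer weights and uses CRC32 as a stand-in hash; both function names are hypothetical, and the signaling of weights via [LINK] is not modeled.

```python
import zlib


def build_bucket_table(weights):
    """Expand {next_hop: integer weight} into one table slot per weight unit."""
    table = []
    for next_hop in sorted(weights):
        table.extend([next_hop] * weights[next_hop])
    return table


def pick_next_hop(table, flow):
    """Map a flow tuple to a slot with a deterministic hash (CRC32 is an assumption)."""
    digest = zlib.crc32(repr(flow).encode())
    return table[digest % len(table)]


# nh1 has three times the capacity of nh2, so it gets three of four slots.
table = build_bucket_table({"nh1": 3, "nh2": 1})
flow = ("10.0.0.1", "192.0.2.1", 6, 40000, 443)
print(table, pick_next_hop(table, flow))
```

Because the hash is deterministic, packets of a given flow always map to the same slot, while the share of slots per next hop follows the configured weights.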

6.4. Consistent Hashing

It is often desirable for the hashing function used for ECMP to be consistent (see [CONS-HASH]), to minimize changes in flow to next-hop affinity when a next hop is added to or removed from an ECMP group. This could be used if the network device is used as a load balancer, mapping flows toward multiple destinations -- in this case, losing or adding a destination will not have a detrimental effect on currently established flows. One particular recommendation on implementing consistent hashing is provided in [RFC2992], though other implementations are possible. This functionality could be naturally combined with weighted ECMP, with the impact of the next hop changes being proportional to the weight of the given next hop. The downside of consistent hashing is increased hardware resource utilization, as typically more resources (e.g., Ternary Content-Addressable Memory (TCAM) space) are required to implement a consistent-hashing function.
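
The affinity property can be demonstrated with rendezvous (highest-random-weight) hashing. Note that this is a different technique from the one recommended in [RFC2992]; it is used here only because it is compact, and the function names and flow tuples are illustrative assumptions.

```python
import hashlib


def score(flow, next_hop):
    """Deterministic pseudo-random score for a (flow, next hop) pair."""
    data = f"{flow}|{next_hop}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def pick_next_hop(flow, next_hops):
    """Rendezvous choice: the next hop with the highest score wins."""
    return max(next_hops, key=lambda nh: score(flow, nh))


flows = [("10.0.0.%d" % i, "192.0.2.1", 6, 40000 + i, 443) for i in range(200)]
before = {f: pick_next_hop(f, ["nh1", "nh2", "nh3", "nh4"]) for f in flows}
after = {f: pick_next_hop(f, ["nh1", "nh2", "nh4"]) for f in flows}   # nh3 removed
moved = [f for f in flows if before[f] != after[f]]
print(len(moved), "of", len(flows), "flows changed next hop")
```

Only flows that were mapped to the removed next hop are reassigned; every other flow keeps its previous next hop, which is the behavior the text describes for established flows.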