8. Multihoming NVEs - NVE Residing in ToR Switch

In this section, we discuss the scenario where the NVEs reside in the ToR switches AND the servers (where VMs are residing) are multihomed to these ToR switches. The multihoming NVE operates in All-Active or Single-Active redundancy mode. If the servers are single-homed to the ToR switches, then the scenario becomes similar to that where the NVE resides on the hypervisor, as discussed in Section 7, as far as the required EVPN functionality is concerned.

[RFC7432] defines a set of BGP routes, attributes, and procedures to support multihoming. We first describe these functions and procedures, then discuss which of them are impacted by VXLAN (or NVGRE) encapsulation and what modifications are required. As will be seen later in this section, the only EVPN procedure impacted by non-MPLS overlay encapsulation (e.g., VXLAN or NVGRE), which provides space for a single ID rather than a stack of labels, is split-horizon filtering for multihomed ESs, described in Section 8.3.1.

8.1. EVPN Multihoming Features

In this section, we recap the multihoming features of EVPN to highlight their encapsulation dependencies. The section describes the features and functions only at a high level. For more details, the reader should refer to [RFC7432].

8.1.1. Multihomed ES Auto-Discovery

EVPN NVEs (or PEs) connected to the same ES (e.g., the same server via Link Aggregation Group (LAG)) can automatically discover each other with minimal to no configuration through the exchange of BGP routes.

8.1.2. Fast Convergence and Mass Withdrawal

EVPN defines a mechanism to efficiently and quickly signal, to remote NVEs, the need to update their forwarding tables upon the occurrence of a failure in connectivity to an ES (e.g., a link or a port failure). This is done by having each NVE advertise an Ethernet A-D route per ES for each locally attached segment. Upon a failure in connectivity to the attached segment, the NVE withdraws the corresponding Ethernet A-D route. This triggers all NVEs that receive the withdrawal to update their next-hop adjacencies for all MAC addresses associated with the ES in question. If no other NVE had advertised an Ethernet A-D route for the same segment, then the NVE that received the withdrawal simply invalidates the MAC entries for that segment. Otherwise, the NVE updates the next-hop adjacency list accordingly.
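The mass-withdrawal behavior can be sketched as follows. This is a minimal illustration, not an implementation from [RFC7432]: it assumes a remote NVE resolves each MAC address through a per-ES next-hop list, so that a single Ethernet A-D per ES withdrawal updates every MAC behind that ES. The class and method names are hypothetical, and Aliasing and per-MAC advertiser tracking are deliberately left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RemoteNveState:
    """Hypothetical per-EVI state on a remote NVE (illustrative only)."""

    # ESI -> IPs of NVEs currently advertising an Ethernet A-D per ES route
    es_next_hops: dict[str, set[str]] = field(default_factory=dict)
    # MAC address -> ESI it was advertised with (non-zero ESIs only)
    mac_to_esi: dict[str, str] = field(default_factory=dict)

    def ad_per_es_advertised(self, esi: str, nve_ip: str) -> None:
        self.es_next_hops.setdefault(esi, set()).add(nve_ip)

    def ad_per_es_withdrawn(self, esi: str, nve_ip: str) -> None:
        # A single withdrawal updates every MAC behind the ES at once,
        # because MACs resolve through the per-ES next-hop list.
        hops = self.es_next_hops.get(esi, set())
        hops.discard(nve_ip)
        if not hops:
            # No NVE reaches the ES any more: invalidate its MAC entries.
            self.es_next_hops.pop(esi, None)
            self.mac_to_esi = {
                mac: e for mac, e in self.mac_to_esi.items() if e != esi
            }

    def next_hops(self, mac: str) -> set[str]:
        esi = self.mac_to_esi.get(mac)
        return set(self.es_next_hops.get(esi, set())) if esi else set()
```

The design choice is indirection: MAC entries point at an ES rather than at an NVE, which is what lets one route withdrawal repair many MAC entries without per-MAC BGP updates.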

8.1.3. Split-Horizon

If a server that is multihomed to two or more NVEs (represented by Ethernet segment ES1) and operating in All-Active redundancy mode sends a BUM (i.e., Broadcast, Unknown unicast, or Multicast) packet to one of these NVEs, then it is important to ensure that the packet is not looped back to the server via another NVE connected to this server. The filtering mechanism on the NVE that prevents such loops and packet duplication is called "split-horizon filtering".

8.1.4. Aliasing and Backup Path

In the case where a station is multihomed to multiple NVEs, it is possible that only a single NVE learns a set of the MAC addresses associated with traffic transmitted by the station. This leads to a situation where remote NVEs receive MAC Advertisement routes, for these addresses, from a single NVE even though multiple NVEs are connected to the multihomed station. As a result, the remote NVEs are not able to effectively load-balance traffic among the NVEs connected to the multihomed ES. For example, this could be the case when the NVEs perform data-path learning on the access and the load-balancing function on the station hashes traffic from a given source MAC address to a single NVE. Another scenario where this occurs is when the NVEs rely on control-plane learning on the access (e.g., using ARP), since ARP traffic will be hashed to a single link in the LAG.

To alleviate this issue, EVPN introduces the concept of "Aliasing". This refers to the ability of an NVE to signal that it has reachability to a given locally attached ES, even when it has learned no MAC addresses from that segment. The Ethernet A-D route per EVI is used to that end. Remote NVEs that receive MAC Advertisement routes with non-zero ESIs should consider the MAC address as reachable via all NVEs that advertise reachability to the relevant Segment using Ethernet A-D routes with the same ESI and with the Single-Active flag reset.

Backup Path is a closely related function, albeit one that applies to the case where the redundancy mode is Single-Active. In this case, the NVE signals that it has reachability to a given locally attached ES using the Ethernet A-D route as well. Remote NVEs that receive the MAC Advertisement routes, with non-zero ESI, should consider the MAC address as reachable via the advertising NVE. Furthermore, the remote NVEs should install a Backup Path, for said MAC, to the NVE that had advertised reachability to the relevant segment using an Ethernet A-D route with the same ESI and with the Single-Active flag set.
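How a remote NVE might combine a MAC Advertisement route with the Ethernet A-D routes for the same ESI can be sketched as below. This is a simplified model under stated assumptions: the route class, its fields, and the textual zero-ESI constant are hypothetical, and the Single-Active flag is treated as a plain boolean without modeling where it is encoded on the wire.

```python
from __future__ import annotations

from dataclasses import dataclass

ZERO_ESI = "00:00:00:00:00:00:00:00:00:00"  # 10-octet ESI of a single-homed site


@dataclass(frozen=True)
class EthernetAdRoute:
    """Simplified Ethernet A-D route as seen by a remote NVE (hypothetical)."""

    nve_ip: str
    esi: str
    single_active: bool  # Single-Active flag as signaled for the ES


def resolve_mac(mac_nve: str, mac_esi: str,
                ad_routes: list[EthernetAdRoute]) -> tuple[set[str], set[str]]:
    """Return (primary next hops, backup next hops) for one MAC route."""
    if mac_esi == ZERO_ESI:
        return {mac_nve}, set()
    same_es = [r for r in ad_routes if r.esi == mac_esi]
    if any(r.single_active for r in same_es):
        # Backup Path: forward to the advertising NVE only, and pre-install
        # the other Single-Active NVEs on the ES as backups.
        backups = {r.nve_ip for r in same_es
                   if r.single_active and r.nve_ip != mac_nve}
        return {mac_nve}, backups
    # Aliasing: load-balance across every NVE with the flag reset,
    # even those that never advertised this MAC.
    return {mac_nve} | {r.nve_ip for r in same_es}, set()
```

For example, if only 10.0.0.1 advertised a MAC on an All-Active ES that 10.0.0.2 also reaches, the sketch returns both NVEs as primaries, which is the load-balancing effect Aliasing is meant to restore.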

8.1.5. DF Election

If a host is multihomed to two or more NVEs on an ES operating in All-Active redundancy mode, then, for a given EVI, only one of these NVEs, termed the "Designated Forwarder" (DF), is responsible for sending it broadcast, multicast, and, if configured for that EVI, unknown unicast frames.

This is required in order to prevent duplicate delivery of multi-destination frames to a multihomed host or VM, in case of All-Active redundancy.

In NVEs where IEEE 802.1Q-tagged frames [IEEE.802.1Q] are received from hosts, the DF election should be performed based on host VIDs per Section 8.5 of [RFC7432]. Furthermore, the multihoming PEs of a given ES MAY perform DF election using configured IDs such as VNIs, EVIs, or normalized VIDs, as long as the IDs are configured consistently across the multihoming PEs.

In GWs where VXLAN-encapsulated frames are received, the DF election is performed on VNIs. Again, it is assumed that, for a given Ethernet segment, VNIs are unique and consistent (e.g., no duplicate VNIs exist).
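The default DF election in Section 8.5 of [RFC7432] orders the IP addresses of all PEs on the ES in increasing numeric value, assigns ordinals starting at 0, and makes the PE with ordinal i the DF for service ID V when (V mod N) = i, where N is the number of PEs. The sketch below applies that rule with a VNI substituted for V, as a GW performing DF election on VNIs would; the function names are hypothetical, and the sketch assumes all addresses belong to the same IP family.

```python
from __future__ import annotations

import ipaddress


def elect_df(pe_ips: list[str], vni: int) -> str:
    """Return the DF among `pe_ips` for `vni` using the (V mod N) rule."""
    # Order by numeric address value, not string order ("10.0.0.10" > "10.0.0.2").
    ordered = sorted(pe_ips, key=ipaddress.ip_address)
    return ordered[vni % len(ordered)]


def forwards_bum_to_es(my_ip: str, pe_ips: list[str], vni: int) -> bool:
    """Only the DF sends multi-destination frames for `vni` toward the ES."""
    return elect_df(pe_ips, vni) == my_ip
```

With PEs 10.0.0.1, 10.0.0.2, and 10.0.0.10, VNI 5001 maps to ordinal 0 (10.0.0.1) and VNI 5000 to ordinal 2 (10.0.0.10). This is also why the VNIs must be unique and consistent across the ES: every PE must compute the same ordinal for the same service.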

8.2. Impact on EVPN BGP Routes and Attributes

Since multihoming is supported in this scenario, the entire set of BGP routes and attributes defined in [RFC7432] is used. The setting of the Ethernet Tag field in the MAC Advertisement, Ethernet A-D per EVI, and IMET routes follows that of Section 5.1.3. Furthermore, the setting of the VNI field in the MAC Advertisement and Ethernet A-D per EVI routes follows that of Section 5.1.3.

8.3. Impact on EVPN Procedures

Two cases need to be examined here, depending on whether the NVEs are operating in Single-Active or in All-Active redundancy mode.

First, let's consider the case of Single-Active redundancy mode, where the hosts are multihomed to a set of NVEs but only a single NVE is active at a given point in time for a given VNI. In this case, Aliasing is not required, and split-horizon filtering may not be required, but other functions such as multihomed ES auto-discovery, fast convergence and mass withdrawal, Backup Path, and DF election are required.

Second, let's consider the case of All-Active redundancy mode. In this case, out of all the EVPN multihoming features listed in Section 8.1, the use of the VXLAN or NVGRE encapsulation impacts the split-horizon and Aliasing features, since those two rely on the MPLS client layer. Given that this MPLS client layer is absent with these types of encapsulations, alternative procedures and mechanisms are needed to provide the required functions. Those are discussed in detail next.

8.3.1. Split Horizon

In EVPN, an MPLS label is used for split-horizon filtering to support All-Active multihoming: an ingress NVE adds a label corresponding to the site of origin (aka an ESI label) when encapsulating the packet. The egress NVE checks the ESI label when attempting to forward a multi-destination frame out an interface, and if the label corresponds to the same site identifier (ESI) associated with that interface, the packet is dropped. This prevents the occurrence of forwarding loops.
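For contrast with the VXLAN/NVGRE approach that follows, the egress check for MPLS-based split-horizon can be sketched as a port filter. This is a minimal sketch under assumptions: the function, the port names, and the label values are hypothetical, and DF filtering, which would also apply to these frames, is omitted so that only the ESI-label check is shown.

```python
from __future__ import annotations

from typing import Optional


def split_horizon_ports(esi_label: Optional[int],
                        label_to_esi: dict[int, str],
                        ports: dict[str, Optional[str]]) -> list[str]:
    """Ports a multi-destination frame may leave through at the egress NVE.

    esi_label    -- ESI label found in the packet, or None if absent
    label_to_esi -- ESI labels this NVE allocated, mapped to their ESIs
    ports        -- local port name -> attached ESI (None if single-homed)
    """
    origin_esi = label_to_esi.get(esi_label) if esi_label is not None else None
    # Drop toward the ES the frame came from; forward everywhere else.
    return [p for p, esi in ports.items()
            if origin_esi is None or esi != origin_esi]
```

The key point is that the check needs a per-ES identifier carried in the packet itself, which is exactly what the VXLAN and NVGRE headers lack.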

Since VXLAN and NVGRE encapsulations do not include the ESI label, other means of performing the split-horizon filtering function must be devised for these encapsulations. The following approach is recommended for split-horizon filtering when VXLAN (or NVGRE) encapsulation is used.

Every NVE tracks the IP address(es) associated with the other NVE(s) with which it shares multihomed ESs. When the NVE receives a multi-destination frame from the overlay network, it examines the source IP address in the tunnel header (which corresponds to the ingress NVE) and filters out the frame on all local interfaces connected to ESs that are shared with the ingress NVE. With this approach, the ingress NVE is required to replicate all flooded traffic received on the access interfaces (i.e., from the hosts) locally to all directly attached Ethernet segments, regardless of the DF election state. This approach is referred to as "Local Bias". It has the advantage that only a single IP address needs to be used per NVE for split-horizon filtering, as opposed to requiring an IP address per Ethernet segment per NVE.
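The two halves of Local Bias, ingress replication to local ports regardless of DF state and egress filtering by tunnel source IP, can be sketched for a single VNI as follows. This is an illustrative model, not a normative algorithm: the class and its fields are hypothetical, `es_peers` is assumed to have been populated by multihomed ES auto-discovery (Section 8.1.1), and for ESs not shared with the ingress NVE the sketch assumes ordinary DF filtering still applies.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LocalBiasNve:
    """Hypothetical model of one NVE's BUM forwarding for a single VNI."""

    ports: dict[str, Optional[str]]   # local port -> ESI (None if single-homed)
    es_peers: dict[str, set[str]]     # ESI -> IPs of peer NVEs sharing that ES
    df_for: set[str] = field(default_factory=set)  # ESIs where this NVE is DF

    def flood_from_access(self, in_port: str) -> list[str]:
        # Local Bias: replicate to every other local port, ignoring DF state.
        return [p for p in self.ports if p != in_port]

    def flood_from_overlay(self, tunnel_src_ip: str) -> list[str]:
        out = []
        for port, esi in self.ports.items():
            if esi is None:
                out.append(port)          # single-homed: always deliver
            elif tunnel_src_ip in self.es_peers.get(esi, set()):
                continue                  # ingress NVE already delivered locally
            elif esi in self.df_for:
                out.append(port)          # non-shared ES: normal DF filtering
        return out
```

Only the tunnel source IP is consulted on the overlay side, which is why one IP address per NVE suffices for the filtering decision.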

To allow proper operation of split-horizon filtering among the same group of multihoming PE devices, a given Ethernet segment MUST NOT be configured with a mix of PE devices, some using MPLS over GRE encapsulation with the split-horizon procedures of [RFC7432] and others using VXLAN/NVGRE encapsulation with the local-bias procedures.
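An operator-side consistency check for this rule might look like the sketch below. It assumes a hypothetical configuration model that records, per ESI, which split-horizon method each multihoming PE uses; neither the enum nor the function corresponds to any real configuration API.

```python
from __future__ import annotations

from enum import Enum


class SplitHorizon(Enum):
    ESI_LABEL = "esi-label"    # MPLS over GRE, [RFC7432] procedures
    LOCAL_BIAS = "local-bias"  # VXLAN/NVGRE, source-IP filtering


def misconfigured_segments(es_config: dict[str, dict[str, SplitHorizon]]) -> list[str]:
    """Return ESIs whose multihoming PEs do not all use the same method."""
    return [esi for esi, pes in es_config.items()
            if len(set(pes.values())) > 1]
```

A mixed segment breaks filtering in both directions: a local-bias PE does not attach an ESI label, and an ESI-label PE does not filter by tunnel source IP, so neither side recognizes the other's BUM traffic as originating from the shared ES.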

8.3.2. Aliasing and Backup Path

The Aliasing and Backup Path procedures for VXLAN/NVGRE encapsulation are very similar to the ones for MPLS. In the case of MPLS, the Ethernet A-D per EVI route is used for Aliasing when the corresponding ES operates in All-Active multihoming, and the same route is used for Backup Path when the corresponding ES operates in Single-Active multihoming. In the case of VXLAN/NVGRE, the same route is used for Aliasing and Backup Path, with the difference that the Ethernet Tag and VNI fields in the Ethernet A-D per EVI route are set as described in Section 5.1.3.

8.3.3. Unknown Unicast Traffic Designation

In EVPN, when an ingress PE uses ingress replication to flood unknown unicast traffic to egress PEs, the ingress PE uses a different EVPN MPLS label (from the one used for known unicast traffic) to identify such BUM traffic. The egress PEs use this label to identify such BUM traffic and thus apply DF filtering for All-Active multihomed sites.

In the absence of an unknown unicast traffic designation, and when unknown unicast flooding is enabled, there can be transient duplicate traffic to All-Active multihomed sites under the following condition: the host MAC address has been learned by the egress PE(s) and advertised to the ingress PE, but the MAC Advertisement route has not yet been received or processed by the ingress PE. The host MAC address is therefore unknown on the ingress PE but known on the egress PE(s). When a packet destined to that host MAC address arrives at the ingress PE, the ingress PE floods it via ingress replication to all the egress PEs, and since the MAC address is known to the egress PEs, each of them forwards it, so multiple copies are sent to the All-Active multihomed site.

Such transient packet duplication only happens when a) the destination host is multihomed in All-Active redundancy mode, b) flooding of unknown unicast is enabled in the network, c) ingress replication is used, and d) traffic for the destination host arrives at the ingress PE before it learns the host MAC address via a BGP EVPN advertisement. If it is desired to avoid such transient packet duplication (however low its probability), then VXLAN-GPE encapsulation needs to be used between these PEs, and the ingress PE needs to set the BUM Traffic Bit (B bit) [VXLAN-GPE] to indicate that the packet is ingress-replicated BUM traffic.
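The duplication scenario can be made concrete with a small model. This is a sketch under simplifying assumptions, not a description of any implementation: the B bit is modeled as a boolean rather than as a field at a specific position in the VXLAN-GPE header, the function names are hypothetical, and an egress PE that does not know the destination MAC is assumed to treat the frame as BUM and apply DF filtering.

```python
from __future__ import annotations


def duplication_possible(all_active: bool, uu_flooding: bool,
                         ingress_replication: bool,
                         mac_unknown_at_ingress: bool,
                         b_bit_used: bool) -> bool:
    """Conditions a) to d) from the text, unless the B bit is in use."""
    return (all_active and uu_flooding and ingress_replication
            and mac_unknown_at_ingress and not b_bit_used)


def copies_to_es(egress_pes: list[tuple[bool, bool]], b_bit_set: bool) -> int:
    """Copies an All-Active ES receives when the ingress floods a frame.

    egress_pes -- one (is_df, knows_dst_mac) pair per PE attached to the ES
    """
    copies = 0
    for is_df, knows_mac in egress_pes:
        if b_bit_set or not knows_mac:
            # Treated as BUM at the egress: only the DF forwards it.
            copies += int(is_df)
        else:
            # Looks like known unicast: every PE that knows the MAC forwards it.
            copies += 1
    return copies
```

With two egress PEs that both know the MAC, one of them the DF, the flooded frame yields two copies without the B bit and one copy with it.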