3. Data Center Topologies Overview
This section provides an overview of two general types of data center designs -- hierarchical (also known as "tree-based") and Clos-based network designs.
3.1 Traditional DC Topology
In the networking industry, a common design choice for data centers looks like an (upside-down) tree with redundant uplinks and three layers of hierarchy, namely the core, aggregation/distribution, and access layers (see Figure 1). To accommodate bandwidth demands, each higher layer, from the server towards the DC egress or WAN, has higher port density and bandwidth capacity, with the core functioning as the "trunk" of the tree-based design. To keep terminology uniform and to ease comparison with other designs, this document refers to these layers as Tier 1, Tier 2, and Tier 3 instead of the core, aggregation, and access layers.
          +------+  +------+
          |      |  |      |
          |      |--|      |           Tier 1
          |      |  |      |
          +------+  +------+
            |  |      |  |
  +---------+  |      |  +----------+
  | +-------+--+------+--+-------+  |
  | |       |  |      |  |       |  |
+----+     +----+    +----+     +----+
|    |     |    |    |    |     |    |
|    |-----|    |    |    |-----|    | Tier 2
|    |     |    |    |    |     |    |
+----+     +----+    +----+     +----+
   |         |          |         |
   |         |          |         |
   | +-----+ |          | +-----+ |
   +-|     |-+          +-|     |-+    Tier 3
     +-----+              +-----+
      | | |                | | |
  <- Servers ->        <- Servers ->
Figure 1: Typical DC Network Topology
Unfortunately, as noted previously, a tree-based design cannot be scaled far enough to handle large-scale deployments, because Tier 1 devices with a port density high enough to scale Tier 2 sufficiently cannot be acquired. In addition, the upper-tier devices must be continuously upgraded or replaced as the deployment size or bandwidth requirements increase, which is operationally complex. For this reason, REQ1 is in place, eliminating this type of design from consideration.
3.2 Clos Network Topology
This section describes a common design for a horizontally scalable topology in large-scale data centers that meets REQ1.
3.2.1 Overview
A common choice for a horizontally scalable topology is a folded Clos topology, sometimes called "fat-tree" (for example, [INTERCON] and [ALFARES2008]). This topology features an odd number of stages (sometimes known as "dimensions") and is commonly made of uniform elements, e.g., network switches with the same port count. Therefore, the choice of a folded Clos topology satisfies REQ1 and facilitates REQ2. See Figure 2 below for an example of a folded 3-stage Clos topology (3 stages, counting the Tier 2 stage twice when tracing a packet flow):
+-------+
|       |----------------------------+
|       |------------------+         |
|       |--------+         |         |
+-------+        |         |         |
+-------+        |         |         |
|       |--------+---------+-------+ |
|       |--------+-------+ |       | |
|       |------+ |       | |       | |
+-------+      | |       | |       | |
+-------+      | |       | |       | |
|       |------+-+-------+-+-----+ | |
|       |------+-+-----+ | |     | | |
|       |----+ | |     | | |     | | |
+-------+    | | |     | | |   ---------> M links
 Tier 1      | | |     | | |     | | |
           +-------+ +-------+ +-------+
           |       | |       | |       |
           |       | |       | |       | Tier 2
           |       | |       | |       |
           +-------+ +-------+ +-------+
             | | |     | | |     | | |
             | | |     | | |   ---------> N Links
             | | |     | | |     | | |
             O O O     O O O     O O O   Servers
Figure 2: 3-Stage Folded Clos Topology
This topology is also often referred to as a "Leaf and Spine" network, where "Spine" is the name given to the middle stage of the Clos topology (Tier 1) and "Leaf" is the name of the input/output stage (Tier 2). For uniformity, this document refers to these layers using the "Tier n" notation.
3.2.2 Clos Topology Properties
The following are some key properties of the Clos topology:
- The topology is fully non-blocking, or more accurately non-interfering, if M >= N, and oversubscribed by a factor of N/M otherwise. Here, M and N are the uplink and downlink port counts, respectively, of a Tier 2 switch, as shown in Figure 2.
- Utilizing this topology requires control- and data-plane support for ECMP with a fan-out of M or more.
- Tier 1 switches have exactly one path to every server in this topology. This is an important property that makes route summarization dangerous in this topology (see Section 8.2 below).
- Traffic flowing from server to server is load balanced over all available paths using ECMP.
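The first and last properties reduce to simple arithmetic, which the following Python sketch illustrates. It is not part of the design itself: the function names are hypothetical, and it assumes that all links run at the same speed and that each Tier 2 switch has exactly one uplink to each Tier 1 switch.

```python
from fractions import Fraction


def oversubscription(n_downlinks: int, m_uplinks: int) -> Fraction:
    """Oversubscription factor of a Tier 2 switch (hypothetical helper).

    Returns 1 when M >= N (non-blocking) and N/M otherwise.
    Assumes every link has the same speed.
    """
    if n_downlinks <= 0 or m_uplinks <= 0:
        raise ValueError("port counts must be positive")
    if m_uplinks >= n_downlinks:
        return Fraction(1)
    return Fraction(n_downlinks, m_uplinks)


def ecmp_paths_between_servers(m_uplinks: int, same_tier2: bool) -> int:
    """Equal-cost paths between two servers in a 3-stage folded Clos.

    Assumes one uplink from each Tier 2 switch to each Tier 1 switch,
    so servers behind different Tier 2 switches see M paths.
    """
    if m_uplinks <= 0:
        raise ValueError("uplink count must be positive")
    return 1 if same_tier2 else m_uplinks
```

For the topology in Figure 2 (M = N = 3), the first function yields a factor of 1, and two servers attached to different Tier 2 switches have three equal-cost paths between them, which is why an ECMP fan-out of at least M is required.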
3.2.3 Scaling the Clos Topology
A Clos topology can be scaled either by increasing network element port density or by adding more stages, e.g., moving to a 5-stage Clos, as illustrated in Figure 3 below:
                                     Tier 1
                                    +-----+
         Cluster                    |     |
+----------------------------+   +--|     |--+
|                            |   |  +-----+  |
|                    Tier 2  |   |           |   Tier 2
|                   +-----+  |   |  +-----+  |  +-----+
|     +-------------| DEV |------+--|     |--+--|     |-------------+
|     |       +-----|  C  |------+  |     |  +--|     |-----+       |
|     |       |     +-----+  |      +-----+     +-----+     |       |
|     |       |              |                              |       |
|     |       |     +-----+  |      +-----+     +-----+     |       |
|     | +-----------| DEV |------+  |     |  +--|     |-----------+ |
|     | |     | +---|  D  |------+--|     |--+--|     |---+ |     | |
|     | |     | |   +-----+  |   |  +-----+  |  +-----+   | |     | |
|     | |     | |            |   |           |            | |     | |
|   +-----+ +-----+          |   |  +-----+  |          +-----+ +-----+
|   | DEV | | DEV |          |   +--|     |--+          |     | |     |
|   |  A  | |  B  | Tier 3   |      |     |      Tier 3 |     | |     |
|   +-----+ +-----+          |      +-----+             +-----+ +-----+
|     | |     | |            |                            | |     | |
|     O O     O O            |                            O O     O O
|       Servers              |                              Servers
+----------------------------+
Figure 3: 5-Stage Clos Topology
The small example topology in Figure 3 is built from devices with a port count of 4. In this document, one set of directly connected Tier 2 and Tier 3 devices, along with their attached servers, is referred to as a "cluster". For example, DEV A, B, C, and D, together with the servers that connect to DEV A and B, form a cluster in Figure 3. A cluster can also serve as a single deployment or maintenance unit that is operated on at a different frequency than the entire topology.
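To give a feel for how the two scaling options compare, the sketch below estimates the maximum server count of a folded Clos built from uniform devices. This is a rough illustration under assumptions that are not stated in the text: there is no oversubscription, every device below Tier 1 splits its ports evenly between downlinks and uplinks, and Tier 1 devices use all their ports as downlinks. The function name is hypothetical.

```python
def max_servers(port_count: int, tiers: int) -> int:
    """Upper bound on servers in a non-oversubscribed folded Clos.

    Assumptions (illustrative, not from the text): uniform devices with
    `port_count` ports; every tier below Tier 1 uses half its ports as
    downlinks and half as uplinks; Tier 1 uses all ports as downlinks.
    tiers=2 corresponds to a 3-stage Clos, tiers=3 to a 5-stage Clos.
    """
    if port_count < 2 or port_count % 2:
        raise ValueError("port_count must be a positive even number")
    if tiers < 2:
        raise ValueError("a folded Clos needs at least two tiers")
    half = port_count // 2
    # Tier 1 fans out over all its ports; each lower tier multiplies
    # the reachable endpoints by its downlink count (half the ports).
    return port_count * half ** (tiers - 1)
```

Under these assumptions, devices with a port count of 4 give `max_servers(4, 2) == 8` for a 3-stage topology and `max_servers(4, 3) == 16` for a 5-stage topology, i.e., up to four clusters of the kind shown in Figure 3. Increasing the port count instead has a much larger effect, since it enters the estimate raised to the power of the tier count.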
In practice, Tier 3 of the network, which is typically Top-of-Rack switches (ToRs), is where oversubscription is introduced to allow for packaging of more servers in the data center while meeting the bandwidth requirements for different types of applications. The main reason to limit oversubscription at a single layer of the network is to simplify application development that would otherwise need to account for multiple bandwidth pools: within rack (Tier 3), between racks (Tier 2), and between clusters (Tier 1). Since oversubscription does not have a direct relationship to the routing design, it is not discussed further in this document.
3.2.4 Managing the Size of Clos Topology Tiers
If a data center network is small, it is possible to reduce the number of switches in Tier 1 or Tier 2 of a Clos topology by a factor of two. To understand how this can be done, take Tier 1 as an example. Every Tier 2 device connects to a single group of Tier 1 devices. If half of the ports on each Tier 1 device are unused, the number of Tier 1 devices can be halved by mapping two uplinks from a Tier 2 device, previously connected to different Tier 1 devices, to the same Tier 1 device. This technique maintains the same bandwidth while reducing the number of elements in Tier 1, thus saving on CAPEX. The tradeoff, in this example, is that the maximum DC size in terms of overall server count is reduced by half.
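The tradeoff described above can be made concrete with a small calculation. The sketch below is illustrative only; the function and parameter names are hypothetical, and the port counts used in the example are made-up values rather than figures from this document.

```python
def tier1_sizing(tier2_uplinks: int, tier1_ports: int,
                 parallel_links: int = 1) -> tuple[int, int]:
    """Return (tier1_devices, max_tier2_devices) for one Tier 1 group.

    `parallel_links` is the number of uplinks a Tier 2 device maps to
    the same Tier 1 device (1 = the original design, 2 = the halved
    design described in the text). Uplink bandwidth per Tier 2 device
    is unchanged because the total uplink count stays the same.
    """
    if parallel_links < 1:
        raise ValueError("parallel_links must be at least 1")
    if tier2_uplinks % parallel_links or tier1_ports % parallel_links:
        raise ValueError("port counts must be divisible by parallel_links")
    tier1_devices = tier2_uplinks // parallel_links
    # Each Tier 2 device now consumes `parallel_links` ports on every
    # Tier 1 device, so fewer Tier 2 devices can be attached.
    max_tier2_devices = tier1_ports // parallel_links
    return tier1_devices, max_tier2_devices
```

With, say, 4 uplinks per Tier 2 device and 32-port Tier 1 devices, `tier1_sizing(4, 32, 1)` gives `(4, 32)` while `tier1_sizing(4, 32, 2)` gives `(2, 16)`: half as many Tier 1 devices, and half as many Tier 2 devices (and therefore servers) that can ever be attached.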
In this example, each Tier 2 device uses two parallel links to connect to each Tier 1 device. If one of these links fails, the other picks up all traffic of the failed link. Because a Tier 2 device typically has more than two upstream Tier 1 devices, this can result in heavy congestion and quality-of-service degradation on the surviving link if the path determination procedure does not take available bandwidth into account. To avoid this situation, parallel links can be grouped in link aggregation groups (LAGs), e.g., [IEEE8023AD], with widely available implementation settings that take the whole "bundle" down upon a single link failure. Equivalent techniques that enforce "fate sharing" on the parallel links can be used in place of LAGs to achieve the same effect. As a result of such fate sharing, traffic from two or more failed links is rebalanced over the multitude of remaining paths, one through each remaining Tier 1 device. This example uses two links for simplicity; with more links in a bundle, a member-link failure has less impact on capacity.
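The effect of fate sharing on link load can be estimated with a simple model. The following sketch is a rough illustration, not a description of any particular implementation: it assumes traffic is split equally across Tier 1 next hops regardless of their bandwidth and equally across member links within a bundle, and the function name is hypothetical.

```python
def load_after_failure(tier1_devices: int, links_per_bundle: int,
                       fate_sharing: bool) -> float:
    """Relative load on the busiest surviving uplink after one
    member-link failure, where 1.0 is the per-link load before it.

    Simplified model: equal traffic share per Tier 1 next hop,
    spread evenly over that next hop's surviving member links.
    """
    if fate_sharing:
        if tier1_devices < 2:
            raise ValueError("need at least two Tier 1 devices")
        # The whole bundle goes down; its traffic is rebalanced over
        # the bundles toward the remaining Tier 1 devices.
        return tier1_devices / (tier1_devices - 1)
    if links_per_bundle < 2:
        raise ValueError("need at least two parallel links")
    # The next hop keeps its full traffic share, now carried by one
    # fewer member link.
    return links_per_bundle / (links_per_bundle - 1)
```

With four Tier 1 devices and two parallel links, the surviving link carries twice its previous load without fate sharing, but each remaining link carries only about 1.33 times its previous load with it. The gap widens as the number of Tier 1 devices grows and narrows as more links are added to each bundle.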