The Case for Rethinking SDN Infrastructure

Context

dSDN is motivated by our experiences running two WANs. Our B4 network [27] is based on cSDN, similar to other cSDN-based networks described in the literature [11, 12, 25], while our B2 network is a traditional protocol-based WAN. For capacity-aware path selection, B4 runs centralized TE. By contrast, B2 runs RSVP-TE, IS-IS, and BGP: link state and capacity-related TE attributes are disseminated using IS-IS, based on which each ingress or “head-end” router runs a constrained shortest-path computation [48] to find the shortest path with available capacity from itself to each destination. The head-end router then uses RSVP [6] to signal the other routers along the selected path, reserving capacity at each. If signaling succeeds, the path is installed in the network. If not, e.g., because some router on the signaled path no longer has enough capacity available, the head-end router tries again with a different path. Thus, each head-end acts independently and makes greedy decisions based on its local view of available capacity. Compared to cSDN’s centralized TE, RSVP-TE has been shown to suffer suboptimal paths and hence lower network efficiency (in the sense of lower network utilization) [25].
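The head-end’s greedy, capacity-aware behavior described above can be sketched as constrained SPF (CSPF): prune links without enough available capacity, run Dijkstra over what remains, then attempt an RSVP-style hop-by-hop reservation. This is a minimal illustration over a hypothetical topology, not the actual RSVP-TE machinery:

```python
import heapq

def cspf(links, src, dst, demand):
    """Constrained SPF: shortest hop-count path using only links with
    enough available capacity for `demand` (as run by each head-end)."""
    # Adjacency list over links that can still carry the demand.
    adj = {}
    for (u, v), avail in links.items():
        if avail >= demand:
            adj.setdefault(u, []).append(v)
    # Standard Dijkstra with unit link weights.
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt in adj.get(node, []):
            if d + 1 < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = d + 1, node
                heapq.heappush(heap, (d + 1, nxt))
    if dst not in dist:
        return None  # no path with sufficient capacity
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1]

def signal_and_reserve(links, path, demand):
    """RSVP-like signaling: reserve capacity hop by hop; fail if any hop
    no longer has enough (its state may have changed since CSPF ran)."""
    hops = list(zip(path, path[1:]))
    if any(links[hop] < demand for hop in hops):
        return False  # head-end would retry with a different path
    for hop in hops:
        links[hop] -= demand
    return True

# Hypothetical topology: available capacity per directed link.
links = {("A", "B"): 10, ("B", "D"): 2, ("A", "C"): 10, ("C", "D"): 10}
path = cspf(links, "A", "D", demand=5)   # B-D lacks capacity, so A-C-D
assert path == ["A", "C", "D"]
assert signal_and_reserve(links, path, demand=5)
```

Note how each head-end sees only its local copy of advertised capacities: two head-ends can both pass CSPF and then collide at reservation time, which is exactly the greedy, locally optimal behavior that makes RSVP-TE less efficient than centralized TE.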

The dSDN architecture we develop in this paper can serve as a simpler yet equally efficient alternative to B4’s cSDN architecture and as a simpler and more efficient alternative to B2’s RSVP-based architecture.

Note
  • B4 Network: cSDN with centralized TE
  • B2 Network: RSVP-TE over IS-IS + BGP

Warning
  1. RSVP-TE (Resource Reservation Protocol-Traffic Engineering)

    • RSVP-TE is an extension of the Resource Reservation Protocol (RSVP) built specifically for traffic engineering.
    • Its main job is to reserve resources (such as bandwidth) in the network for specific flows: as a flow’s path is set up, RSVP-TE sends signaling messages to each router along the path to reserve bandwidth.
    • This mechanism ensures the flow receives sufficient resources, and thereby guarantees its quality of service (QoS).
    • If any node on the path cannot meet the bandwidth requirement, RSVP-TE tries to find and reserve a different path for the flow.
  2. IS-IS (Intermediate System to Intermediate System)

    • IS-IS is an interior gateway protocol (IGP) used to exchange routing information among routers within an autonomous system (AS).
    • It is a link-state protocol: each router floods its link-state information so that all routers can build a global view of the network topology.
    • From this global view, each router runs a shortest-path-first (SPF) algorithm to compute the best path to every destination.
    • IS-IS is widely deployed in large service-provider networks because it handles complex topologies efficiently and scales relatively well.
  3. BGP (Border Gateway Protocol)

    • BGP is an exterior gateway protocol (EGP) used to exchange routing information between different autonomous systems (ASes).
    • BGP is the core routing protocol of the Internet: through it, network operators share routing information and determine the best paths across multiple networks.
    • Unlike an interior gateway protocol such as IS-IS, BGP focuses on routing between autonomous systems, and selects paths based on policies and path attributes.
    • BGP is highly flexible and can implement complex routing policies as needed.

In short: RSVP-TE reserves resources for specific flows, IS-IS handles routing within an autonomous system, and BGP handles routing between autonomous systems.

SDN WAN Control Infrastructure Is Complex

[Figure 2: Main components of a canonical cSDN implementation]

Figure 2 shows the main components of a canonical cSDN implementation. The essence of a cSDN controller is its TE algorithm. However, ensuring the controller is highly available and scalable requires more than a single controller programming all switches: a multi-level hierarchy of controllers is deployed across multiple data centers and edge locations [25, 27]. For modularity, topology discovery and traffic demand collection run as their own services, as does switch programming, and these services run on data center infrastructure. For example, in B4, the central controller runs on standard data center servers managed by a cluster management system [24, 60, 62, 65]; our edge controllers run on special servers that are co-located with routers and run a specialized SDN platform [5, 13], which is also managed by our cluster management system for infrastructure consistency.

A separate Control Plane Network (CPN) connects controllers to the routers they control, thus avoiding a recursive dependency on network connectivity being already established. The CPN must be physically present in the same locations as all data plane devices, giving it a global footprint with several thousand devices in B4. The CPN requires its own routing for self-bootstrapping, and hence runs a minimal set of traditional routing protocols.

Crucially, each of these components — hardware and software alike — is on the critical path for high availability and thus engineered accordingly, e.g., with redundancy, replication, consensus, etc., becoming fairly complex systems in their own right. For example, in B4, both our edge and central SDN controllers are robustly replicated across distinct hardware and geo-diverse data centers, running Paxos for consistency and failover. In B4, these systems represent dozens of non-trivial microservices and millions of lines of code. Hardware components are likewise architected for resilience with redundant CPN switches and links.
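The failover pattern here can be illustrated with a toy majority-quorum election: a replica may lead only if a majority of peers can reach it, so a partitioned replica cannot claim leadership. This is a deliberately simplified stand-in for the Paxos-based replication the text describes; the replica names and reachability model are hypothetical:

```python
def elect_leader(replicas, reachable):
    """Toy failover: a replica may lead only if a majority of its peers
    can reach it (a simplified stand-in for a Paxos quorum)."""
    quorum = len(replicas) // 2 + 1
    for r in sorted(replicas):            # deterministic preference order
        votes = sum(1 for peer in replicas if (peer, r) in reachable)
        if votes >= quorum:
            return r
    return None                           # no quorum: fail static

# Hypothetical deployment: three geo-diverse controller replicas.
replicas = ["ctrl-1", "ctrl-2", "ctrl-3"]
full_mesh = {(p, r) for p in replicas for r in replicas}
assert elect_leader(replicas, full_mesh) == "ctrl-1"

# ctrl-1 is partitioned away: only it can reach itself, so leadership
# fails over to ctrl-2, which still holds a majority.
partitioned = {(p, r) for p in replicas for r in replicas
               if "ctrl-1" not in (p, r)} | {("ctrl-1", "ctrl-1")}
assert elect_leader(replicas, partitioned) == "ctrl-2"
```

The point of the sketch is the cost, not the mechanism: even this toy needs a reachability model and a quorum rule, and the production version (full Paxos across geo-diverse data centers) is one of the "fairly complex systems in their own right" the text refers to.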

Finally, to avoid a complete loss of connectivity in case of failure in the cSDN control stack, every SDN WAN that we are aware of continues to run IS-IS and BGP [11, 25, 27], programming forwarding entries at a lower priority to provide backup connectivity. These backup paths are insufficient for longer term operation because their placement is capacity unaware, and hence their use can lead to congestion, but they at least provide connectivity.
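The layering of SDN paths over protocol-computed backups can be pictured as route preference in a RIB: among the candidates installed for a prefix, the lowest preference value wins, and withdrawing the SDN entries exposes the IS-IS/BGP backups. A minimal sketch with hypothetical preference values and next-hop names:

```python
# Minimal RIB sketch: each prefix maps to candidate routes installed at
# different preferences; forwarding uses the lowest preference present.
SDN_PREF, BACKUP_PREF = 10, 100   # hypothetical preference values

rib = {
    "10.0.0.0/24": {
        SDN_PREF: "te-tunnel-7",      # capacity-aware path from the controller
        BACKUP_PREF: "isis-nexthop",  # capacity-unaware IS-IS/BGP backup
    }
}

def best_route(rib, prefix):
    """Select the installed route with the lowest (best) preference."""
    routes = rib[prefix]
    return routes[min(routes)]

assert best_route(rib, "10.0.0.0/24") == "te-tunnel-7"

# cSDN control stack fails: its entries are withdrawn, backups remain.
del rib["10.0.0.0/24"][SDN_PREF]
assert best_route(rib, "10.0.0.0/24") == "isis-nexthop"
```

This captures the trade-off in the text: the backup entry preserves connectivity, but its placement was computed without capacity awareness, so sustained use of it can congest links.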

In summary, as shown in Figure 2, modern WAN control planes span many non-trivial components, both external (shown in green) and on the box (shown in blue).

Using the CPN to Eliminate the Recursive Dependency

In this sentence, “a recursive dependency on network connectivity being already established” refers to the circular dependency that would arise between controllers and routers without a separate Control Plane Network (CPN).

  • Recursive dependency: part of a system depends on its own state, or on a condition that must be satisfied first, forming a cycle. For example, if a controller must go through a router to establish the very connection it needs in order to control that router, a recursive dependency results.

  • Network connectivity being already established: without an independent control plane network, the connection between controllers and routers rides on the data plane itself. That is, the connectivity the controller is supposed to manage (the data plane) must already exist before the controller can talk to the routers. If that connectivity has not yet been established, or has failed, the controller cannot reach the routers, and therefore cannot repair or establish the connectivity either: a deadlock.

Why must the recursive dependency be avoided?

If controller-router communication depends on the data plane, then any data-plane problem (not yet bootstrapped, or failed) cuts the controller off from the routers and prevents it from repairing the network. Availability drops sharply, and recovery becomes very difficult.

CPN design

An independent Control Plane Network (CPN) is built specifically for controller-router communication and is fully independent of the data plane (the network that actually carries user traffic).

  1. Physically separate network or logical isolation

    • Physical separation: the CPN is usually an independent physical network, with its own links and switches separate from the data-plane infrastructure, so it keeps running even when the data plane fails.
    • Logical isolation: in some cases the CPN shares physical infrastructure with the data plane but is isolated logically (e.g., via VLANs), so that CPN operation is unaffected by data-plane failures.
  2. Dedicated routing and protocols

    • The CPN usually runs its own routing to carry control traffic between controllers and routers. These protocols are kept simple and reliable so that control information propagates quickly.
    • For example, the CPN may run a minimal set of traditional routing protocols so that controllers can always find and reach the routers they manage, even in a complex network environment.
  3. Direct controller-router communication

    • Via the CPN, controllers connect directly to each router in the network, so they can send instructions without depending on data-plane state.
    • This direct channel lets controllers perform critical operations even during data-plane failures: reconfiguring routes, restoring connectivity, repairing broken paths, and so on.
  4. Self-bootstrapping and recovery

    • The CPN can bootstrap and recover on its own. When the data plane is down, the CPN keeps working and carries the controller’s recovery instructions to the routers, helping restore the data plane.
    • The CPN itself is typically built with heavy redundancy and fault tolerance so that it stays up even under extreme conditions.

Impact on Availability: Examples from B4

External infrastructure introduces new failure modalities.

cSDN’s external infrastructure results in a loss of fate-sharing and thus an expanded space of possible failures: any subset of the control infrastructure can get disconnected from the data plane, preventing reconvergence. In such scenarios, the common strategy is to “fail static” [25, 27], but this is not a panacea, since the network’s forwarding state becomes increasingly stale as the problem persists. In one incident, we failed static when a misconfiguration on a CPN switch disconnected a portion of the CPN. This interacted poorly with an unrelated maintenance operation that was attempting to reintroduce (previously removed) capacity into the network. Failing static prevented the restored capacity from being used, leading to severe congestion that lasted for the entire time it took to diagnose and fix the CPN problem.
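One way to picture “failing static”: when the controller becomes unreachable, the router keeps forwarding with its last-programmed state instead of flushing it, and that state grows staler the longer the disconnection lasts, which is exactly what made the restored capacity unusable in the incident above. A toy sketch (all names hypothetical):

```python
class FailStaticAgent:
    """Toy switch agent: applies controller updates while connected and
    freezes (fails static on) the last-programmed state otherwise."""
    def __init__(self):
        self.state = {}
        self.connected = True
        self.updates_missed = 0   # rough staleness measure

    def controller_update(self, entries):
        if self.connected:
            self.state = dict(entries)
            self.updates_missed = 0
        else:
            # Update could not be delivered: forwarding state goes stale.
            self.updates_missed += 1

    def disconnect(self):
        self.connected = False

agent = FailStaticAgent()
agent.controller_update({"10.0.0.0/24": "link-a"})
agent.disconnect()
# Controller tries to shift traffic onto restored capacity via link-b,
# but the update never lands: the router forwards on frozen, stale state.
agent.controller_update({"10.0.0.0/24": "link-b"})
assert agent.state == {"10.0.0.0/24": "link-a"}
assert agent.updates_missed == 1
```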

External infrastructure expands the exposure to bugs.

Every component in Figure 2 can introduce bugs with the potential to take down the network. In B4, a bug in our topology service caused the topology information supplied to TE to be incomplete (i.e., several links were missing), which reduced the set of available paths and caused severe congestion, with packet loss exceeding 50% for several minutes. Similarly, prior work has reported bugs in the SDN programmer that installs forwarding entries [31], in controller backup implementations [27], in network-state collection infrastructure [11], and in the CPN [19]. We have also experienced outages caused by cSDN’s dependence on data-center management systems. During one routine maintenance, a missing cluster-management configuration flag associated with the edge-controller jobs caused those jobs to be incorrectly terminated, triggering a widespread network outage that lasted several hours.

Legacy Protocols Remain.

Retaining traditional protocols was perhaps not part of the original SDN vision. However, to mitigate risk in the early days of deployment, operators retained their well-known protocol-based control plane as a backup. 15+ years later we continue to experience SDN outages, because of which we, and all other WAN operators that we are aware of, continue to maintain this backup. B4 thus runs IS-IS, BGP, and a form of FRR [1] in addition to cSDN. The main simplification compared to our protocol-based network is the removal of RSVP-TE, which is a valuable simplification but far from having eliminated our dependence on vendor protocols. In fact, in addition to the challenges of testing, configuring, and operating protocols, we must now contemplate potential interactions between two control paradigms that were never designed to coexist; e.g., in one published outage [31], when the SDN controller failed, the network fell back to IS-IS and BGP, but a misconfiguration of IGP weights led to severe congestion and packet loss, highlighting the burden on operators to master both TE and IS-IS configuration.
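The IGP-weight failure mode is easy to reproduce in miniature: with the intended weights, two equal-cost paths share the load, while a single wrong weight collapses all traffic onto one path. The diamond topology and all weights below are hypothetical:

```python
def path_cost(weights, path):
    """Sum of IGP link weights along a path given as a node list."""
    return sum(weights[hop] for hop in zip(path, path[1:]))

# Diamond topology: A reaches D via B or via C.
intended = {("A", "B"): 1, ("B", "D"): 1, ("A", "C"): 1, ("C", "D"): 1}
# Equal costs: ECMP spreads traffic across both sides, avoiding congestion.
assert path_cost(intended, ["A", "B", "D"]) == path_cost(intended, ["A", "C", "D"])

# One misconfigured weight: every flow now prefers A-C-D, overloading it
# while the A-B-D side sits idle.
misconfigured = dict(intended)
misconfigured[("A", "B")] = 10
assert path_cost(misconfigured, ["A", "C", "D"]) < path_cost(misconfigured, ["A", "B", "D"])
```

Because the backup IGP carries no traffic in normal operation, such a weight error can sit latent for months and only surface during an SDN outage, which is what makes maintaining both paradigms so costly.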

[Figure 3: Root causes of B4’s 41 largest outages; 52% are control-related]

In summary, the complexity of our cSDN control plane in B4 is a dominant contributor to our most challenging outages: Figure 3 shows that 52% of our 41 largest outages have a control-related root cause. We emphasize that our point here is not that we must eradicate external control infrastructure. On the contrary, such infrastructure is likely still needed for configuring routers, monitoring, upgrades, etc. Instead, our point is that we should avoid putting these components on the critical path for network availability.

Control-related components are important, but they are best kept off the critical path.

Technologies that Enable a New Architecture

The past decade has brought two key developments which, taken together, enable an alternative routing architecture. The first is that network OS developers [23, 29, 66] have adopted Linux and mainstream container technologies [42] enabling operator-defined application code to be practically deployed on routers. The second is the emergence of a new generation of control and management APIs that enable 3rd-party code to interact with router internals in a manner that is both comprehensive and vendor-agnostic. While early protocols such as OpenFlow enabled access to a router’s forwarding table, other state and configuration was not easily accessible [33]. These new APIs go further in enabling general access to the router’s RIB, internal counters, configuration, etc. [8, 38, 40, 51, 57]. They allow operator and vendor code to coexist on the same platform, and allow operator code to be portable across platforms and router OSes.
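The flavor of these APIs can be sketched as a gNMI-style interface: device state is addressed by hierarchical paths and manipulated with generic get/set verbs, so the same operator code works across vendors and router OSes. The datastore and paths below are a mock for illustration, not a real device schema:

```python
class MockDevice:
    """Mock of a vendor-agnostic, gNMI-style management interface:
    hierarchical paths with generic get/set verbs (paths illustrative)."""
    def __init__(self):
        self._store = {
            "rib/ipv4/10.0.0.0-24/next-hop": "192.0.2.1",
            "interfaces/eth0/counters/in-octets": 123456,
            "config/hostname": "router-17",
        }

    def get(self, path):
        return self._store[path]

    def set(self, path, value):
        self._store[path] = value

# Operator code reads counters and rewrites a RIB entry through the same
# generic interface the vendor stack exposes, with no vendor-specific calls.
dev = MockDevice()
assert dev.get("interfaces/eth0/counters/in-octets") == 123456
dev.set("rib/ipv4/10.0.0.0-24/next-hop", "192.0.2.254")
assert dev.get("rib/ipv4/10.0.0.0-24/next-hop") == "192.0.2.254"
```

The contrast with OpenFlow is the breadth of the addressable state: not just forwarding tables, but the RIB, counters, and configuration, all behind one path-addressed surface.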

New Tech
  1. Linux and mainstream container technologies (Docker)
  2. New control and management APIs
  3. Now, operator and vendor code can coexist on the same platform

In combination, these developments mean a network operator can now deploy custom code directly on routers rather than external infrastructure.
