
CrystalNet: Faithfully Emulating Large Production Networks

An overview of a large-scale, high-fidelity network emulator that runs on public clouds.

Note

The content of this paper is well worth learning from.

It is not a good fit for DTC-Protocol-Lab; it is a better match for DTCLab.

As a model of PLC design, it makes excellent study material.

TLDR

I first had Gemini read through it; it turns out to be less closely related than the previous two papers, SimBricks and Phantom.

(1) Background and motivation

  • Core problem:
    • Large production networks (e.g., Azure) are extremely complex; device failures, software bugs, configuration errors, and human mistakes make service outages easy to trigger
  • Limitations of existing tools:
    • Hardware testbeds: too small to catch problems that only emerge from complex interactions in large topologies through integration testing
    • Formal verification: tools like Batfish assume idealized device behavior and cannot find bugs in the device firmware itself
    • Small emulators: tools like Mininet are hard to scale to cloud size and struggle with heterogeneous device firmware
  • Statistics:
    • Among Microsoft's network incidents, 36% stem from software bugs, 27% from configuration errors, and 6% from human errors

(2) Core design of CrystalNet

CrystalNet is a high-fidelity network emulator running on public clouds, built so that operators can validate changes accurately before they touch production

  • High fidelity:
    • Runs real device firmware (in containers or VMs)
    • Loads real production configurations
    • Focuses on faithfully emulating the control plane, not data plane performance
  • Scalability:
    • Designed to run on public cloud VMs, connected across VMs by an overlay network
    • Can bring up an emulated network of 5,000 devices within minutes
  • Heterogeneity:
    • PhyNet container layer: a unified "PhyNet" container layer emulates physical interfaces and links, hiding the low-level differences between vendors' devices (container or VM firmware)
    • Supports mixing switch OSes from different vendors, and can even attach real hardware devices

(3) Key technical contributions

  • Virtual links:
    • VXLAN tunnels build the virtual links, which can cross public clouds and even the Internet while behaving like Ethernet
  • Safe static boundary:
    • Since the whole Internet (or any external network) cannot be emulated, CrystalNet introduces "Speaker Devices" to mock up the boundary
    • It defines the notion of a "Safe Static Boundary" and an algorithm for finding it: even as the emulated interior churns, the external speakers never need to react dynamically for the routing state to stay correct
    • For BGP datacenter networks, the algorithm identifies a minimal safe boundary, cutting resource consumption by more than 90%

Introduction

CrystalNet is a high-fidelity, cloud-scale network emulator in daily use at Microsoft. We built CrystalNet to help our engineers in their quest to improve the overall reliability of our networking infrastructure. A reliable and performant networking fabric is critical to meet the availability SLAs we promise to our customers.

It is notoriously challenging to run large networks like ours in a reliable manner [11, 13, 15, 31]. Our network consists of tens of thousands of devices, sourced from numerous vendors, and deployed across the globe. These devices run complex (and hence bug-prone) routing software, controlled by complex (and hence bug-prone) configurations. Furthermore, churn is ever-present in our network: apart from occasional hardware failures, upgrades, new deployments and other changes are always ongoing.

The key problem is that in such a large and complex environment, even small changes or failures can have unforeseen and near-disastrous consequences [16]. Worse yet, there are few tools at our disposal to proactively gauge the impact of failures, bugs or planned changes in such networks.

Small hardware testbeds [1, 2] are used to unit-test or stress-test new network devices before they are added to the network. While useful, these cannot reveal problems that arise from complex interactions in a large topology.

Network verification tools such as Batfish [13] ingest topology and configuration files, and compute forwarding tables by simulating the routing protocols. These forwarding tables can be analyzed to answer a variety of reachability questions. However, Batfish cannot account for bugs in routing software. Nor can it account for subtle interoperability issues that result from slight differences in different vendors' implementations of the same routing protocol. In other words, Batfish is not "bug compatible" with the production network. In our network, nearly 36% of the problems are caused by such software errors (Table 1). Note that there is no way to make Batfish bug compatible – often, the bugs are unknown to the device vendors themselves until they manifest under certain conditions. Also, Batfish presents a very different workflow to the operators of the network. This means it is not suitable for preventing human errors, which are responsible for a non-negligible 6% of the outages in our network.

What we needed was a large scale, high-fidelity network emulator that would allow us to accurately validate any planned changes, or gauge the impact of various failure scenarios. Small-scale network emulators such as MiniNet [18] or GNS3 [3] are useful, but have many deficiencies (see §10), and do not scale to the level required to emulate large public cloud networks.

To address this gap, we built CrystalNet, a highly scalable, and high-fidelity network emulator. High fidelity means that the emulator accurately mimics the behavior of the production network, especially in the control plane. Further, it allows the operators to use the exact same tools and workflows that they would on a production network.

We do not claim full fidelity - that requires creating a complete replica of the production network, which is infeasible. Thus, it is not our goal to faithfully emulate the network dataplane (latency, link bandwidth, traffic volume etc.). Our focus is on the control plane.

To accurately mimic the control plane, CrystalNet runs real network device firmwares in virtualized sandboxes (e.g., containers and virtual machines). Such VMs or containers are available from most major router vendors. We inter-connect the device sandboxes with virtual links to mimic the real topology. CrystalNet loads real configurations into the emulated devices, and injects real routing states into the emulated network.

Operators can interact (i.e. change, upgrade, or monitor) with the emulated network using the same tools and scripts they use to interact with the production network. They can inject packets in the emulated network and monitor their paths. CrystalNet can even be extended to include real hardware in the emulated network.

Our network engineers use CrystalNet on a daily basis. They have caught several errors in their upgrade plans which would have been impossible to catch without an emulator like CrystalNet. We report some of their experiences in §7.

CrystalNet is scalable and cost-effective. To emulate a network of 5,000 devices, we need just a few minutes and 500 VMs (4 cores, 8GB RAM). Such VMs retail for USD 0.20/hour each, so the total cost of emulating such a large network with CrystalNet is USD 100/hour. This is minuscule compared to the cost of a network outage.

Three key features allow CrystalNet to scalably emulate large, heterogeneous networks; they are also our major contributions in this paper. First, CrystalNet is designed from the ground up to run in the public cloud. If necessary, CrystalNet can even simultaneously use multiple public and private clouds. This allows CrystalNet to scale to levels well beyond those possible with MiniNet and GNS3. Since VM failures are likely to happen in any large-scale deployment, CrystalNet allows saving and restoring emulation state, and quick incremental changes to the emulation.

Second, CrystalNet can accommodate a diverse range of router software images from our vendors. The router images are either standalone VMs or Docker containers. To accommodate and manage them uniformly, we mock up the physical network with homogeneous containers and run heterogeneous device sandboxes on top of the containers' network namespaces. CrystalNet also allows our engineers to access the routers in a standard manner via Telnet or SSH. CrystalNet can also include on-premise hardware devices in the emulated network in a transparent manner. This requires careful traversal of NATs and firewalls in the path.

Third, CrystalNet accurately mocks up external networks, transparently to the emulated network. An emulated network has to have a boundary, beyond which there are no emulated devices. Apart from resource constraints (one cannot emulate the whole Internet), the fact is that we cannot obtain the configurations and device firmware of devices that are outside our administrative control (e.g., our upstream ISP). We use lightweight passive agents that mimic the announcements emulated devices would receive from beyond the boundary. Since the agents do not respond to dynamics in the emulated network, we ensure the correctness of the results by identifying and searching for a safe boundary (§5). Computing the right boundary can also save resources: indeed, it can cut the cost of emulation by 94-96% while maintaining high fidelity (§8.4).

Before describing CrystalNet in more detail, we first discuss the outages in our network over the last two years.


(1) Small hardware testbeds

Commonly used to unit-test or stress-test new network devices before they go live

Useful as they are, they cannot reveal problems that arise from complex interactions in a large topology.

(2) Network verification tools such as Batfish

They ingest topology and configuration files and compute forwarding tables by simulating the routing protocols. Analyzing these forwarding tables answers a variety of reachability questions. However, Batfish cannot catch bugs in routing software.

Nor can it handle the subtle interoperability issues caused by slight differences in how different vendors implement the same routing protocol.

In other words, Batfish cannot be "bug compatible" with the production network.

In our network, nearly 36% of problems are caused by such software errors (Table 1). Notably, there is no way to make Batfish bug compatible: the vendors themselves typically discover these bugs only when they manifest under specific conditions. Moreover, Batfish presents operators with a very different workflow, so it cannot help prevent human errors, which account for a non-negligible 6% of our outages.

What we need is a large-scale, high-fidelity network emulator that lets us accurately validate any planned change and gauge the impact of various failure scenarios.

(3) Small-scale emulators such as MiniNet or GNS3

Useful, but they have many deficiencies (see §10) and do not scale to the level required to emulate large public cloud networks.

To fill this gap, we built CrystalNet, a highly scalable, high-fidelity network emulator.

High fidelity means the emulator accurately mimics the behavior of the production network, especially the control plane.

It also lets operators use exactly the same tools and workflows they use on the production network.

We do not claim full fidelity! That would require a complete replica of the production network, which is infeasible.

Hence: faithfully emulating the data plane (latency, link bandwidth, traffic volume) is not a goal; the focus is the control plane.

To mimic the control plane accurately, CrystalNet runs real network device firmware in virtualized sandboxes (e.g., containers and VMs):

  1. Most major router vendors provide such VM or container images
  2. The device sandboxes are interconnected via VXLAN to mimic the real topology
  3. Real configurations are loaded into the emulated devices, and real routing state is injected into the emulated network


Three key features let CrystalNet emulate large, heterogeneous networks at scale; they are also the paper's main contributions:

  1. Designed for public clouds:
    • CrystalNet was built from the ground up to run in the public cloud; if necessary, it can even use multiple public and private clouds simultaneously
    • This lets it scale far beyond MiniNet and GNS3
    • Since VM failures are inevitable at scale, CrystalNet supports saving and restoring emulation state, as well as quick incremental changes to a running emulation
  2. Accommodates heterogeneous device software:
    • CrystalNet accommodates the diverse router software images our vendors supply
    • The images come either as standalone VMs or as Docker containers
    • To host and manage them uniformly, homogeneous containers mock up the physical network, and the heterogeneous device sandboxes run on top of those containers' network namespaces
    • Engineers can also access the routers in the standard way, via Telnet or SSH
    • CrystalNet can even transparently include on-premise hardware devices in the emulated network
    • This requires careful traversal of NATs and firewalls along the path
  3. Accurately and transparently mocks up external networks:
    • CrystalNet mocks up external networks transparently to the emulated network
    • An emulated network must have a boundary, beyond which nothing is emulated
    • Beyond resource limits (one cannot emulate the whole Internet), the reality is that we cannot obtain configurations and firmware for devices outside our administrative control (e.g., upstream ISPs)
    • Lightweight passive agents mimic the announcements that emulated devices would receive from beyond the boundary
    • Since the agents do not react to dynamics inside the emulated network, correctness is ensured by identifying and searching for a safe boundary
    • Computing the right boundary also saves resources: it can cut the cost of emulation by 94-96% while maintaining high fidelity.

Motivation

Table 1 summarizes the O(100) network incidents in our network over the past two years, and their root causes. The categories are broad, and somewhat loosely defined; what matters are the details of individual scenarios, as below.


Software bugs: This category includes incidents caused by issues in device firmware as well as bugs in our own network management tools, although most incidents are due to bugs in device firmware.

Examples of bugs in our own automation tools include an unhandled exception that caused a tool to shut down a router instead of a single BGP session.

Device software issues come in many forms. Some are outright bugs: for example, new router firmware from a vendor erroneously stopped announcing certain IP prefixes. In another case, ARP refreshing failed when peering configuration was changed.

Another set of problems arise out of ambiguity, rather than bugs. Different versions of the network devices from the same vendor usually have slightly different configuration definitions. For instance, a vendor changed the format of ACLs in the new release, but neglected to document the change clearly. As a result, the old configuration files were processed incorrectly by switches running the new firmware.

Devices often exhibit vendor-dependent behavior in the implementation of otherwise standard protocols/features, e.g., how to select BGP paths for aggregated IP prefixes, or how to process routes after the FIB is full, etc. Such corner cases are often not well documented. For example, Figure 1 shows a simplified version of a problem we saw in production. IP prefixes P1 and P2 belong to router R1 with AS number "1". When higher layer routers R6 and R7 get the announcements of these two prefixes, they want to aggregate them into a single one (P3). However, R6 and R7 are from different vendors, and they have different behaviors when selecting the AS path of P3: R6 learns different paths for P1 and P2 from R2 (with AS path { 2, 1 }) and R3 (with AS path { 3, 1 }), and it selects one of them and appends its own AS number before announcing P3 to R8 ({ 6, 2, 1 } in this example); R7 faces a similar situation, but it does not select any paths and only puts its own AS number as the AS path in the announcement of P3 to R8. As a result, R8 always prefers to send packets for P3 to R7 because it thinks R7 has a lower cost, causing severe traffic imbalance.
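To make the divergence concrete, here is a toy Python model of the two aggregation behaviors; the vendor logic is paraphrased from the scenario above, not taken from any real firmware:

```python
# Toy model of the Figure 1 incident: two vendors aggregate P1/P2 into P3
# but construct the aggregate's AS path differently, so R8's shortest-AS-path
# tie-break sends all P3 traffic toward R7.
def r6_aggregate(learned_paths, own_asn=6):
    # Vendor A: pick one learned path and prepend its own ASN.
    return [own_asn] + min(learned_paths, key=len)

def r7_aggregate(learned_paths, own_asn=7):
    # Vendor B: discard the learned paths; the aggregate carries only its own ASN.
    return [own_asn]

learned = [[2, 1], [3, 1]]                 # P1 via R2, P2 via R3
candidates = {"R6": r6_aggregate(learned), "R7": r7_aggregate(learned)}
best = min(candidates, key=lambda r: len(candidates[r]))
print(candidates, "->", best)              # {'R6': [6, 2, 1], 'R7': [7]} -> R7
```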

Sometimes, different system components that perform correctly in individual capacity, do not interact well, especially after a change. For instance, a software load balancer owned a /16 IP prefix. However, it was asked to release some IP blocks in the prefix and give them to other load balancers. It then broke the /16 IP prefixes into 256 × /24 IP blocks and announced the blocks (about 100) that it held. However, a router connected to the load balancer was short of FIB space and dropped many of these announcements, causing traffic black holes.

Note that these errors escaped the fairly rigorous unit testing done by our vendors as well as our own pre-certification checks. While more rigorous unit testing is always helpful, it is impossible to cover the vast range of possible inputs and conditions that occur in production environment. Full-fledged integration testing would be impractically resource intensive – unless one used a high fidelity emulator like CrystalNet. We do not claim that CrystalNet can uncover all bugs – only that by letting operators test router firmwares, tools and planned changes in a high-fidelity emulation would reduce the possibility of such bugs impacting production networks.

We note once again that network verification systems [11-13, 15, 23, 30] cannot account for the impact of such bugs, since they rely on analyzing configurations, and assume ideal, bug-free behavior from network components. One may think that the systems can be updated to model the bugs – but many of the bugs are "unknown" until they manifest themselves in production networks. Moreover, such systems are even less effective when the network has components like software load balancers, whose behavior is "baked" into custom software, rather than driven by configurations and governed by standards. One can never fully and accurately model the behavior of such components.



Configuration bugs: Network configurations are not just for controlling the behavior of routing protocols – their design must also account for issues like forwarding table capacity, CPU load of devices, IPv4 shortage and reuse, security and access control, etc. Taken together, this makes our network configuration policies quite complicated. As a result, 27% of outages result from what can be termed configuration errors, such as missing or incorrect ACLs that violate security policy, overlapping IP assignments, incorrect AS numbers, etc.

Our devices are initially configured automatically, using a configuration generator similar to [9, 28]. Most of the incidents in this category were triggered due to ad-hoc changes to configurations during failure mitigation or planned updates. By testing such changes with CrystalNet, the possibility of such errors impacting production networks can be reduced.


Human errors: We define "human errors" as manual actions that clearly mismatch their intention, resulting in an error of some kind, e.g. mistyping "deny 10.0.0.0/20" as "deny 10.0.0.0/2". Human errors surprisingly cause a non-negligible portion (6%) of the incidents. One might argue that this is due to carelessness and cannot be remedied. However, after conversations with experienced operators, we found a more important systematic reason: operators do not have a good environment to test their plans and practice their operations with actual device command interfaces. CrystalNet can provide such an environment. Network verification systems like Batfish present a different workflow than what the operators would carry out in practice, and hence cannot reduce the occurrence of such errors.

Summary: The analysis of these incidents underscores the fact that numerous different types of bugs and errors can affect large, complex networks. Testing and planning with a high-fidelity network emulator like CrystalNet can catch many of these bugs and errors before they disrupt production networks; while traditional network verification systems offer much more limited succor.


CrystalNet Design Overview

In this section, we describe the design goals, the overall architecture and the programming interfaces of CrystalNet.

3.1 Design goals

The ultimate goal of CrystalNet is to provide high fidelity network emulations to operators. To meet this goal, CrystalNet has three key properties:

Ability to scale out using public clouds: Resources required for faithful emulation of large networks are well beyond the capacity of a single server, or even a small cluster of servers. For example, a single Microsoft datacenter can consist of thousands of routers. Emulating each router requires non-trivial amounts of CPU, RAM, etc. Even more resources are needed if we consider middleboxes and inter-datacenter scenarios. Computing resources at this scale are only available from public cloud providers in the form of VMs. Thus, to ensure that there is no upper limit on the scale of emulated networks, CrystalNet must be able to run in a distributed manner, over a large number of VMs in a public cloud environment. Everything should easily scale out – e.g. to double the emulated network size, the operators simply need to allocate twice the computing resources.

Ability to transparently mock up physical networks: A switch OS assumes it runs on top of a physical switch that has multiple network interfaces and connects to neighboring devices. Management tools assume each network device can be reached at an IP address via Telnet or SSH. CrystalNet must create virtual network interfaces, virtual links and virtual management networks that are transparent to switch OSes and management tools, so that the latter can work without modifications.

Ability to transparently mock up external networks: An emulated network always has a boundary – we cannot emulate the whole Internet. This is not just a resource issue; the key problem is that operators cannot obtain OS images or configurations of devices outside their management domain. CrystalNet must accept the fact that a boundary exists, and ensure high fidelity even though devices outside the boundary are not emulated.

In addition, we also desire properties such as failure resilience and cost efficiency. Next, we describe how the design of CrystalNet achieves these goals.

The ultimate goal of CrystalNet is to give operators high-fidelity network emulations. To meet it, CrystalNet has three key properties:

  • Ability to scale out on public clouds:

    • Faithfully emulating a large network takes resources far beyond a single server or even a small cluster
      • For example, a single Microsoft datacenter can contain thousands of routers, and emulating each one takes non-trivial CPU, RAM, etc.
      • Middleboxes and inter-datacenter scenarios demand even more
    • Computing resources at this scale are only available from public cloud providers, in the form of VMs
      • So, to ensure there is no upper limit on emulation size, CrystalNet must run distributed across many VMs in a public cloud environment
      • Everything should scale out easily: to double the emulated network, operators simply allocate twice the compute
  • Ability to transparently mock up physical networks:

    • A switch OS assumes it runs on a physical switch with multiple network interfaces connected to neighboring devices
    • Management tools assume every device is reachable at an IP address via Telnet or SSH
    • CrystalNet must create virtual interfaces, virtual links, and a virtual management network that are transparent to the switch OS and management tools, so both work unmodified
  • Ability to transparently mock up external networks:

    • An emulated network always has a boundary; we cannot emulate the whole Internet
    • This is not just a resource issue; the key problem is that operators cannot obtain OS images or configurations for devices outside their management domain (e.g., upstream ISPs)
    • CrystalNet must accept that the boundary exists and preserve high fidelity even though devices beyond it are not emulated

Beyond these, we also want properties such as failure resilience and cost efficiency. The next sections describe how the design of CrystalNet achieves these goals

3.2 Architecture

Figure 2 shows the high-level architecture of CrystalNet. The orchestrator is the "brain" of CrystalNet. It reads information about the production network, provisions VMs (e.g., VM A) on clouds, starts device virtualization sandboxes (e.g., T1) in the VMs, creates virtual interfaces inside the sandboxes, builds overlay networks among the sandboxes, and introduces external device sandboxes (e.g., B1) to emulate external networks. With aggressive batching and parallelism, the orchestrator runs on a single commodity server and easily handles O(1000) VMs.

CrystalNet is easy to scale out. The overlay network ensures that the emulation can run on top of any VM cluster (with sufficient resources) without any modifications.

The emulated network in CrystalNet is transparent. Each device's network namespace has the same Ethernet interfaces as the real hardware; the interfaces are connected to remote ends via virtual links which transfer Ethernet packets just like real physical links; and the topology of the overlay network is identical to the real network it emulates (§4). Therefore, the device firmware cannot distinguish whether it is running inside a sandbox or on a real device. In addition, CrystalNet creates a management overlay network which connects all devices and jumpbox VMs. Management tools can run inside the jumpboxes and reach the devices in the same way as in production.

The emulation boundary of CrystalNet is transparent. The external devices for emulating external networks provide the same routing information as in real networks. Also, as discussed in §5, the boundary is carefully selected, so that the state of the emulated network is identical to real networks even if the emulated network is under churn.

The emulated network is highly available, because VMs are independently set up – a VM does not need to know the setup of any other VMs. Thus, the orchestrator can easily detect and restart a failed VM.

CrystalNet achieves cost efficiency by putting multiple devices on each VM, and by picking the right devices to emulate rather than blindly emulating the entire network (§5.2).


3.3 CrystalNet APIs

The orchestrator exposes an API that operators use to configure, create, and delete emulations, and also to run various tests and observe network state for validation. The API, shown in Table 2, is inspired by the validation workflows which network operators desired to run.

Figure 3 illustrates the typical workflow of a network configuration update. First, Prepare is called to take a snapshot of the production environment, spawn VMs, and feed those as input into Mockup. Prepare includes functionality to get the necessary topology information, device configurations, and boundary route announcements (see §5), as well as VM planning based on the topology. Mockup creates the virtual network topology (§4) and the emulation boundary (§5), and starts the emulated device software.

After Mockup, CrystalNet is ready for testing the update steps. At each step, operators can choose to apply significant changes like booting a new device OS or updating the whole configuration with Reload, or use existing tools for incremental changes via the management plane (§4).

Next, the operators can pull the emulation state (e.g. routing tables at each device) using monitoring APIs, as well as their own tools, to check whether the changes they made had the intended effect. CrystalNet also supports packet-level telemetry [32] for this purpose. Operators specify the packets to be injected and CrystalNet injects them with a pre-defined signature. All emulated devices capture all seen packets, filter and dump traces based on the signature. These traces can be used for analyzing network behavior.

With the ability to obtain routing tables, packet traces and the ability to login to emulated devices and check device status (see Table 2), operators using CrystalNet can validate an emulated network using their preferred methodologies, e.g. injecting test traffic, verifying routing tables with reactive data plane verification tools [22], etc. If the results are as expected, operator can move onto the next step. Otherwise, operators revert current update with Reload, fix the bugs and try again. This process repeats until all update steps are validated. In the end, Destroy is called to release VMs.

CrystalNet also offers several helper APIs such as List all emulated devices, Login to a device, etc. We omit the details.

The key part of CrystalNet is to Mockup a high-fidelity environment that supports this unified API set and is cost-effective. We discuss it in the next two sections.


Figure 3 shows the typical workflow for a network configuration update:

(1) First, Prepare is called to snapshot the production environment, spawn VMs, and feed them as input to Mockup

Prepare covers fetching the necessary topology information, device configurations, and boundary route announcements (see §5), plus VM planning based on the topology

(2) Mockup creates the virtual network topology and the emulation boundary, and starts the emulated device software

(3) After Mockup, CrystalNet is ready for testing the update steps

At each step, operators can apply big changes, such as booting a new device OS or replacing the whole configuration with Reload, or make incremental changes through the management plane using their existing tools

(4) Next, operators pull the emulation state (e.g., each device's routing table) through the monitoring APIs, or with their own tools, to check that the changes had the intended effect

CrystalNet also supports packet-level telemetry [32] for this purpose:

  1. Operators specify the packets to inject
  2. CrystalNet injects them with a pre-defined signature
  3. All emulated devices capture every packet they see, then filter and dump traces by signature; the traces can be used to analyze network behavior

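To illustrate the loop, here is a minimal Python sketch of the Figure 3 workflow. The orchestrator bindings are stubbed out, since the real Prepare/Mockup/Reload/Destroy APIs (Table 2) are internal to CrystalNet; every function body below is a placeholder:

```python
from dataclasses import dataclass, field

@dataclass
class Emulation:
    """Stand-in handle for an emulation; fields are illustrative only."""
    devices: list
    log: list = field(default_factory=list)

def prepare(devices):            # snapshot topology/configs/routes, spawn VMs
    return Emulation(devices=devices)

def mockup(emu):                 # build virtual topology + boundary, boot firmware
    emu.log.append("mockup")

def reload_step(emu, step):      # apply (or revert) one update step
    emu.log.append(step)

def validate(emu) -> bool:       # pull routing tables / packet traces, check intent
    return True

def destroy(emu):                # release the VMs
    emu.log.append("destroy")

emu = prepare(["dc1-t0-1", "dc1-t1-1"])
mockup(emu)
for step in ["drain-t1", "upgrade-t1-os", "undrain-t1"]:
    reload_step(emu, step)
    if not validate(emu):        # on failure: revert, fix offline, retry
        reload_step(emu, f"revert:{step}")
        break
destroy(emu)
```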

Mock Up Physical Networks

4.1 Heterogeneous network devices

CrystalNet supports various OSes and software running on network devices. We focus on switches in our datacenter and WAN networks, which include three of the largest switch vendors (referred to as CTNR-A, VM-B and VM-A), and an open source switch OS (CTNR-B). CrystalNet is designed to run transparently with these heterogeneous software systems, be extensible to other device software, and provide unified APIs (§3.3) for users.

CrystalNet chooses containers as the basic format for isolating devices. Containers isolate the runtime libraries with less overhead than VMs, run well inside VMs on clouds, and, more importantly, isolate the virtual interfaces of multiple devices to avoid naming conflicts. We use the Docker engine to manage containers. We address the challenges of running heterogeneous software as explained below.

A unified layer for connections and tools. CrystalNet APIs must work for all devices we want to emulate. However, the heterogeneous device software is packed into different blackbox images by vendors. It is daunting, sometimes infeasible, to re-implement the APIs for each device and ensure consistent behavior. Another engineering challenge is that most containerized switch OSes must boot with interfaces already present, while virtual interfaces can only be put into a container after the Docker container boots.

To address this, we design a unified layer of Physical Network, or PhyNet containers (Figure 4), whose runtime binaries are decoupled from the devices being tested. This layer of containers holds all the virtual interfaces and is connected as the target topology. We place common tools, like Tcpdump, packet injection and pulling scripts, in PhyNet containers. Most CrystalNet APIs are then implemented for these PhyNet containers, instead of being re-implemented for each device. Later, we boot the actual device software with the corresponding network namespace. Thus, the device software runs without any code changes: just like in real life, it starts with the physical interfaces already existing. Even if the software reboots or crashes, the virtual interfaces and links remain. The overhead of running PhyNet containers, which exist only to hold network namespaces, is negligible.
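The PhyNet pattern can be pictured with plain Docker commands. A minimal sketch follows, assuming Docker; the container names, the alpine placeholder image, and `device_image` are illustrative, not CrystalNet's actual tooling:

```python
import subprocess

def boot_device(name: str, device_image: str) -> None:
    """Sketch of the PhyNet pattern: the namespace owner boots first,
    the vendor image joins that namespace afterwards."""
    # 1. PhyNet container: long-lived and nearly empty; its only job is to
    #    own the network namespace and hold the virtual interfaces.
    subprocess.run(["docker", "run", "-d", "--privileged",
                    "--name", f"phynet-{name}", "alpine", "sleep", "infinity"],
                   check=True)
    # 2. (Create veth/VXLAN interfaces inside phynet-<name> here, so they
    #    already exist before the switch OS boots.)
    # 3. The vendor device container reuses the PhyNet namespace: the switch
    #    OS sees its interfaces at boot, as on real hardware, and the
    #    interfaces survive even if this container crashes or restarts.
    subprocess.run(["docker", "run", "-d", "--name", f"device-{name}",
                    "--network", f"container:phynet-{name}", device_image],
                   check=True)
```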

VM-based devices. While some vendors offer containerized images, others, like VM-B and VM-A, offer only VM images of their switch software. We cannot run VM-based device images directly on clouds, because public clouds cannot attach hundreds of virtual interfaces to a VM. In addition, we need to connect these VM-based devices with other containers, and maintain the PhyNet container layer.

Our solution is to pack the VM image, a KVM hypervisor binary, and a script that spawns the device VM into a container image. In other words, we run the device VM inside a container on the cloud VMs. This requires the nested VM feature on the cloud. This feature is available on Microsoft Azure, as well as some other public clouds. In its absence, CrystalNet can provision bare-metal servers for VM-based devices instead.

Real hardware. Finally, CrystalNet also allows operators to connect real hardware into the emulated topology. For example, CrystalNet can mock up a full network with one or more devices replaced by the real hardware. This allows us to test the hardware behavior in a much more realistic environment than the traditional stand-alone testing. Each real hardware switch is connected to a “fanout” switch. The “fanout” switch tunnels each port to a virtual interface on a server. These virtual interfaces are managed by a PhyNet container and are bridged with virtual links (see § 4.2) connecting the CrystalNet overlay.

By introducing PhyNet containers, CrystalNet is able to treat devices identically, regardless of whether they run in containers, VMs or as true physical devices, from the management viewpoint.


4.2 Network virtual links

There are two types of virtual links in CrystalNet, one for the data plane and the other for the management plane.

Data plane. The virtual data plane links should be seen as Ethernet links by the devices and should be isolated from one another. Furthermore, the virtual links must be able to go through underlying networks, including the cloud provider’s network, and the Internet. The ability to travel over Internet in a seamless manner is necessary to allow the emulation to span multiple public clouds, and to allow cloud-based emulations to connect to one or more physical devices.

We choose VXLAN over other tunneling protocols (e.g., GRE) because it meets our goal best – it emulates an Ethernet link, and the outer header (UDP) allows us to connect across any IP network, including the wide area Internet. We can even cross NATs and load balancers, since most of them support UDP.

As shown in Figure 5, each device interface is a member of a veth pair [18], with the other side plugged into a bridge. Each bridge also has a VXLAN tunnel interface (if the remote device is on another VM), or another local veth interface. This is transparent to the device containers. We isolate each virtual link by assigning a unique VXLAN ID to each link. The orchestrator ensures that there is no ID collision on the same VM.
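The per-link plumbing can be reproduced with standard iproute2 commands. Below is a minimal sketch of wiring one end of a virtual link, assuming root privileges on the host VM; interface and bridge names are illustrative:

```python
import subprocess

def sh(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

def wire_link_end(iface: str, phynet_ns: str, vni: int, remote_vm_ip: str):
    """One end of an emulated link (sketch; names illustrative):
    device iface <-veth-> per-link bridge <-VXLAN-> VM hosting the neighbor."""
    peer, br, vx = f"{iface}-br", f"br-{vni}", f"vx-{vni}"
    sh("ip", "link", "add", iface, "type", "veth", "peer", "name", peer)
    sh("ip", "link", "add", br, "type", "bridge")  # STP stays off (the default)
    sh("ip", "link", "set", peer, "master", br)
    # UDP-encapsulated point-to-point tunnel; one unique VNI per link keeps
    # links isolated from each other on the shared underlay.
    sh("ip", "link", "add", vx, "type", "vxlan", "id", str(vni),
       "remote", remote_vm_ip, "dstport", "4789", "dev", "eth0")
    sh("ip", "link", "set", vx, "master", br)
    # Hand the device-facing end over to the PhyNet container's namespace.
    sh("ip", "link", "set", iface, "netns", phynet_ns)
    for link in (peer, vx, br):
        sh("ip", "link", "set", link, "up")
```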

Management plane. Through the years, operators have developed tools based on direct IP access to devices through the management plane which is an out-of-band channel just for management. CrystalNet provisions this management plane automatically (Figure 6). Operators can run their management tools without any modifications, perform incremental configuration changes with the tools, and pull device state, just like in production environments.

CrystalNet deploys a Linux jumpbox, and connects all emulated devices together. However, one cannot simply connect all management interfaces in a full L2 mesh - this would cause the notorious L2 storm in such an overlay. Instead, we build a tree structure – each VM sets up a bridge and connects to the Linux jumpbox via VXLAN tunnels. All emulated devices connect to the bridge of the local VM. Other jumpboxes, like a Windows-based jumpbox, connect to the Linux jumpbox via VPN. Finally, the Linux jumpbox runs a DNS server for the management IPs of the devices.


Organized notes: Network Virtual Links

A virtual link (overlay network) is like a "private tunnel" carved on top of the existing physical network (the underlay) that only specific packets may travel.

In fact, the beauty of VXLAN is that it can "carry Layer 2 across Layer 3"! [explained in detail below]

The encapsulated payload contains the L2 MAC addresses, so a VM believes it still sits under the same switch even after crossing routers

(1) Why do we need virtual links?

In a traditional physical network, two machines that should talk as if they were under one switch must be physically connected. The requirements have changed:

  • Cross-region connectivity: one server in Beijing, one in Shanghai, a WAN in between, yet you want them logically on the same L2 LAN
  • Multi-tenant isolation: on a cloud platform (e.g., Alibaba Cloud, AWS), thousands of tenants share the physical machines, and virtual links carve out an isolated network space per tenant

(2) How are virtual links usually implemented?

  1. GRE: Generic Routing Encapsulation
    • How: wrap the original packet (say, a private-IP packet) in a GRE header, then in a public IP header
    • Traits:
      • Point-to-point: usually connects two fixed sites
      • Protocol-agnostic: it will happily encapsulate IPv4, IPv6, even ancient AppleTalk
  2. VXLAN: Virtual Extensible LAN
    • How: it wraps a Layer 2 (Ethernet) frame inside a Layer 4 UDP packet
    • Traits:
      • VMs feel they share one "switch" even when physically spread across different subnets or even different datacenters
      • VTEP: the endpoint that encapsulates and decapsulates is the VTEP (a hardware switch, or the virtual switch inside a server); short for VXLAN Tunnel End Point
    • Encapsulation format:
      • Inner: the original L2 Ethernet frame (VM MAC addresses)
      • VXLAN header: a tag naming which virtual network (VNI) the packet belongs to
      • UDP header: destination port usually 4789
    • Why the extra UDP layer?
      • Load balancing: routers (L3 devices) doing equal-cost multipath (ECMP) usually hash on UDP port numbers
      • Different VM flows can then take different physical paths instead of piling onto one
  3. (3) Why is VXLAN better than GRE?

In the early VPN era GRE was king, but in large public clouds (AWS, Azure) and hyperscale datacenters its limits show plainly

  1. GRE load-balances very poorly; it effectively defeats ECMP (see the sketch after this list)
    • Modern datacenters speed things up with ECMP: when spreading traffic, switches usually hash the "five-tuple" (SrcIP, DstIP, SrcPort, DstPort, Proto) to scatter packets across physical links
    • Pain point: GRE has no port numbers, so to the switch all tunneled traffic (whichever VM's data is inside) offers only an IP pair in the five-tuple and looks identical. Everything squeezes onto one physical link while the others sit idle
    • VXLAN's trick: thanks to the outer UDP layer, the VTEP hashes features of the inner packet and puts the result in the UDP SrcPort field
  2. GRE gives only rigid point-to-point links, while VXLAN's VNI field natively supports multi-tenant scenarios
    • GRE is fundamentally a point-to-point tunnel over IP; broadly speaking, one host IP maps to one GRE tunnel
    • VXLAN differs: its VNI field supports multiple tenants, giving natural multi-tenant isolation
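A toy hash function makes the entropy argument concrete; the CRC32-based hash and the path count below are illustrative stand-ins for whatever hash a real switch uses:

```python
import zlib

def ecmp_path(five_tuple, n_paths=4):
    """Toy ECMP: hash the outer five-tuple, pick one of n equal-cost paths."""
    return zlib.crc32(repr(five_tuple).encode()) % n_paths

# GRE: no L4 ports, so every packet between the same two tunnel endpoints
# presents the same five-tuple -- all flows collapse onto one path.
print(ecmp_path(("10.0.0.1", "10.0.0.2", None, None, 47)))   # proto 47 = GRE

# VXLAN: the VTEP hashes the *inner* flow into the outer UDP source port,
# so different tenant flows get different outer five-tuples and spread out.
for inner_flow_hash in (33001, 33002, 33003):
    outer = ("10.0.0.1", "10.0.0.2", 49152 + inner_flow_hash % 16384, 4789, 17)
    print(ecmp_path(outer))
```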

(4) Understanding the VXLAN mechanism from the encapsulation angle

The essence of VXLAN is L2 over L3/L4. It does not start wrapping at Layer 4; it stuffs an entire Layer 2 Ethernet frame inside

Top-down, the real structure is:

  • Inner data (payload):
    • App data + TCP/UDP + IP + L2 header
    • Note: this includes a complete original MAC frame header, the L2 header!
  • VXLAN header: carries the VNI (a 24-bit ID)
  • Outer transport (UDP): destination port 4789
  • Outer network (IP): the IP addresses of the physical hosts/VTEPs
  • Outermost physical (L2 + L1): the electrical signal on the physical wire

As a formula:

([App + TCP/UDP + IP + L2] + VXLAN) + UDP + IP + L2 + L1

Inside the brackets is the "passenger"; outside is the "vehicle"
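Scapy can express this layering directly. A small sketch, with made-up addresses and VNI:

```python
from scapy.layers.inet import IP, UDP
from scapy.layers.l2 import Ether
from scapy.layers.vxlan import VXLAN

# The "passenger": a complete inner Ethernet frame, MAC header included.
inner = (Ether(src="00:00:00:aa:00:01", dst="00:00:00:aa:00:02") /
         IP(src="192.168.0.1", dst="192.168.0.2") /
         UDP(sport=1234, dport=80))

# The "vehicle": outer Ethernet/IP between the VTEPs, UDP to port 4789,
# then the VXLAN header carrying the 24-bit VNI.
frame = (Ether() /
         IP(src="10.0.0.1", dst="10.0.0.2") /
         UDP(sport=49321, dport=4789) /
         VXLAN(vni=42) /
         inner)
frame.show()
```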

(5) Understanding the GRE mechanism from the encapsulation angle

GRE is usually called L3 over L3. It normally does not carry the original Layer 2 MAC header

Top-down, the real structure is:

  • Inner data (payload): App data + TCP/UDP + IP
    • Wrapping usually starts straight at the IP layer, with no L2 header!!!
  • GRE header: declares which protocol is encapsulated
  • Outer network (IP): the IP of the physical path
  • Outermost physical (L2 + L1)

As a formula:

([App + TCP/UDP + IP] + GRE) + IP + L2 + L1

See it? Outside GRE there is no UDP layer, just bare IP! This is "naked" encapsulation
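For contrast, the GRE layering in Scapy starts the passenger at the IP layer, with no inner Ethernet header; addresses are again made up:

```python
from scapy.layers.inet import IP, TCP
from scapy.layers.l2 import GRE

# Passenger starts at L3: there is no inner MAC header to carry across.
inner = IP(src="192.168.0.1", dst="192.168.0.2") / TCP(sport=1234, dport=80)

# Vehicle: outer IP straight to the GRE header -- no UDP layer outside.
pkt = IP(src="10.0.0.1", dst="10.0.0.2") / GRE() / inner
pkt.show()
```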

(6) VXLAN versus VLAN

  • VLAN (Virtual Local Area Network):
    • The traditional virtual LAN.
    • It works at Layer 2, tagging Ethernet frames with a 12-bit tag to distinguish networks
    • Essentially: a logical partition inside an L2 switch that cannot cross an L3 network!
  • VXLAN (Virtual Extensible LAN):
    • As the name says, an "extensible" VLAN
    • It builds an L2 network on top of Layer 3 (the network layer); essentially a tunneling technique!

A vivid analogy:

  1. VLAN is like "office partitions":
    • You screen off a few cubicles in one big office. Everyone is under the same roof but cannot see each other
    • But! If you want to reach an office on another floor, the partitions are useless
  2. VXLAN is like an "intercity courier":
    • You pack the entire "office partition" (the L2 frame) into a courier box (UDP/IP), ship it over the highway (the L3 network) to an office in another city, unpack it, and the partition stands there untouched

So the two techniques target completely different scenarios: one is "subnet partitioning", the other is "virtualized cross-domain connectivity"



Mock Up Emulation Boundary

Not important for these notes; skipped.

Implementation

CrystalNet consists of over 10K lines of Python code and includes a few libraries that interact with our internal services and public cloud. We do not customize the switch firmware images we receive from our vendors in any way. In this section, we elaborate on some important implementation details.


6.1 Prepare phase

The Prepare API generates the input for Mockup. It consists of generating the topology and configurations, and spawning the VMs. The only input for Prepare is a list of device host names that operators want to emulate. Then the orchestrator interacts with internal network management services and clouds to execute the following steps.

Generating topology and configurations. For all devices in the input list, CrystalNet identifies their locations in the physical topology and computes a safe boundary. Then CrystalNet pulls all related topology, device configurations and routing state snapshots. All the information is then preprocessed and rearranged in a format that Mockup can understand. The preprocessing mainly includes adding unified SSH credentials to the configurations, parsing and reformatting routing states, etc.

VM spawning. CrystalNet estimates the number of VMs needed, and spawns them on-demand using cloud APIs. This is key to scalability and reducing the cost. The VMs run a pre-built Linux image that includes all necessary software and supported device containers. Additional images may be pulled during runtime using the Docker engine.

The number and type of VMs needed for the emulation depend on several factors. We do not want to spawn too many tiny VMs - this increases the burden on the orchestrator, and can also increase cost. At the same time, we do not want to make each VM too large (and pack a lot of devices on the same VM), since the kernel becomes less efficient at packet forwarding when the number of virtual interfaces is too high. We have also found that container-based devices typically require more CPU, while VM-based devices require more memory. Finally, CrystalNet requires nested VMs (§4.1) for emulating VM-based devices rather than containers. Azure supports this option only for certain VM SKUs. Based on these considerations, we typically build emulations out of 4-core VMs with 8 or 16GB of RAM, although we also use other SKUs under certain conditions.

The Prepare API generates the input for the Mockup phase: it generates the topology and configurations and spawns the VMs. The only input to Prepare is the list of device hostnames the operator wants to emulate. The orchestrator then talks to internal network management services and the cloud to run the following steps:

(1) Generating topology and configurations

For every device in the input list, CrystalNet locates it in the physical topology and computes a safe boundary

It then pulls all related topology, device configuration, and routing state snapshots

All of this is preprocessed and rearranged into a format Mockup understands

Preprocessing mainly includes: adding unified SSH credentials to the configurations, parsing and reformatting routing state, etc.

(2) VM spawning

CrystalNet estimates how many VMs are needed and spawns them on demand through cloud APIs. This is key to scalability and to keeping costs down!

The VMs run a pre-built Linux image containing all necessary software and the supported device containers; additional images can be pulled at runtime via the Docker engine

The number and type of VMs needed depend on several factors:

We do not want to spawn too many tiny VMs, which would burden the orchestrator and could also raise costs

Nor do we want each VM too large (packing too many devices onto one VM), because: the kernel forwards packets less efficiently once the number of virtual interfaces gets too high

We also found:

  1. Container-based devices usually need more CPU
  2. VM-based devices need more memory

Finally, CrystalNet needs nested virtualization (nested VMs) to emulate VM-based devices rather than containers

Azure supports this only on certain VM SKUs.

Given all this, we typically build emulations from 4-core VMs with 8 or 16 GB of RAM, though other SKUs are used in specific situations
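A toy packing routine captures the constraints just described, plus the same-vendor grouping explained in §6.2 below; the per-VM cap is illustrative:

```python
def plan_vms(devices, max_per_vm=10):
    """Toy VM planner: group devices by vendor (containers sharing a kernel
    can conflict across vendors, see §6.2) and cap devices per VM so the
    per-VM virtual-interface count stays manageable. The cap is made up."""
    by_vendor = {}
    for dev in devices:
        by_vendor.setdefault(dev["vendor"], []).append(dev["name"])
    vms = []
    for vendor, names in sorted(by_vendor.items()):
        for i in range(0, len(names), max_per_vm):
            vms.append({"vendor": vendor, "devices": names[i:i + max_per_vm]})
    return vms

print(plan_vms([{"name": f"t0-{i}", "vendor": "CTNR-A"} for i in range(25)] +
               [{"name": "wan-1", "vendor": "VM-B"}]))
```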

6.2 Mockup phase

Mockup is the core part of CrystalNet. The time Mockup takes determines the time and cost overhead of running CrystalNet. Following the design in §4.1, Mockup has two steps. First, it sets up the PhyNet layer and the topology connections. Second, it runs the device software. We aggressively batch and parallelize various operations in the Mockup stage. See §8 for performance numbers. Below are the implementation lessons we learned and decisions we made.

Linux bridge or OVS (Open vSwitch)? Both the Linux bridge and OVS can forward packets and integrate VXLAN tunnels. While CrystalNet supports both, we prefer the former, because we only need "dumb" packet forwarding on the virtual links. The Linux bridge is much faster to set up, especially when CrystalNet configures O(1000) tunnels per VM. For efficiency, we also disable iptables filtering and the Spanning Tree Protocol on bridges.

Running different devices on different groups of VMs. Containers on the same host share the same kernel, which can cause problems. For example, we find that one switch vendor tunes certain kernel settings related to packet checksumming, which can cause collocated devices from other vendors to malfunction. To avoid such problems, CrystalNet typically does not instantiate devices from different vendors on the same VM. In short, we create groups of VMs, with each group dedicated to run devices from a particular vendor.

Health check and auto-recovery. VMs may fail or reboot without warning. CrystalNet includes a health monitor and repair daemon to recover from such failures. The daemon periodically checks the device uptime, and verifies link status by injecting and capturing packets from both ends. If a problem is found, it alerts the user and clears and restarts the failed VM using the APIs described earlier. Since VMs are independent of one another, other VMs do not need to be restarted or reconfigured.
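A skeleton of such a daemon might look as follows; the SSH/tcpdump probe stands in for CrystalNet's real packet injector, and all names are illustrative:

```python
import subprocess
import time

def link_alive(vm: str, iface: str) -> bool:
    """Hypothetical probe: listen briefly on one end of a link; CrystalNet
    instead injects a signature packet at the far end and captures it here."""
    probe = subprocess.run(
        ["ssh", vm, "timeout", "3", "tcpdump", "-c", "1", "-i", iface],
        capture_output=True)
    return probe.returncode == 0

def monitor(links, restart_vm, interval=60):
    while True:
        for vm, iface in links:
            if not link_alive(vm, iface):
                # VMs are independent, so only the failed one is rebuilt;
                # no other VM needs to be restarted or reconfigured.
                restart_vm(vm)
        time.sleep(interval)
```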

BGP speaker at the boundary. Our production network relies on BGP routing. Therefore, we surround the emulation boundary with BGP speakers (§5) based on ExaBGP 3.4.17. It can inject arbitrary announcements, dump the received announcements for potential analysis, and does not reflect announcements to other peers.
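ExaBGP's text API keeps such a passive speaker very small. Here is a sketch of a "process" script that ExaBGP could run, with snapshot routes hard-coded for illustration:

```python
#!/usr/bin/env python3
# Passive boundary speaker: ExaBGP launches this process and relays whatever
# it prints on stdout to the emulated peer. It announces a fixed snapshot
# and never reacts to what the peer sends back.
import sys
import time

SNAPSHOT = [  # (prefix, next-hop, AS path) -- illustrative values
    ("0.0.0.0/0", "10.1.0.1", "65100 65001"),
    ("100.64.0.0/10", "10.1.0.1", "65100 65002"),
]

time.sleep(5)  # give the BGP session time to establish
for prefix, nexthop, as_path in SNAPSHOT:
    sys.stdout.write(
        f"announce route {prefix} next-hop {nexthop} as-path [ {as_path} ]\n")
    sys.stdout.flush()

while True:    # stay alive so ExaBGP keeps the session (and routes) up
    time.sleep(60)
```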

Integrating a P4 ASIC emulator. While the images from the three major vendors come with an ASIC emulator [7], the open source switch OS CTNR-B does not have one. Therefore, we integrate it with the open source P4 behavioral model, BMv2, which acts as the ASIC emulator and forwards packets.


| Category | Systems / Tools | Primary function | Limitations / differences from CrystalNet |
| --- | --- | --- | --- |
| Infrastructure emulation services | EmuLab [2], CloudLab [1] | Define topology/capacity, run real applications | No custom firmware; bounded by the physical infrastructure, cannot scale on demand |
| Data plane emulation | Flexplane [25] | Test ASIC resource-management algorithms | Not aimed at the control plane or configurations; hard to scale out |
| SDN stress testing | Kang and Tao [20] | Test SDN control planes with containers | Focuses on performance, whereas CrystalNet focuses on correctness of SDN and traditional networks |
| Containerized emulators | MiniNet [18], MaxiNet [29] | Container-based emulation on distributed clusters | 1. No multi-cloud / physical-device integration; 2. No support for heterogeneous black-box firmware or management tools; 3. No automatic safe-boundary identification |
| Configuration verification | Formal verification [11-13, etc.] | Check configuration properties with formal techniques | Assumes idealized device behavior; cheap to run, suitable as a pre-check before CrystalNet |
| Data plane verification | Forwarding table verification [19, 21, etc.] | Detect routing problems (black holes, reachability, etc.) | Complementary: CrystalNet supplies emulated forwarding tables for proactive checking |