Evaluation
In this section, we perform software simulations using NS3 [6] and use a hardware testbed equipped with RNICs to evaluate the performance of ConWeave. In particular, we seek to answer the following questions:
(1) How effective is ConWeave's active congestion-evasive rerouting? We compare ConWeave's performance to existing state-of-the-art load balancing algorithms using both simulation and a hardware testbed.
(2) What are the resource requirements of ConWeave in terms of buffer space and per-connection queues?
(3) What is the hardware resource and bandwidth consumption of ConWeave when implemented on a programmable switch such as the Tofino2?
The original paper contains both a Software Simulations section and a Hardware Testbed Evaluations section; here we cover only the Software Simulations.
Software Simulations
We first present the setup for the simulation evaluation.
Network topologies: The topology in the NS3 simulation is a Clos topology [8] with an over-subscription ratio of 2:1. By default, we use a leaf-spine topology, which is common in data center clusters. The topology consists of 8 × 8 leaf-spine switches and 128 servers (16 servers per rack). All links are 100Gbps with 1 µs latency. For the switch model, we enable buffer sharing for flexible buffer allocation using publicly available source code [41]. Each switch has a buffer size of 9MB.
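To make the setup concrete, the following is a minimal sketch of how such an 8 × 8 leaf-spine wiring can be enumerated; the `build_leaf_spine` helper and its naming are ours for illustration, not the authors' NS3 scripts.

```python
# Minimal sketch of the 8 x 8 leaf-spine wiring (hypothetical helper, not the
# authors' NS3 scripts). Nodes are plain integer ids; each link is a tuple of
# (node_a, node_b, rate, delay).

def build_leaf_spine(num_leaf=8, num_spine=8, servers_per_leaf=16,
                     rate="100Gbps", delay="1us"):
    num_servers = num_leaf * servers_per_leaf
    servers = list(range(num_servers))
    leaves = [num_servers + i for i in range(num_leaf)]
    spines = [num_servers + num_leaf + i for i in range(num_spine)]

    links = []
    for l, leaf in enumerate(leaves):
        # 16 x 100G down vs. 8 x 100G up per leaf -> 2:1 over-subscription.
        for s in range(servers_per_leaf):
            links.append((servers[l * servers_per_leaf + s], leaf, rate, delay))
        for spine in spines:
            links.append((leaf, spine, rate, delay))
    return servers, leaves, spines, links

servers, leaves, spines, links = build_leaf_spine()
print(len(servers), len(links))  # 128 servers, 128 + 64 = 192 links
```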
Workloads: Fig. 11 shows several industry data center workloads available in the literature. We use the AliCloud storage [40] and Meta Hadoop [53] workloads in the simulation. The SolarRPC workload will be used in the hardware testbed evaluation. We schedule a flow by randomly selecting a client-server pair and then drawing a flow size from the chosen flow size distribution. Flow arrivals follow a Poisson process, and the average flow arrival rate is used to control the overall traffic load intensity. Due to space limitations, we only show the results for AliCloud storage; the results for Meta Hadoop are shown in Appendix §B.
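As a rough illustration of this flow-scheduling procedure (not the authors' generator), the sketch below draws exponential inter-arrival times for a Poisson arrival process and picks random client-server pairs; `sample_flow_size` and the load-to-rate conversion are our own simplifications.

```python
import random

# Hedged sketch of the flow scheduling described above (not the authors'
# generator). sample_flow_size would draw from the chosen workload's flow
# size distribution (e.g., AliCloud storage); avg_flow_size_bytes is its mean.

def generate_flows(hosts, sample_flow_size, avg_flow_size_bytes,
                   target_load, host_link_bps=100e9, duration_s=0.1):
    # Pick the average arrival rate so that the offered load matches
    # target_load: rate * avg_flow_size (bits) = target_load * host capacity.
    total_bps = len(hosts) * host_link_bps
    arrival_rate = target_load * total_bps / (8 * avg_flow_size_bytes)  # flows/s

    t, flows = 0.0, []
    while t < duration_s:
        t += random.expovariate(arrival_rate)   # Poisson process arrivals
        src, dst = random.sample(hosts, 2)      # random client/server pair
        flows.append((t, src, dst, sample_flow_size()))
    return flows
```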
Transport: We use DCQCN [63], which is the standard congestion control scheme for commodity RDMA NICs [31, 49]. Since the recommended settings in [63] do not fit our setup due to the difference in scale, we choose parameters that give low latency and high throughput, e.g., (K_min, K_max, P_max) = (100KB, 400KB, 0.2), based on the observations in [40]. For the rest of the parameters, we follow the recommendations in the recent Mellanox driver/firmware [50].
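For reference, the (K_min, K_max, P_max) triple parameterizes the RED-like ECN marking that DCQCN relies on at the switch; the sketch below shows the standard marking curve with the values quoted above (an illustration, not the simulator's code).

```python
def ecn_mark_probability(queue_bytes, k_min=100_000, k_max=400_000, p_max=0.2):
    """RED-style ECN marking curve used with DCQCN: no marking below K_min,
    a linear ramp up to P_max between K_min and K_max, and marking with
    probability 1 above K_max. Defaults are the values quoted above."""
    if queue_bytes <= k_min:
        return 0.0
    if queue_bytes >= k_max:
        return 1.0
    return p_max * (queue_bytes - k_min) / (k_max - k_min)

print(ecn_mark_probability(250_000))  # midway between K_min and K_max -> 0.1
```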
Network flow controls: We implement two flow control mechanisms as follows:
- Lossless RDMA - Go-Back-N loss recovery and priority-based flow control (PFC).
- IRN RDMA [44] - Selective Repeat loss recovery and an end-to-end flow control that bounds the number of in-flight packets to 1 BDP (BDP-FC); a sketch of this rule follows the list.
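As referenced in the list above, the following is a minimal sketch of IRN's BDP-FC admission rule; the 8 µs base RTT and 1000-byte MTU are assumed example values, not figures from the paper.

```python
# Sketch of IRN's BDP-FC rule as described in the list above (illustrative,
# not the IRN implementation). The 8 µs base RTT and 1000-byte MTU are
# assumed example values.

def bdp_packets(link_bps=100e9, base_rtt_s=8e-6, mtu_bytes=1000):
    bdp_bytes = link_bps * base_rtt_s / 8        # one bandwidth-delay product
    return max(1, int(bdp_bytes // mtu_bytes))   # ~100 packets for these values

def can_send(next_psn, last_acked_psn):
    # A sender may transmit only while its unacknowledged packets fit in 1 BDP.
    return next_psn - last_acked_psn < bdp_packets()
```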
Schemes compared: We compare ConWeave with ECMP, Conga [11], Letflow [59], and DRILL [23]. For Conga and Letflow, we choose a flowlet time gap of 100 µs. For DRILL, we use the recommended setting DRILL(2,1), i.e., choosing the output port with the smallest queue among 2 random samples and the 1 current port. For ConWeave, the default parameters used are shown in Table 3. Specifically, θ_reply is the timeout value for RTT_REPLY. A smaller value allows more frequent rerouting, but too small a value may result in excessive rerouting; we describe how we find the default value in Appendix B.1. θ_path_busy is the duration for which a congested path is avoided after a NOTIFY is received. Its value is chosen based on the ECN marking threshold: if the threshold is 100KB, then θ_path_busy should correspond to the minimum time required to flush 100KB (e.g., 8 µs for a 100G link). Lastly, θ_inactive is the inactivity period after which a new epoch starts. It must be long enough that a new epoch starts without out-of-order packets; for instance, we use 300 µs for the leaf-spine topology.
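The rule of thumb for θ_path_busy translates directly into a small calculation; the helper below is ours for illustration.

```python
def theta_path_busy(ecn_threshold_bytes=100_000, link_bps=100e9):
    """Minimum time to drain the ECN marking threshold at line rate, the rule
    of thumb quoted above (100KB on a 100G link -> 8 µs)."""
    return ecn_threshold_bytes * 8 / link_bps    # seconds

print(theta_path_busy() * 1e6)  # -> 8.0 microseconds
```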
Performance metrics: As the primary metric, we use FCT slowdown, i.e., a flow's actual FCT normalized by the base FCT when the network has no other traffic. To measure the overhead and effectiveness of ConWeave, we record the number of reorder queues in use per egress port and the reorder-queue memory usage per switch, sampled every 10 µs from all nodes.
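For clarity, the slowdown metric can be written as a one-line computation; the base FCT below is approximated as one base RTT plus serialization time, whereas the evaluation measures it directly on an unloaded network, so treat this as a sketch.

```python
def fct_slowdown(actual_fct_s, flow_size_bytes, base_rtt_s, bottleneck_bps):
    """FCT slowdown = actual FCT / base FCT on an otherwise idle network.
    The base FCT is approximated here as one base RTT plus serialization time;
    the evaluation measures it directly from an unloaded run."""
    base_fct_s = base_rtt_s + flow_size_bytes * 8 / bottleneck_bps
    return actual_fct_s / base_fct_s
```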
4.1.1 Reduction in FCT. We run the simulations at 50% and 80% average traffic load, which represent a moderately and a highly loaded network, respectively. In Fig. 12 and Fig. 13, we show the average and tail FCT slowdowns. In some instances, DRILL's results are omitted because its FCTs are too large to show without substantially changing the plot scale.
For moderate traffic load (i.e., 50%), ConWeave improves the average and 99-percentile FCT slowdowns across all flow sizes by at least 23.3% and 45.8% in lossless RDMA, and 12.7% and 46.2% in IRN RDMA, compared to the other schemes. In a highly loaded network (e.g., 80%), the average and tail FCT improvements are at least 17.6% and 35.8% in lossless RDMA, and 42.3% and 66.8% in IRN RDMA. Our results show that ConWeave is effective in rerouting flows away from congested links and provides significant improvements over the baseline algorithms.
4.1.2 Load balancing efficiency. In this evaluation, we investigate ConWeave's load balancing efficiency. Fig. 14 shows the CDF of throughput imbalance [11] across the 8 uplinks of each ToR switch at 50% and 80% average load. The throughput imbalance is defined as the maximum throughput minus the minimum throughput, divided by the average across the uplinks. We calculate it using snapshots sampled every 100 µs from all nodes. From the result, we observe that, except for DRILL, ConWeave is the most effective at spreading the load across the links. Recall that DRILL performs per-packet switching, resulting in a large number of out-of-order packets. Hence, while it achieves good load balancing among the links, it has poor application performance over RDMA.
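The imbalance metric is straightforward to compute from an uplink throughput snapshot; the helper below is an illustrative restatement of the definition above.

```python
def throughput_imbalance(uplink_throughputs):
    """Throughput imbalance across a ToR's uplinks as defined above:
    (max - min) / mean, computed per 100 µs snapshot. 0 means the load is
    perfectly balanced."""
    mean = sum(uplink_throughputs) / len(uplink_throughputs)
    if mean == 0:
        return 0.0
    return (max(uplink_throughputs) - min(uplink_throughputs)) / mean

print(throughput_imbalance([10, 10, 10, 10, 10, 10, 10, 10]))  # -> 0.0
```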
4.1.3 Hardware resource consumption. Fig. 15 shows the number of queues used per switch egress port. Most of the time, ConWeave needs fewer than 10 queues for reordering regardless of the network load. In the worst case, the number of queues needed does not exceed 15. Given that the number of queues available per egress port on commodity programmable switches ranges from 32 up to 128 [24], ConWeave requires only a fraction of them for reordering.
Fig. 16 shows the total buffer memory usage per switch for packet reordering. In general, ConWeave in lossless RDMA consumes more buffer memory than in IRN RDMA. This is because while the flow control (BDP-FC) in IRN limits the in-flight packets to one BDP, lossless RDMA can keep sending packets of a flow during its packet reordering process, thus consuming more buffer memory. Specifically, in lossless RDMA at 80% network load, the 99.9-percentile and maximum queue overheads are 1.5MB and 2.4MB, respectively. Even so, these numbers correspond to only a fraction of the available buffer space on commodity switching ASICs, which typically offer tens of MBs [1, 5]. We discuss ConWeave's scalability and its alternative design options in §5.
4.1.4 Three-tier Topology. So far, the evaluations have been performed on a two-tier (Clos) topology. In this section, we evaluate ConWeave on a three-tier topology, which introduces more hops and thus potentially longer response times and more cross-traffic variation. We use a fat-tree topology [8] with parameter k = 8 and an over-subscription ratio of 2:1, which involves 256 servers in total (8 servers per rack); the average network load is 60%. In lossless RDMA, we use 8 µs for θ_reply, 16 µs for θ_path_busy, and 1 ms for θ_inactive.
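For context, the sketch below counts the switches and servers of the k = 8 fat-tree; applying the 2:1 over-subscription by doubling the number of servers per edge switch is our assumption about the setup.

```python
def fat_tree_size(k=8, oversub=2):
    """Node counts for a k-ary fat-tree. Applying the 2:1 over-subscription by
    attaching oversub * (k/2) servers to each edge switch (instead of k/2) is
    our assumption; it yields the 256 servers / 8 per rack quoted above."""
    edge = agg = k * (k // 2)          # k pods, each with k/2 edge + k/2 agg switches
    core = (k // 2) ** 2
    servers = edge * oversub * (k // 2)
    return servers, edge, agg, core

print(fat_tree_size())  # -> (256, 32, 32, 16)
```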
We depict the results in Fig. 17. We find that, across lossless and IRN RDMA, ConWeave improves the average and 99-percentile FCT slowdowns by at least 21.4% and 40.8% for short (<1 BDP) flows, and 40.1% and 57.8% for long (>1 BDP) flows, respectively. Overall, ConWeave outperforms the baseline load balancing mechanisms on the 3-tier topology.