Discussion and Related Work¶
Backwards Compatibility: We briefly sketch one possible path to incrementally deploying IRN. We envision that NIC vendors will manufacture NICs that support dual RoCE/IRN modes. The use of IRN can be negotiated between two endpoints via the RDMA connection manager, with the NIC falling back to RoCE mode if the remote endpoint does not support IRN. (This is similar to what was used in moving from RoCEv1 to RoCEv2.) Network operators can continue to run PFC until all their endpoints have been upgraded to support IRN at which point PFC can be permanently disabled . During the interim period, hosts can communicate using either RoCE or IRN, with no loss in performance.
Reordering due to load-balancing: Datacenters today use ECMP for load balancing [23], that maintains ordering within a flow. IRN’s OOO packet delivery support also allows for other load balancing schemes that may cause packet reordering within a flow [20, 22]. IRN’s loss recovery mechanism can be made more robust to reordering by triggering loss recovery only after a certain threshold of NACKs are received.
Other hardware-based loss recovery: MELO [28], a recent scheme developed in parallel to IRN, proposes an alternative design for hardware-based selective retransmission, where out-of-order packets are buffered in an off-chip memory. Unlike IRN, MELO only targets PFC-enabled environments with the aim of greater robustness to random losses caused by failures. As such, MELO is orthogonal to our main focus which is showing that PFC is unnecessary. Nonetheless, the existence of alternate designs such as MELO’s further corroborates the feasibility of implementing better loss recovery on NICs.
HPC workloads: The HPC community has long been a strong supporter of losslessness. This is primarily because HPC clusters are smaller with more controlled traffic patterns, and hence the negative effects of providing losslessness (such as congestion spreading and deadlocks) are rarer. PFC’s issues are exacerbated on larger scale clusters [23, 24, 29, 35, 38].
Credit-based Flow Control: Since the focus of our work was RDMA deployment over Ethernet, our experiments used PFC. Another approach to losslessness, used by Infiniband, is credit-based flow control, where the downlink sends credits to the uplink when it has sufficient buffer capacity. Creditbased flow control suffers from the same performance issues as PFC: head-of-the-line blocking, congestion spreading, the potential for deadlocks, etc. We, therefore, believe that our observations from §4 can be applied to credit-based flow control as well.