Future Internet
  • Review
  • Open Access

24 November 2025

Survey of Intra-Node GPU Interconnection in Scale-Up Network: Challenges, Status, Insights, and Future Directions

1 China Mobile Research Institute, No. 32 Xuanwumen West Street, Xicheng District, Beijing 100053, China
2 China Mobile, No. 29 Finance Street, Xicheng District, Beijing 100033, China
* Author to whom correspondence should be addressed.

Abstract

Nowadays, driven by the exponential growth of parameters and training data of AI applications and Large Language Models, a single GPU is no longer sufficient in terms of computing power and storage capacity. Building high-performance multi-GPU systems or a GPU cluster via vertical scaling (scale-up) has thus become an effective approach to break this bottleneck and has further emerged as a key research focus. Given that traditional inter-GPU communication technologies fail to meet the requirements of GPU interconnection in vertical scaling, a variety of high-performance inter-GPU communication protocols tailored for the scale-up domain have been proposed recently. Notably, due to the emerging nature of these demands and technologies, academic research in this field remains scarce, with limited deep participation from the academic community. Motivated by this trend, this article identifies the challenges and requirements of a scale-up network, analyzes the bottlenecks of traditional technologies like PCIe in a scale-up network, and surveys the emerging scale-up-targeted technologies, including NVLink, OISA, UALink, SUE, and other X-Links. Then, an in-depth comparison and discussion is conducted, and we present our insights into protocol design and related technologies. We also highlight that existing emerging protocols and technologies still face limitations, with certain technical mechanisms requiring further exploration. Finally, this article presents future research directions and opportunities. As the first review article fully focusing on intra-node GPU interconnection in a scale-up network, this article aims to provide valuable insights and guidance for future research in this emerging field, and we hope to establish a foundation that will inspire and direct subsequent studies.

1. Introduction and Background

In recent years, Artificial Intelligence (AI) has achieved groundbreaking progress across diverse domains, from natural language processing (NLP) and computer vision (CV) to scientific computing, driven by the continuous evolution of model architectures and the exponential growth of model parameters and training data. To pursue higher accuracy and broader applicability, large-scale deep learning models such as Large Language Models (LLMs) and Mixtures of Experts (MoE) have emerged. Further fueled by the development of generative AI, these models now feature exponentially growing parameter scales and increasingly complex computational graphs.
This trend toward model scaling, while enabling unprecedented performance leaps, has also drastically escalated the demand for both computational power and memory resources. On the one hand, the computational power demanded by complex training iterations and high-throughput inference tasks frequently surpasses the on-chip processing capability of a single AI processor, such as a Graphics Processing Unit (GPU; for simplicity of expression, this paper uses “GPU” as a generic term encompassing AI accelerators and AI chips, including but not limited to General-Purpose Computing on Graphics Processing Units (GPGPUs), Neural Processing Units (NPUs), Tensor Processing Units (TPUs), and other domain-specific AI accelerators, collectively referred to as XPUs). Distributed parallel training on large-scale GPU clusters has become inevitable. For example, Meta’s LLaMA-2 70B (Billion), a dense model, was trained for approximately 35 days and consumed a total of 1.72 million NVIDIA A100 GPU hours [], and the pretraining of LLaMA-3 took approximately 54 days with 16K H100-80 GB GPUs on Meta’s production cluster [,]. Similarly, DeepSeek-V3, a 671B-parameter MoE model, underwent training on 14.8 trillion high-quality tokens with 2048 NVIDIA H800 GPUs, whose full training lasted about 54 days and required 2.788 million H800 GPU hours []. On the other hand, the parameter scale of LLMs has undergone explosive and continuous expansion, a trend that directly drives the sharp growth of their demand for storage resources, which far exceeds the on-chip storage capacity of a GPU. During training, additional storage is required for gradients and optimizer states, intermediate activation values, and training data, pushing the total memory requirement to several terabytes (TB). By contrast, the memory of a single GPU is typically 24 GB (Gigabyte) to 80 GB, and even a high-end GPU struggles to handle this load. For instance, in Moonshot AI’s Kimi-K2, storing the model parameters in BF16 (Brain Floating Point 16) and their gradient accumulation buffer in FP32 (32-bit Single-Precision Floating Point) requires approximately 6 TB of GPU memory, distributed over a model-parallel group of 256 H800 GPUs []. Given the inherent limitations of a single GPU in meeting the requirements of trillion-parameter model training or high-performance inference, developing an efficiently coordinated multi-GPU system or high-performance GPU cluster has become an inevitable solution to break through these constraints.
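To make this arithmetic concrete, the following minimal Python sketch reproduces the cited ~6 TB figure under the assumption of a model with roughly one trillion parameters; the parameter count is our illustrative assumption, not a figure taken from the cited report.

```python
# Back-of-the-envelope GPU memory estimate for weights plus gradient buffers.
# Hypothetical sketch: the ~1-trillion-parameter count is an assumption used
# only to reproduce the ~6 TB figure quoted in the text.

def weights_and_grad_bytes(num_params: int,
                           weight_bytes: int = 2,   # BF16 weights
                           grad_bytes: int = 4):    # FP32 gradient accumulation
    """Memory needed for parameters plus their gradient accumulation buffer."""
    return num_params * (weight_bytes + grad_bytes)

params = 1_000_000_000_000            # assumed ~1T parameters
total = weights_and_grad_bytes(params)
per_gpu = total / 256                 # model-parallel group of 256 GPUs

print(f"total  : {total / 1e12:.1f} TB")    # ~6.0 TB
print(f"per GPU: {per_gpu / 1e9:.1f} GB")   # ~23.4 GB, before optimizer states
                                            # and activations are counted
```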
As Figure 1 shows, multi-GPU or GPU cluster system implementations can be categorized into two primary paradigms: scale-up and scale-out [,]. Scale-up, a vertical scaling approach, integrates multiple GPU cards within a single server; scale-out, by contrast, adopts horizontal scaling to efficiently interconnect multiple servers. More specifically, scale-up enhances a server’s computational and memory capabilities by integrating as many GPU cards as possible, leveraging intra-node GPU interconnects through point-to-point (P2P) communication. It employs memory semantics (e.g., load/store operations) and targets scenarios with smaller transaction scales within the server node, requiring high memory bandwidth and low-latency interconnection between multiple GPUs in the same server node. Essentially, a multi-GPU system expanded via scale-up can be regarded as an integrated “super GPU” or “giant GPU” possessing ultra high performance. Scale-out, on the other hand, improves the overall performance of GPU clusters by enabling efficient inter-node communication across multiple server nodes. It is primarily used in extreme large-scale deployment or data center scenarios, catering to large-scale data transmission between server nodes, and adopts network or message semantics like send/receive or RDMA (Remote Direct Memory Access) [].
Figure 1. The overview of scale-up and scale-out.
Although multi-GPU systems are the critical infrastructure for overcoming the memory and computing power limitations and enabling efficient training or inference, the total effective computing power of a cluster is not simply the product of single-GPU computing power and cluster scale. As Figure 2 shows, the distributed training of LLMs relies on parallelization strategies such as tensor parallelism (TP), data parallelism (DP), and pipeline parallelism (PP), which partition model parameter tensors, training data, and model layers across GPUs for parallel execution [,]. Meanwhile, extensive parameter synchronization and data transmission are required among GPUs during distributed training or inference, placing extremely high demands on communication efficiency and network performance, for both the intra-node GPU interconnection within a single server node and the inter-node interconnection across multiple server nodes. Additionally, in MoE models, a typical approach is to distribute the experts across different GPUs, as a single GPU cannot store all experts []. During the execution of MoE layers, there is also an intensive need for data exchange among GPUs. In the forward pass of several popular MoE models, the communication among devices accounts for 47% of the total execution time on average []. Due to the substantial communication overhead introduced by distributed training, the cluster’s computing power cannot increase linearly with its size. Interconnect technology becomes the key to breaking through this bottleneck, and the performance of GPU interconnection directly determines the overall performance of the GPU cluster. In this paper, we mainly focus on the intra-node GPU interconnection within the scale-up domain.
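As a simple illustration of why cluster computing power scales sub-linearly, the sketch below applies a naive model in which a fixed fraction of step time is spent on non-overlapped communication; the peak-FLOPS value and the zero-overlap assumption are ours, chosen only for illustration.

```python
# Minimal illustrative model (not a performance model of any real system): if a
# fraction of wall time is pure, non-overlapped communication, the cluster's
# effective compute scales sub-linearly with its size.

def effective_flops(n_gpus: int, peak_flops_per_gpu: float, comm_fraction: float):
    """Aggregate useful FLOP/s when comm_fraction of wall time is communication."""
    return n_gpus * peak_flops_per_gpu * (1.0 - comm_fraction)

peak = 989e12                       # assumed BF16 dense peak of a flagship GPU
for frac in (0.0, 0.25, 0.47):      # 0.47 mirrors the MoE forward-pass figure above
    print(f"comm={frac:.0%}: 1024 GPUs deliver "
          f"{effective_flops(1024, peak, frac) / 1e15:.0f} PFLOP/s effective")
```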
Figure 2. The overview of 3D (dimension) parallelism in distributed LLM training, including tensor parallelism (TP), data parallelism (DP), and pipeline parallelism (PP).
Unfortunately, traditional interconnection technologies fail to meet the emerging scale-up network requirements in the high-bandwidth domain (HBD), with large-scale multi-GPU deployments hindered by bottlenecks rooted in conventional architectures. The Peripheral Component Interconnect Express (PCIe) technology is limited by its master–slave architecture and tree topology centered on the Central Processing Unit (CPU), resulting in low peer-to-peer communication efficiency and restricted interconnection scale []. Moreover, the escalating imbalance between computational power and storage capacity has exposed critical flaws in the conventional PCIe-based “CPU + GPU” interconnection architecture for performance scaling in multi-GPU systems, exacerbating the “memory wall” problem, a key obstacle to scaling multi-GPU systems and improving overall performance []. Specifically, while GPUs outperform CPUs in computational power by a large margin, their memory capacity is typically an order of magnitude lower than that of CPUs. Compounding this, the rapid advancements in CPU and GPU computing performance have far outpaced improvements in memory bandwidth and latency optimization. Compute Express Link (CXL) significantly enhances GPU memory access efficiency by reducing remote memory access latency and improving bandwidth utilization through cache coherence protocols and memory pooling technology, but the technology is still developing, with commercial solutions and ecosystem maturity yet to be improved [,]. Moreover, RDMA-based or traditional Ethernet-based solutions suffer from large protocol header overhead and high end-to-end latency, failing to meet fine-grained communication needs. Proprietary interconnection protocols, due to their closed ecosystems, limit the flexibility of hardware selection and often lack supporting switch chips. Overall, although these interconnection technologies can technically interconnect multiple GPUs in a server node, they are ill-suited for high-performance, large-scale scale-up deployments, constraining efficiency (e.g., throughput, latency) and scalability, and failing to meet modern multi-GPU demands.
Against this backdrop, traditional interconnect technologies can no longer keep up with current requirements, prompting the successive proposal of a series of new scale-up protocols that aim to establish a large-scale, high-performance scale-up domain. NVIDIA proposed the NVLink technology in 2014, and it has been applied in its GPU products with continuous evolution []. AMD has also introduced its Infinity Fabric Link technology in its products to enhance the interconnection capability between GPU computing units since 2017 [,]. Notably, in recent years, alongside the exponential growth in the scale of large models and the new requirements for scale-up interconnection posed by model training and inference, the development of intra-node GPU interconnection technologies has entered a brand-new phase. Apart from proprietary interconnection protocols, open interconnection protocols are also continuously being introduced. China Mobile proposed the Omnidirectional Intelligent Sensing Express Architecture (OISA) in 2023 and recently published its newest version, 2.0 []; the UALink Consortium Promoter Group proposed Ultra Accelerator Link (UALink) in 2024 and published its specification in 2025 []; and Broadcom also publicly introduced its Scale-up Ethernet (SUE) protocol in 2025 [].

1.1. Literature Review and Our Motivations

This work is conducted as a review, aiming to comprehensively explore the intra-node GPU interconnection technologies for scale-up networks while ensuring transparency and reproducibility.
During the research process, we noticed there are many survey papers for RDMA (e.g., [,,]) and systematic research for data center networks (e.g., [,,]) or other hot scale-out technical topics. In contrast, the discussions on intra-node GPU interconnection technologies for scale-up are relatively limited, especially in academia. Instead, the development of these technologies is mostly led by GPU and switch manufacturers in the industry. This scarcity arises from the recent emergence of core demands. The explosive growth of trillion-scale parameter LLMs and AI applications has only taken place in the past few years, rendering the urgent need for ultra-high-performance intra-node interconnection to overcome computing and storage bottlenecks a newly emerging research focus. With the gradual opening of some interconnection protocols and the continuous clarification of application demands, the academic community has begun active participation in related research, but it is still in its early stages.
As shown in Table 1, ref. [] focuses on the evolution of PCIe technology; refs. [,] center their research on CXL technology; refs. [,] focus on RDMA-related technical fields; and ref. [] mainly analyzes and discusses Ultra Ethernet and UALink technologies. All these references are limited to the exploration and review of a single technology. Refs. [,] conduct investigations on technologies such as PCIe, NVLink, and GPUDirect and perform practical tests on related technologies in combination with NVIDIA GPU products. Ref. [] pays more attention to the internal interconnection technologies of multi-GPU systems compared with the aforementioned surveys, covering not only PCIe and NVLink but also other private interconnection protocols, i.e., X-Links (such as AMD Infinity Fabric Link, Intel Xe Link, and BLink). However, with the massive emergence of new open interconnection protocols in the past two years, its content needs further updating and supplementing. Compared with existing surveys, the proposed survey fully focuses on intra-node GPU interconnection and conducts a comprehensive and systematic review and analysis of traditional solutions such as PCIe, CXL, and Ethernet-based approaches, as well as various representative emerging interconnection protocols and technologies.
Table 1. Comparison of the existing surveys.
Given the relative scarcity of academic reviews on emerging intra-node GPU interconnection technologies, we prioritized both publicly accessible academic papers and official industry technical specifications or whitepapers, focusing on works related to intra-node GPU hardware interconnection technologies. The search was conducted between January 2025 and August 2025. We conducted a systematic search across core academic databases, official technical platforms, and authoritative industry organization websites, including IEEE Xplore, ACM Digital Library, arXiv, Google Scholar, SpringerLink, ScienceDirect, and MDPI, as well as official technical portals of key technology providers (e.g., NVIDIA, UALink Consortium, etc.) and relevant standardization groups. Key search terms covered “GPU scale-up”, “multi-GPU system”, “intra-node GPU interconnection”, “GPU-GPU communication”, “PCIe”, “CXL”, “RDMA”, “NVLink/NVSwitch”, “UALink”, “Scale-up Ethernet (SUE)”, “OISA”, and other technical keywords, ensuring no critical public research or open specifications were omitted.
The literature selection strategy considers both temporal relevance and technical alignment. On the one hand, from the temporal perspective, we prioritized works published over the past decade to balance novelty and evolutionary completeness while also incorporating foundational literature and specifications to clarify the technical development context. On the other hand, from the perspective of technical relevance, we strictly included studies focusing on intra-node GPU interconnection technologies and excluded works on pure software optimization, inter-node networking, etc.
This survey focuses on the following core research objectives and scope:
  • Based on the ultra-high-performance inter-GPU interconnection demands of the new AI era, a comprehensive review of intra-node GPU interconnection technologies in the GPU scale-up domain is conducted, systematically surveying and analyzing representative traditional and emerging technologies or protocols.
  • The existing performance bottlenecks of these technologies are analyzed, and future optimization directions are identified.
  • The academic gap in this understudied field is filled, providing a clear reference framework for relevant researchers.
  • In-depth exchanges and discussions between academia and industry are encouraged, thereby facilitating collaborative innovation across the two sectors.

1.2. Contributions

This survey focuses on the intra-node GPU interconnection in the scale-up domain. The key contributions of this work are as follows:
  • We introduced and deeply analyzed the limitations and disadvantages of traditional interconnection technologies such as PCIe in the AI era and the scale-up domain, and discussed the design ideas in these technologies that can serve as references.
  • We introduced and compared representative and influential emerging interconnect technologies targeting the scale-up high-bandwidth domain, and in particular discussed and analyzed the key issues and core considerations in their design.
  • We presented our insights and perspectives on the future challenges and opportunities and the development trends in the field of scale-up interconnection.
As far as we know, this paper is the first comprehensive survey fully focusing on GPU scale-up networks, discussing emerging interconnection technologies and delivering insightful perspectives.
The remainder of this paper is organized as follows. Section 2 analyzes the requirements, challenges, and status of intra-node GPU interconnection in a scale-up network. Section 3 outlines traditional inter-GPU interconnect solutions including PCIe, CXL, and Ethernet-based approaches and analyzes the bottlenecks they encounter. Section 4 surveys emerging representative scale-up protocols, such as NVIDIA NVLink, OISA, UALink, SUE, and other X-Links. Section 5 compares these emerging protocols, discusses their core design considerations, and shares our perspectives and insights across various dimensions. Section 6 assesses the current challenges for emerging methods and explores technical trends and future research directions in the field. Finally, Section 7 concludes the survey.

2. Requirements, Challenges, and Status of GPU Interconnection in Scale-Up Network

2.1. Requirements

Driven by current model and application requirements, GPU interconnection in a scale-up system featuring multi-GPU collaboration within a single node must meet the requirements for high-performance, lossless, and efficient memory access to enable the realization of a “Giant GPU” or a cohesive multi-GPU system.
Extremely High Bandwidth: A scale-up network strives to build a high-bandwidth domain (HBD), which is indispensable to cope with the massive data transfer demands in scenarios such as LLM training and high-performance computation (HPC) [,,,]. Frequent exchanges of parameters, intermediate calculation results, and other data between GPUs depend on sufficient bandwidth; otherwise, bandwidth becomes a bottleneck that leaves the computing power of GPUs underutilized [,]. Compared to the bandwidth of scale-out networks or traditional data center networks, the bandwidth requirement of scale-up networks is an order of magnitude higher []. Specifically, such demands reach 400 Gb/s or even 800 Gb/s per GPU port, and hundreds of gigabytes per second or even terabytes per second per GPU, aiming to adapt to the speed of high-bandwidth memory [,,].
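For intuition, the short sketch below converts the per-port line rates quoted above into byte-oriented bandwidth and aggregates them over an assumed port count per GPU; the eight-port figure is our assumption for illustration, and coding overhead is ignored.

```python
# Rough unit-conversion sketch: per-port line rates vs. per-GPU aggregate
# scale-up bandwidth, to be compared against on-package HBM bandwidth.

def gbps_to_gbytes(gbps: float) -> float:
    """Convert a line rate in Gb/s to GB/s, ignoring coding overhead."""
    return gbps / 8

ports_per_gpu = 8                     # assumed port count, for illustration only
for rate in (400, 800):
    per_port = gbps_to_gbytes(rate)
    print(f"{rate} Gb/s/port -> {per_port:.0f} GB/s/port, "
          f"{per_port * ports_per_gpu / 1e3:.1f} TB/s aggregate over {ports_per_gpu} ports")
```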
Ultra-low Latency: Within the HBD, latency requirements are equally stringent []. The single-hop forwarding latency of a switch must be capped at just a few hundred nanoseconds, while the Round-Trip Time (RTT) between any two GPUs should be maintained within a few microseconds, to ensure global memory access at a near-local speed in the “Giant GPU” [,]. This is crucial for tasks with intensive synchronization needs and the overall performance of GPU clusters []. On the one hand, operations like gradient synchronization in distributed training and collaborative instruction interactions among GPUs are extremely sensitive to latency. Reducing latency can remarkably shorten the overall execution cycle of tasks and enhance real-time performance in inference scenarios. On the other hand, reducing the communication latency between GPUs can prevent the waste and idleness of computing resources caused by GPUs waiting for data to arrive, which is important for the release of cluster performance [].
Scalability: Scalability is essential to accommodate the growing number of GPUs per node, a trend driven by today’s complex workloads [,,]. As computing power demands surge, a single node requires integrating more GPUs to form a high-performance computing unit, making it imperative for interconnection technologies to support scaling up GPU cluster sizes while preserving performance. An overly small scale-up domain will result in a large amount of inter-server communication and protocol conversion. As exemplified by DeepSeek-V3 [], which was trained on NVIDIA H800 servers with 8 GPUs per node, a constrained scale-up scope leads to substantial computational resource (e.g., Streaming Multiprocessors—SMs) waste and excessive programming overhead in protocol conversion and communication guarantees []. Expanding the scope of the scale-up domain enables converting some scale-out communications into scale-up domain interactions. Leveraging its high bandwidth, low latency, and unified addressing capabilities, this approach better boosts the efficiency of applications like model training and inference.
Lossless: Lossless transmission serves as the cornerstone of GPU collaborative computing within the scale-up domain, as reliable data delivery directly underpins the stability and efficiency of multi-GPU coordinated tasks. At its core is the establishment of a robust network infrastructure, designed to guarantee the accuracy and integrity of data transmission within the high-bandwidth domain. This requires avoiding packet loss and errors during data transmission and demands rapid recovery if such issues occur [,]. Otherwise, the system performance will be affected due to data retransmission or packet loss []. Therefore, scale-up networks must incorporate key supporting technologies, including flow control (to regulate data flow and prevent congestion-induced packet loss) and retransmission (to recover lost or corrupted packets promptly), to fully uphold lossless transmission standards and lay a stable foundation for GPU-to-GPU data interaction.
Unified Memory Addressing: Unified memory addressing (UMA) stands as another core requirement for scale-up networks, mandated to resolve critical bottlenecks in collaborative computing efficiency, specifically those stemming from storage resource isolation and the “memory wall” phenomenon [,].
Figure 3 shows three ways for different GPUs within a server node to access memory. Path ① indicates that if a GPU wants to write data to another GPU, it first needs to copy the data from its own memory to the host’s memory and then write it to the storage of the other GPU. Path ② indicates that through the PCIe switch, GPUs can directly transfer data between each other. In Path ③, based on the new interconnection technology’s memory semantic mechanism and UMA mechanism, the unified memory pool is established and GPUs can directly access each other’s storage.
Figure 3. (a) Three conceptual models of GPU memory access. (b) Memory spaces without UMA. (c) Unified memory based on UMA.
Fulfilling the unified memory addressing requirement delivers multi-dimensional benefits for scale-up systems. It enables seamless cross-GPU memory access, allowing each GPU to treat remote memory as if it were local and eliminating the overhead of explicit data copying that often wastes bandwidth and time. It also simplifies the programming model for multi-GPU applications, reducing the complexity of memory management for developers and lowering the barrier to multi-GPU software development. Additionally, it lays a critical hardware foundation for unified memory pooling and dynamic scheduling, supporting flexible allocation of memory resources across the entire scale-up system to match real-time computing demands.
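The following toy Python model is a conceptual sketch of unified memory addressing, not any vendor’s API: a flat global address is decoded into a (GPU id, local offset) pair so that a remote write or read looks like an ordinary store or load rather than a staged copy through the host.

```python
# Conceptual UMA sketch (hypothetical, not a real driver interface): a unified
# address space maps a flat global address to (GPU id, local offset), so remote
# accesses look like local load/store operations.

class UnifiedAddressSpace:
    def __init__(self, num_gpus: int, mem_per_gpu: int):
        self.num_gpus = num_gpus
        self.mem_per_gpu = mem_per_gpu
        # Each GPU's memory is modeled as a plain bytearray.
        self.mem = [bytearray(mem_per_gpu) for _ in range(num_gpus)]

    def _decode(self, addr: int):
        gpu_id, offset = divmod(addr, self.mem_per_gpu)
        assert gpu_id < self.num_gpus, "address outside the unified space"
        return gpu_id, offset

    def store(self, addr: int, data: bytes):
        gpu_id, off = self._decode(addr)
        self.mem[gpu_id][off:off + len(data)] = data   # direct remote write

    def load(self, addr: int, length: int) -> bytes:
        gpu_id, off = self._decode(addr)
        return bytes(self.mem[gpu_id][off:off + length])

uma = UnifiedAddressSpace(num_gpus=8, mem_per_gpu=1 << 20)   # toy 1 MiB per GPU
uma.store(3 * (1 << 20) + 64, b"gradient shard")             # lands on GPU 3
print(uma.load(3 * (1 << 20) + 64, 14))
```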

2.2. Challenges

Building an ultra-high-performance GPU scale-up network entails multiple challenges, with core hurdles centering on reconciling performance bottlenecks, interconnect protocol limitations, interconnection topologies, efficiency, reliability, power consumption, and so on. These challenges are not isolated. Instead, they are deeply intertwined and mutually reinforcing, creating a more complex barrier than any single issue alone. From our current perspective, the two most prominent challenges stem from interconnect protocol design and hardware limits and physical constraints.
Protocol Design: Designing an efficient, feasible interconnect protocol is extremely challenging, as the protocol directly shapes efficiency and performance in the scale-up domain.
The protocol must strike a delicate balance between simplicity and reliability, a dual requirement that adds further complexity to the design [,,]. Firstly, the protocol must prioritize simplicity in packet format and mechanisms to ensure efficient transmission. On the one hand, large protocol headers severely consume bandwidth, leaving a link’s actual throughput far below its physical capacity, especially in scenarios where GPUs exchange massive numbers of frequent small packets [,,]. In such cases, relative to the small payloads, the overhead of headers becomes disproportionately large. For example, a 32-byte header attached to a 128-byte data payload results in a total packet size of 160 bytes, with the header consuming 20% of the link’s capacity, a substantial waste. This waste becomes far more damaging in multi-GPU clusters, where large numbers of such small packets are exchanged across the cluster every second, turning the incremental overhead of each individual packet into a systemic bottleneck that throttles the entire collaborative computing workflow. On the other hand, overly redundant mechanisms, such as complex handshakes or duplicate error checks, must be avoided, as they introduce unnecessary processing delays that undermine the low-latency demands of GPU collaboration. Meanwhile, scale-up interconnect protocols must not overlook transmission reliability. They must dynamically regulate transmission speeds via flow control to avoid congestion or packet loss while also integrating recovery logic to address packet loss or transmission errors.
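The overhead arithmetic in this example generalizes directly; the small sketch below computes the header share of link capacity for several payload sizes, assuming the same 32-byte header.

```python
# Sketch: protocol-header overhead as a function of payload size, reproducing
# the 32-byte-header / 128-byte-payload example above.

def header_overhead(header_bytes: int, payload_bytes: int) -> float:
    """Fraction of link capacity consumed by the header."""
    return header_bytes / (header_bytes + payload_bytes)

for payload in (64, 128, 256, 1024, 4096):
    ratio = header_overhead(32, payload)
    print(f"payload {payload:>4} B -> header overhead {ratio:.1%}")
# payload  128 B -> header overhead 20.0%, matching the example in the text
```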
Moreover, protocol design must account for scalability, a requirement deeply intertwined with interconnect topology and transmission distance as the GPU scale-up domain expands [,]. As the number of GPUs in a scale-up system grows, the protocol must adapt to the evolution of the network topology, from simple, directly connected point-to-point topologies to complex switch-based fully connected architectures. Traditional protocols like PCIe, designed for small-scale, fixed-topology peripheral connections, are ill-suited for large-scale GPU interconnection []. Transmission distance adds another layer of complexity. GPU interconnections span short on-board links (centimeters) to rack-level (meters) or cross-rack (tens of meters) connections, with signals degrading differently across these ranges. The protocol should be flexible enough to balance reliability and efficiency.
Overall, protocol design is crucial yet highly challenging in large-scale GPU interconnect systems. It needs to integrate multiple attributes, including efficiency, reliability, scalability, etc. However, traditional protocols like PCIe and RoCE (RDMA over Converged Ethernet), as they were not purpose-built for large-scale intra-node GPU interconnect scenarios, struggle to meet these comprehensive requirements, ultimately becoming a key bottleneck that limits the improvement in system performance.
Hardware Limits and Physical Constraints: The realization of high performance and scalability is constrained by hardware capabilities and physical boundaries [,,,].
As the primary gateway for data transmission, SerDes (Serializer/Deserializer) at the physical layer poses severe constraints on both speed and density. Designing and implementing high-rate SerDes (e.g., 112 Gbps and 224 Gbps) is extremely challenging due to issues like signal integrity, circuit area, and power consumption []. Even with targeted optimizations, the number of SerDes ports that can be integrated for parallel transmission remains limited by the physical area and power budget of GPU dies or switch chips, directly restricting the maximum direct bandwidth a single device can provide.
Moreover, the scalability and interconnection topology of scale-up networks are tied to hardware limits, particularly the port counts of GPUs or switch chips [,,]. When the number of interconnected GPUs in the scale-up domain exceeds the port capacity of individual GPUs or switches, multiple switches must be deployed, inevitably introducing higher latency, more complex network management, and other complications. Together, these hardware limits and physical constraints form an interconnected set of challenges that directly restrict the scale, throughput, and efficiency of GPU scale-up networks.

2.3. The Development of GPU Scale-Up

From the perspective of development stages, we think that the evolution of GPU scale-up interconnection technology in recent years can be divided into the following three stages.
The first stage is the initial stage, in which the industry generally relied on traditional PCIe or CXL technologies to achieve GPU interconnection [,,,]. However, such traditional solutions soon became a performance bottleneck due to their shortcomings in bandwidth, latency, and interconnection scale [,,].
The second stage is the proprietary-protocol exploration stage, in which GPU manufacturers began to develop their own proprietary interconnection protocols based on PCIe or Ethernet technologies. However, except for NVIDIA’s NVLink, most proprietary protocols can only achieve an interconnection scale of eight cards due to the lack of supporting switch chips, making it difficult to support larger-scale computing power clusters [,,].
The third stage is the large-scale collaborative innovation stage. As the demand for GPU interconnection has become clear, the design of various protocols has also become increasingly mature. GPU manufacturers and switch manufacturers have launched joint research and development to propose new solutions supporting large-scale interconnection [,,]. On the one hand, they generally define new public or standard protocols to achieve interoperability among devices from different vendors, adopt a fully connected topology centered on switches to expand the interconnection scale, and leverage high-speed SerDes technology to boost data rates. On the other hand, they integrate technologies such as Chiplet, IO Die, and optical interconnection to further reduce the power consumption and area of chips while achieving efficient interconnection, adapting to the needs of larger-scale computing power clusters [].

3. Traditional Solutions

3.1. PCIe

Peripheral Component Interconnect Express (PCIe) was first proposed by Intel in 2001, and its development is currently governed by the Peripheral Component Interconnect Special Interest Group (PCI-SIG) []. Over more than two decades of development, it has maintained a consistent iteration rhythm: a major version update approximately every three years, with backward compatibility preserved across iterations. Functionally, PCIe has long served as a universal interface for interconnecting CPUs with GPUs, storage devices, Network Interface Cards (NICs), Data Processing Units (DPUs), Field Programmable Gate Array (FPGA) cards, and so on []. PCIe 6.0 was officially released in 2022, the PCIe 7.0 specification was made available to organization members in 2025, and PCI-SIG aims to release the PCIe 8.0 specification in 2028.
PCIe is a high-speed, serial, point-to-point, dual-simplex bus designed for high-bandwidth transmission. As shown in Figure 4a, a PCIe link comprises multiple lanes, and each lane consists of a pair of differential signal pairs (one for transmission, one for reception), enabling full-duplex communication. The PCIe link can be flexibly configured as ×N (where N denotes the number of lanes, e.g., ×1, ×4, and ×16) to balance cost and performance requirements.
Figure 4. (a) Overview of PCIe link. (b) Comparison of NRZ and PAM4.
Largely driven by the continuous increase in per-lane bit rate across successive versions, PCIe has achieved an approximate doubling of bandwidth every three years. Beyond this, other key factors collectively boost its performance. Its integration has shifted from external chipsets to CPU cores, with support for more lanes. Meanwhile, encoding schemes have been upgraded to reduce transmission overhead and enhance reliability, and mechanisms like Forward Error Correction (FEC) have been introduced to balance latency and bit error rate (BER). Notably, encoding upgrades stand out. Starting from PCIe 6.0, 4-Level Pulse Amplitude Modulation (PAM4) has replaced the earlier Non-Return-to-Zero (NRZ) encoding. As shown in Figure 4b, unlike NRZ, which transmits 1 bit per symbol, PAM4 uses four signal levels to send 2 bits per symbol. This upgrade doubles the data rate to 64 GT/s at the same Nyquist frequency as PCIe 5.0’s NRZ signaling; equivalently, for a given data rate, PAM4 operates at half the frequency, which reduces channel loss and keeps PCIe 6.0’s transmission distance comparable to PCIe 5.0’s. Detailed parameters of different PCIe versions are presented in Table 2.
Table 2. Specification of each generation PCIe.
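The relationship between NRZ and PAM4 signaling described above can be summarized in a few lines of Python; the symbol rate used here is the nominal 32 Gbaud shared by PCIe 5.0 and 6.0, and the Nyquist frequency (roughly half the symbol rate) is unchanged between the two generations.

```python
# Sketch: why PAM4 doubles throughput at the same Nyquist frequency as NRZ.
# The symbol rate (baud) stays the same; bits per symbol go from 1 to 2.

def data_rate_gtps(symbol_rate_gbaud: float, bits_per_symbol: int) -> float:
    return symbol_rate_gbaud * bits_per_symbol

symbol_rate_gbaud = 32                    # same symbol rate for PCIe 5.0 and 6.0
print("NRZ :", data_rate_gtps(symbol_rate_gbaud, 1), "GT/s")   # 32 GT/s (PCIe 5.0)
print("PAM4:", data_rate_gtps(symbol_rate_gbaud, 2), "GT/s")   # 64 GT/s (PCIe 6.0)
```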
Although PCIe has achieved great success as an I/O bus, and its bidirectional bandwidth is up to 256 GB/s from version 6.0, it still faces multiple constraints that limit its application in AI scenarios and the GPU scale-up HBD.
Firstly, the topology and architecture of PCIe-based interconnection hinder the efficiency and scalability of GPU communication. PCIe’s master–slave or root complex (RC) architecture and tree topology block direct P2P communication between GPUs. To maintain backward compatibility with PCI software, PCIe allows only a tree-based hierarchical architecture centered on the CPU, with external devices mounted on the PCIe bus, and prohibits loops or other complex topological structures []. All inter-GPU communication must be relayed via the root complex or multi-level PCIe switches, with no direct path available. In GPU-intensive scale-up scenarios, all traffic must route through the CPU or a central PCIe switch, introducing unnecessary latency and wasting CPU resources. Even when P2P communication is enabled, the interconnection scale is restricted by the limited number of lanes on the CPU or PCIe switch. Even with PCIe switch cascading, large-scale, high-efficiency expansion remains challenging.
Secondly, PCIe struggles with bandwidth mismatches and inefficient memory access. As the proportion of GPUs in multi-GPU systems grows, PCIe bandwidth remains far lower than the bandwidth between CPUs/GPUs and their respective memories. This gap is particularly pronounced in large-scale data movement across GPUs, where high throughput is critical. For current mainstream GPUs designed for AI workloads, the local memory bandwidth between the GPU core and on-board DRAM (Dynamic Random Access Memory), such as GDDR (Graphics Double Data Rate) or HBM (High Bandwidth Memory), has reached a magnitude that far outpaces PCIe. For example, the local memory bandwidth of the NVIDIA H100 SXM5 GPU (with HBM3 memory) reaches 3350 GB/s [], and that of the AMD Instinct MI300X (with HBM3) exceeds 5000 GB/s []. By contrast, cross-GPU memory access over PCIe is extremely inefficient in terms of bandwidth and latency.
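To visualize the gap, the sketch below lists the bandwidth figures quoted above alongside approximate per-direction PCIe x16 rates; the PCIe values are nominal raw rates that ignore protocol overhead, and the MI300X entry is an assumed value consistent with the “>5000 GB/s” statement.

```python
# Sketch: the bandwidth gap between cross-GPU transfers over PCIe and local HBM.
# HBM figures follow the text; PCIe figures assume an x16 link, raw rate only.

bandwidth_gb_s = {
    "PCIe 5.0 x16, per direction": 64,
    "PCIe 6.0 x16, per direction": 128,
    "NVIDIA H100 SXM5 HBM3 (local)": 3350,
    "AMD Instinct MI300X HBM3 (local)": 5300,   # assumed, ">5000 GB/s" per the text
}
for name, bw in bandwidth_gb_s.items():
    print(f"{name:<34} {bw:>5} GB/s")
print(f"HBM3 vs. PCIe 6.0 x16: roughly {3350 // 128}x gap in a single direction")
```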
Furthermore, PCIe has not established a true shared memory pool for multi-GPU setups and lacks common coherence. Each GPU’s memory is independent with isolated address spaces, precluding unified memory addressing. When GPUs access shared data, coherence issues arise, requiring manual software-managed synchronization that further reduces efficiency.
These factors make it difficult for traditional PCIe interconnection technology to support efficient interconnection and the expansion of the scale-up high-bandwidth domain. Nonetheless, PCIe remains an indispensable core protocol and the most widely adopted interface standard for communication between GPUs and CPUs in current computing systems. Its mature ecosystem, backward compatibility, and cost-effectiveness ensure it continues to play a key role in the scale-up domain.

3.2. CXL

Due to PCIe’s limitations in storage expansion and memory coherency, Compute Express Link (CXL) was proposed in 2019, with its Version 3.0 protocol officially released in 2022. As a high-speed cache-coherent interconnect protocol, CXL is designed for processors, memory expansion, and accelerators, with a core goal of addressing key bottlenecks in high-performance computing, including constrained memory capacity, insufficient memory bandwidth, and high I/O latency. CXL is built on top of the PCIe physical layer, which not only leverages the mature PCIe hardware ecosystem but also enables reliable cache and memory coherency across devices. By tapping into the widespread availability of PCIe interfaces, CXL further enables seamless memory sharing across a diverse set of hardware components, such as CPUs, NICs, DPUs, GPUs, other accelerators, and memory devices [].
CXL provides a rich set of protocols that include I/O semantics similar to PCIe (i.e., CXL.io), caching protocol semantics (i.e., CXL.cache), and memory access semantics (i.e., CXL.mem) over a discrete or on-package link []. CXL leverages the PCIe physical data link to map devices, referred to as endpoints (EPs), to a cacheable memory space accessible by the host. This architecture allows for compute units to directly access EPs via standard memory requests [], while CXL memory further supports cache-coherent access through standard load/store instructions.
In theoretical analyses and memory simulation or emulation studies, CXL appears to be a promising memory interface technology for addressing PCIe’s shortcomings in memory access. However, many tests on real CXL memory expansion devices show that it is difficult for CXL to meet the memory demands of the AI era. For example, the study in [] shows that the performance gains from tensor offloading based on CXL memory are very limited. This is because tensor copies between CXL memory and GPU memory have to traverse a longer data path than memory accesses directly from the CPU, and they are also bottlenecked by the PCIe physical layer.
Although CXL can support certain AI scenarios, it is more of a design for disaggregated memory access in computing than a specialized AI network solution. It enables the GPU to access shared or pooled memory with high efficiency rather than directly enhancing point-to-point data transmission between GPUs. Furthermore, its design, built on and compatible with PCIe, renders it rather complex and heavyweight, and it fails to meet the demands of the scale-up HBD network.

3.3. Ethernet-Based Solutions

Ethernet-based approaches, particularly RDMA (e.g., RoCE, RDMA over Converged Ethernet), have also been widely adopted for a time. However, neither RoCE nor standard Ethernet is designed for GPU scale-up networks, and both face critical limitations stemming from inefficient communication workflows and excessive protocol overhead. RoCE exemplifies these challenges.
On the one hand, RoCE relies on message semantics, with communication workflows involving multi-stage interactions between Work Queue Elements (WQEs), doorbell signals, and NICs [,]. Before data is finally transmitted, GPU cores must first copy data, store WQEs, await acknowledgments, and trigger doorbells, each step incurring microsecond-scale latency. Compounding this, the workflow is tightly bound to the host-side PCIe bus, whose bandwidth and latency constraints further degrade communication efficiency as GPU counts scale. While parallelizing requests can partially offset latency, it introduces prohibitive memory overhead (e.g., inflated send buffers), undermining scalability.
On the other hand, packet protocol overhead poses significant challenges for scale-up networks. Ethernet’s inherent frame structure (including preamble, header, and FCS fields) introduces baseline overhead, and RDMA’s control headers (for addressing, Queue Pair management, etc.) further expand packet size []. Although RoCEv2 excels at large data transfers in scale-out networks, it incurs non-negligible overhead in scale-up scenarios. Here, frequent small transactions and fine-grained parameter exchanges during multi-GPU synchronization exacerbate the header-to-payload ratio, wasting bandwidth and increasing end-to-end latency.
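As a rough quantification of this overhead, the sketch below sums the standard fixed per-packet fields of a RoCEv2 packet (IPv4 without options, no VLAN tag, preamble and inter-frame gap included) and computes wire efficiency for several payload sizes; exact values vary with header options.

```python
# Sketch: fixed per-packet overhead of a RoCEv2 packet (cf. Figure 6g) versus
# payload size, using standard field sizes.

ROCEV2_OVERHEAD = {
    "preamble+SFD": 8, "Ethernet": 14, "IPv4": 20, "UDP": 8,
    "IB BTH": 12, "ICRC": 4, "FCS": 4, "inter-frame gap": 12,
}
fixed = sum(ROCEV2_OVERHEAD.values())          # 82 bytes on the wire

for payload in (64, 128, 256, 4096):
    eff = payload / (payload + fixed)
    print(f"payload {payload:>4} B -> wire efficiency {eff:.1%}")
# Small synchronization messages spend a large share of bandwidth on headers.
```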
Table 3 compares these three traditional solutions; none of them fully meets the demands of current large-scale intra-node GPU interconnection. PCIe is limited in scalability, CXL’s ecosystem is still developing, and RoCE suffers from higher latency and is better suited for cross-node interconnection. Together, their inherent constraints highlight the need for more optimized interconnect solutions for large-scale GPU clusters. Consequently, for the evolving and highly specialized AI landscape, new intra-node GPU interconnection solutions specifically designed for scale-up are urgently needed.
Table 3. Comparison of intra-node GPU interconnect technologies.

4. Emerging Representative Solutions

As demands for computing, memory, and cross-GPU communication continue to rise, the aforementioned traditional technologies can no longer meet the requirements of GPU interconnection. Therefore, a series of new protocols and technologies have emerged, which are reshaping the scale-up HBD. The scale-up protocol and its core technologies have now become a key enabler for breaking through performance bottlenecks and supporting complex AI scenarios. Among these emerging protocols, several have already gained considerable industry influence, backed by robust technical specifications and product support systems.
In this section, we will briefly introduce some representative scale-up technologies and protocols, including NVLink, OISA, UALink, SUE, and other X-Links, covering the protocol stack structure, packet format, interconnection topology, and other key technical features. In the next section, we will explore and discuss the design ideas of these emerging scale-up interconnection technologies.

4.1. NVLink

As an advanced interconnect technology and communication protocol, NVLink is redefining the paradigms of communication and cooperation between GPUs, and even between CPUs and GPUs []. NVIDIA first proposed NVLink in 2014, and NVLink 1.0 was introduced in its P100 GPU in 2016. By 2024, NVLink had evolved to its fifth generation, deployed in the GB200/GB300 GPUs featuring the Blackwell architecture [,]. Notably, starting with NVLink 2.0, the NVLink Switch (NVSwitch) was integrated into intra-node GPU interconnections, marking a key breakthrough in expanding the scale of GPU interconnects and laying the foundation for more efficient data transmission. The combination of NVLink and NVSwitch brings significantly higher communication bandwidth and lower transmission latency, which in turn substantially enhances the overall performance and efficiency of the multi-GPU system.
As Figure 5 shows, the protocol architecture of NVLink is layered and divided into the transaction layer (TL), data layer (DL), and physical layer (PHY), and NVLink adopts a P2P architecture and serial transmission to implement GPU-GPU communication. Similar to the PCIe link structure, each NVLink link comprises a pair of sub-links, and each sub-link handles one transmission direction and contains eight differential signal pairs. As illustrated in Table 4, NVLink has undergone iterative improvements in lane speed, lane count, and signaling mode, with each generation approximately doubling the link speed. Now, the P2P bidirectional bandwidth between GPUs has increased from 160 GB/s in NVLink 1.0 to 1800 GB/s in NVLink 5.0 [].
Figure 5. Architecture of NVLink Protocol [].
Table 4. Specification of each generation NVIDIA NVLinks and NVLink Switches.
As shown in Figure 6a, NVLink adopts the Flit (Flow Control Digit) mode to organize its data for transmission. Each NVLink packet consists of 1∼18 flits, each of which contains 128 bits of data. This design enables flexible transmission of variable-sized data. Besides the header flit, an address extension (AE) flit and a byte enable (BE) flit are used to optimize data transmission efficiency by retaining the static bits and transmitting only the changing bits. The header flit is structured with a 25-bit cyclic redundancy check (CRC) field, an 83-bit transaction field, and a 20-bit data link field []. The transaction field includes critical information such as transaction type, address, flow control bit, and tag identifier, while the data link field carries packet length, application number labels, and acknowledgment identifiers.
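Based on the flit sizes above, the following sketch estimates the payload share of an NVLink packet for a few illustrative header/data splits; the specific splits are our assumptions, since the exact per-packet flit composition depends on the transaction type.

```python
# Sketch: payload efficiency of an NVLink packet built from 128-bit (16-byte)
# flits. A packet carries one header flit plus optional AE/BE flits and data
# flits (1-18 flits total); the splits below are illustrative only.

FLIT_BYTES = 16

def packet_efficiency(data_flits: int, overhead_flits: int = 1) -> float:
    """Share of the packet's bytes that are payload data."""
    total = data_flits + overhead_flits
    assert 1 <= total <= 18, "an NVLink packet is 1-18 flits"
    return data_flits * FLIT_BYTES / (total * FLIT_BYTES)

print(f" 1 data flit , header only: {packet_efficiency(1, 1):.1%}")    # 50.0%
print(f"16 data flits, header only: {packet_efficiency(16, 1):.1%}")   # 94.1%
print(f"16 data flits, header + AE: {packet_efficiency(16, 2):.1%}")   # 88.9%
```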
Figure 6. The packet format overview of (a) NVLink 1.0 Flit, (b) OISA 2.0 TLP, (c) UALink 1.0 Packed TL Flits, (d) UALink 1.0 DL Flit (640B), (e) SUE RM103 Packet, (f) PCIe Packet, (g) RoCEv2 Packet.
In practice, NVIDIA has deployed different generations of NVLink across its GPU product lines, paired with evolving interconnection topologies. Early systems like the P100-DGX-1 (NVLink 1.0) [] and V100-DGX-1 (NVLink 2.0) [] used a hybrid cube-mesh topology for GPU interconnection. While these implementations significantly boosted peer-to-peer (P2P) bandwidth, scalability was constrained by the number of links per GPU, limiting the interconnection scale. To address this, NVIDIA introduced the NVSwitch chip starting with the NVLink 2.0-based DGX-2 system [], enabling a switch-based fully connected topology.
The introduction of NVSwitch has brought a critical breakthrough to GPU interconnection at the topological level. First, with its increased number of NVLink ports, it successfully overcomes the previous limitation of relying on the number of links per GPU, enabling the construction of a non-blocking fully connected topology. This design allows all GPUs within a server to interconnect directly without routing through intermediate GPUs or CPUs, delivering a qualitative improvement in system scalability and aggregate bandwidth. As shown in Table 4, the latest NVSwitch (Gen 4.0) combined with NVLink 5.0 supports fully connecting up to 72 GPUs in its NVL72 products and is pursuing fully connecting up to 572 GPUs, completely breaking the scale constraints of early topologies. Meanwhile, it achieves a P2P bandwidth of 1800 GB/s and a total aggregate bandwidth of 130 TB/s, further validating the practical value of this scalability innovation. Building on this foundation, NVSwitch has further driven optimization of system performance. On the one hand, it can bypass traditional CPU-based resource allocation and scheduling mechanisms, and by integrating technologies such as Unified Virtual Addressing (UVA) [], it endows the NVLink Graphics Processing Cluster (GPC) with the capability to directly access local and cross-GPU HBM2 memory. This enables direct data exchange between GPUs without redundant data copying, significantly reducing latency and resource consumption caused by CPU-mediated transfers. On the other hand, its integration of NVIDIA SHArP technology provides hardware-level acceleration for collective communication operations such as All-Gather, Reduce-Scatter, and Broadcast Atomics, further reducing latency and boosting throughput in multi-GPU collaboration scenarios [,,].
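The quoted aggregate figure is consistent with the per-GPU bandwidth, as the short check below shows; this is simply one way to read the two numbers together, not an official derivation.

```python
# Sketch: the ~130 TB/s aggregate NVL72 figure follows from multiplying the
# per-GPU bidirectional P2P bandwidth by the GPU count.

gpus = 72
p2p_bidir_gb_s = 1800
aggregate_tb_s = gpus * p2p_bidir_gb_s / 1000
print(f"{gpus} x {p2p_bidir_gb_s} GB/s = {aggregate_tb_s:.1f} TB/s")   # ~129.6 ≈ 130 TB/s
```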
In general, NVLink addresses the bottleneck limitations of traditional PCIe in scale-up GPU interconnection and significantly enhances the interconnect capability between multiple GPUs. NVIDIA has also rolled out a series of SuperPod server products based on NVLink technology, further validating the success of this technology. Additionally, NVIDIA has introduced NVLink Fusion [], which enables the integration of CPUs or switches from other manufacturers as hardware-based products within rack enclosures via NVLink C2C (Chip-to-Chip) IP (Intellectual Property) or NVLink IP. However, the specific technical details of NVLink remain undisclosed. Against this backdrop, a range of alternative interconnect technologies have been developed.

4.2. OISA

Omnidirectional Intelligent Sensing Express Architecture (OISA) [] was proposed by China Mobile in 2024 and evolved to version 2.0 in 2025. It aims to build an efficient, intelligent, flexible, and open intra-node GPU interconnection system, supporting data-intensive AI applications such as large model training, inference, and high-performance computing.
As Figure 7 shows, the protocol stack of OISA comprises three functional layers, the transaction layer (TL), the data layer (DL), and the physical layer (PL). Positioned at the top of the protocol stack, the transaction layer interfaces with the GPU’s NOC (Network on Chip). Upon receiving transaction-related signals transmitted from the GPU NOC, such as the transaction type, target GPU, address information, and transaction data, the TL maps and encapsulates this information into OISA transaction layer packets (TLPs), which are subsequently forwarded to the data layer. The DL inserts CRC fields into the received TLPs to enable error detection. OISA’s physical layer adopts an Ethernet-based implementation, leveraging mature Ethernet physical-layer technologies like SerDes to support reliable and high-speed data transmission between GPU devices.
Figure 7. Protocol stack of OISA.
OISA defines three independent memory access modes, namely, the Extreme-performance Access (EPA) mode, the Intelligent Sensing Access (ISA) mode, and the Collective Communication Acceleration (CCA) mode.
The EPA mode takes minimum single-hop forwarding latency and maximum payload efficiency as its core objectives. In a switching topology architecture, switch chips can adopt a streamlined processing pipeline that executes only essential steps such as packet parsing, address lookup, and port forwarding to achieve line-speed, low-latency forwarding. This mode is primarily designed for latency-sensitive memory semantic communication transactions.
Beyond this, OISA also emphasizes intelligent collaboration between GPUs and between GPUs and switches in the scale-up domain. In the ISA mode, the OISA processing engine (OPE) of the sender GPU inserts an intelligent sensing tag after the protocol field, allowing for the sensing and collection of critical state information of the switch and the destination GPU along the transmission path. The receiver GPU provides feedback as needed based on pre-configured alert mechanisms and feedback packet formats, thereby providing key state information and bottleneck locations to help the source GPU adjust subsequent data transmission strategies, including reducing the sending rate or changing the forwarding path. In the CCA mode, OISA offloads a subset of computing tasks to the switch chip, which then performs data calculations and collective communication operations like AllReduce and returns results directly to the GPUs. In this mode, switch chips no longer serve merely as simple packet forwarding nodes but rather become integral components of the computing system. This approach reduces communication traffic among GPU nodes generated by interactive computing data and intermediate results while also shortening overall task processing latency to achieve acceleration. Ultimately, this boosts the overall performance of the GPU cluster.
The packet format of OISA is shown in Figure 6b, which retains the protocol type field at the position of the 13th and 14th bytes to ensure compatibility with Ethernet forwarding, with other functional fields arranged in the regions preceding and following this field. The core components of an OISA packet include its TLP header, protocol type field, memory address, data payload, and checksum field. The TLP header of OISA contains two categories of key information, forwarding-related fields such as the source GPU identifier (SRC GPU ID), destination GPU identifier (DEST GPU ID), and virtual channel (VC), etc., as well as transaction processing-related fields like the transaction type, Transaction Tag ID, packet length, etc. In contrast to other flit-mode designs, OISA imposes no specific constraints on the payload length. Instead, it supports a variable-length payload, with its exact size indicated by the packet length field in the header explicitly. Notably, unlike other protocol packets, OISA balances standard forwarding between GPU chips and switch chips from different vendors with accommodation for the differences in architectures of various GPUs or their transaction operations. It specifically reserves several user-defined fields (UDFs) in the TLP header to enable users to implement customized extension designs. Based on the packet format of EPA, OISA designs the ISA packet and CCA packet with optional extension fields including the sensing tag fields and CCA Extend fields, respectively, and dedicated protocol types or signaling bits are used to indicate such variable packet structures and recognize ISA packets or CCA packets. To further increase the data payload ratio, OISA proposes aggregating multiple small transaction packets of the same type at the transaction layer. The aggregated TLP carries multiple small data packets with a single header, thereby spreading the fixed protocol overhead across multiple payloads and further boosting network bandwidth utilization.
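The benefit of this aggregation can be sketched with a simple model in which several small same-type transactions share one TLP header; the 32-byte header size is an assumed value for illustration, and per-transaction descriptors inside the aggregated TLP are ignored.

```python
# Sketch of the aggregation idea described above: carrying several small
# same-type transactions under a single header amortizes the fixed overhead.
# The 32-byte header is an assumed value, not taken from the OISA specification.

HEADER = 32          # assumed fixed header size in bytes, for illustration only

def efficiency(payload_bytes: int, txns_per_tlp: int) -> float:
    """Payload share when txns_per_tlp small transactions share one header
    (per-transaction descriptors inside the aggregated TLP are ignored)."""
    return (txns_per_tlp * payload_bytes) / (HEADER + txns_per_tlp * payload_bytes)

for n in (1, 2, 4, 8, 16):
    print(f"{n:>2} x 64 B transactions per TLP -> efficiency {efficiency(64, n):.1%}")
```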
In terms of interconnect topology, OISA supports multiple topological structures, including direct interconnection via full-mesh topology and various interconnect topologies based on switching chips. Considering scalability and interconnect efficiency, it primarily adopts a switch-based single-layer fully connected topology, with forwarding and routing via GPU ID. By establishing a unified memory space within the scale-up domain, OISA enables efficient memory access among GPUs. Regarding communication modes, OISA mainly supports two types of memory semantics, synchronous memory semantics based on load/store and asynchronous memory semantics based on direct memory access (DMA), adapting to different granularities and requirements. To establish a lossless scale-up network, OISA introduces a flow track to sense the state of the network, priority-based flow control (PFC) and buffer-aware flow control mechanisms for traffic control, and data-layer retransmission (DLR) technologies at the data link layer, collectively ensuring lossless and efficient data transmission. Furthermore, at the physical layer, OISA leverages Ethernet-based SerDes to enable high-speed transmission. It employs Lite-FEC (RS272) and reduces interleaving ways to minimize latency, thus forming an efficient physical link layer.
In summary, as a GPU interconnect protocol designed for scale-up scenarios, OISA achieves efficient data transmission under GPU expansion of different scales through multi-dimensional technical optimizations. It not only focuses on efficient packet design and packet aggregation to reduce transmission overhead but also places particular emphasis on in-depth collaboration between GPUs, and between GPUs and switches, specifically in terms of traffic control and data computing. Meanwhile, while maintaining compatibility with the Ethernet physical layer, OISA exhibits flexibility in its upper protocol layers, enabling it to break through bandwidth and latency bottlenecks and adapt to various scenarios of multi-GPU clusters.

4.3. UALink

Ultra Accelerator Link (UALink), an emerging standard for high-performance scale-up interconnects targeting next-generation AI workloads, was initially proposed in 2024, and the UALink 200G 1.0 specification was officially released in April 2025 []. The specification was initially developed by the UALink Consortium Promoter Group, whose members include leading enterprises spanning the technology, cloud computing, semiconductor, and system solution sectors, such as Alibaba, AMD, Astera Labs, and Intel. It is now also supported and adopted by more than 70 contributor and adopter members [,].
As Figure 8 shows, UALink 200G (UAL200) employs a layered architecture comprising the UALink Protocol Layer (UPL), transaction layer (TL), data layer (DL), and Ethernet PHY layer (PL), and it provides symmetric sending and receiving paths. The first three layers (UPL, TL, and DL) are specific to UALink, while the PHY layer is compatible with the standard Ethernet ecosystem. Each layer has its distinct functions and works in coordination with the others, achieving low-latency and high-reliability data transmission between accelerators. At the protocol layer, the UALink Protocol Layer Interface (UPLI) serves as the interface of the UALink stack; it consists of two pairs of Originator and Completer components across inbound and outbound channels, linking accelerators to the TL. The TL bridges the UPLI and DL and is responsible for the conversion between UPLI messages and TL flits, and between TL flits and DL flits. The DL sits between the TL and the PL, providing core functions such as the message service between link partners (including rate advertisement, device and port ID query, etc.), DL flit packing, link-level replay, and CRC computation and validation. The UAL200 physical layer (PL) is based on the IEEE P802.3dj and IEEE 802.3 physical layer specifications []. The PL supports multiple channels and rates and transmits DL flits after serialization and encoding.
Figure 8. (a) UALink Protocol Stack, (b) UALink Protocol Features [].
Different from traditional Ethernet-based packets with a hierarchical packet header structure or the NVLink TLP flit mode, UALink carefully designs a set of flit-based packet formats, and the packet format varies across layers, especially in the TL and DL. In the UPLI, a set of channels transmits the original transaction information. In the transaction layer, the transactions are packed into TL flits. As shown in Figure 6c, each TL flit has a fixed length of 64 bytes, and a TL half-flit is fixed at 32 bytes. Based on UPLI signal types, TL flits are categorized into control half-flits and data flits. The former carry control information such as requests, responses, and flow control (FC), while the latter carry the payload data and the byte mask. Each TL control half-flit may contain one or multiple request, response, or FC fields, and the corresponding data flits are arranged in order after the control half-flit. In the data layer, multiple 64-byte TL flits are repackaged into 640-byte fixed-size DL flits (Figure 6d), enabling accurate alignment with the RS (544,514) codeword in the PHY. Meanwhile, a 4-byte CRC field and a 3-byte DL header are appended to each DL flit, with five 1-byte segment headers inserted at fixed positions within the DL flit.
Beyond flit length, UALink also imposes specific requirements on the placement of control half-flits and data half-flits, as well as the maximum number of requests in each half-flit and the maximum number of data flits in each transaction. For example, a single write request or read response supports a maximum payload length of 256 bytes, equivalent to four TL flits or eight half-flits. Moreover, the requests or responses in a control half-flit can be compressed, and requests or responses destined for multiple destinations can also be packed into the same flit. Benefiting from header compression and efficient packing, UALink achieves extremely high data transmission efficiency. In TL flits, when 20 out of every 21 flits are data flits, the data efficiency can reach up to 95.2% []. In the data layer, only 12 bytes (3 bytes for the flit header, 5 bytes for the segment headers, and 4 bytes for the CRC) are used for DL functions, enabling a data link efficiency of up to 98.125% (628/640) [].
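To make the arithmetic behind these efficiency figures explicit, the short Python sketch below reproduces them from the flit sizes quoted above; it is an illustration of the quoted numbers only, not an implementation of the UALink specification.

```python
# Illustrative arithmetic only: reproduces the efficiency figures quoted for
# UALink 200G above; the constants come from the text, not from a protocol engine.

TL_FLIT_BYTES = 64          # fixed transaction-layer flit size
DL_FLIT_BYTES = 640         # fixed data-layer flit size
DL_OVERHEAD = 3 + 5 + 4     # DL header + five 1-byte segment headers + CRC

# Transaction-layer efficiency when 20 of every 21 flits carry data.
tl_efficiency = 20 / 21
print(f"TL data efficiency: {tl_efficiency:.1%}")      # ~95.2%

# Data-layer efficiency: payload bytes per 640-byte DL flit.
dl_efficiency = (DL_FLIT_BYTES - DL_OVERHEAD) / DL_FLIT_BYTES
print(f"DL efficiency: {dl_efficiency:.3%}")           # 98.125% (628/640)

# A 256-byte write payload spans four TL flits (eight half-flits).
print(256 // TL_FLIT_BYTES, "TL flits per 256-byte payload")
```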
Based on UALink switches, UALink adopts a single-layer multi-plane switching architecture to build a large-scale HBD, which supports up to 1024 GPUs in the scale-up domain and up to 800 Gbps per port. The number of switch planes scales with GPU bandwidth for flexible expansion, and port rates can be adapted to demand. In the established scale-up domain, UALink enables direct communication among GPUs with memory sharing. From a memory semantic perspective, UALink uses small packets for low-latency load/store/atomic memory accesses and finely interleaves memory channels (e.g., with 256 B granularity) to maximize bandwidth to both local and peer GPU memory.
Furthermore, UALink introduces multiple key mechanisms to enable efficient and reliable data transmission []. To prevent congestion or packet loss, it employs ready-valid handshakes, credit-based flow control (CBFC), and rate pacing/adaptation, ensuring smooth data interaction. Meanwhile, UALink also provides end-to-end protection for both control and data traffic from the source GPU to the destination GPU, where parity checking, CRC, and FEC are employed at its TL, DL, and PL, respectively. Even if packet loss or data corruption occurs on the link, Link Layer Retransmission (LLR) immediately retransmits the missing or corrupted data flits to guarantee a lossless network. Moreover, UALink simplifies Ethernet-based FEC interleaving to reduce latency in the PHY.
Overall, through multiple technologies including packet design, switch-based topology, and physical layer optimization, UALink enables high-performance, low-latency data transmission and provides an open solution for building a large-scale HBD for GPU interconnection. In particular, by leveraging refined flit design and packing rules in the transaction layer and data layer, it achieves extremely high packet transmission efficiency, and the DL flit size is precisely aligned with the FEC encoding in the physical layer. However, the specialized protocol stack interfaces may have compatibility problems with the existing interfaces of some GPUs. Additionally, although the flit design and efficient packing improve the overall data transmission performance, the sophisticated flit format, which packs requests and responses (even for multiple destinations) into the same flit, increases the complexity of flit parsing and chip-level implementation.

4.4. SUE

Scale-Up Ethernet (SUE) is an Ethernet-based GPU interconnection framework proposed by Broadcom Corporation. Its core goal is to provide low-latency, high-bandwidth connectivity for GPU clusters from the rack level to the cross-rack level, support efficient transfer of in-memory transactions (e.g., put, get, and atomic operations) between GPUs, and meet the parallel processing requirements of complex workloads such as machine learning training and AI inference []. The initial specification version of SUE, Scale-Ethernet-RM100, was released in April 2025 and continues to be updated. The discussion in this article is based on the latest specification (RM103) and its AFH (AI forwarding header) Gen2 packet format.
SUE works over standard Ethernet but is simplified and optimized for AI scenarios and intra-node GPU interconnection, enabling efficient memory transaction transfer between GPUs in the scale-up domain. As Figure 9 shows, SUE has a layered protocol stack. At the top of the stack, SUE provides a duplex data and control interface that connects SUE to the GPU's network-on-chip (NoC) to receive commands and data. The mapping and packing layer is responsible for packing and mapping transactions from the GPU upper layer to form the core payload of the SUE Protocol Data Unit (PDU). The transport layer adds a reliability header (RH) and a 32-bit reliability CRC (RCRC) to the PDU; together, the PDU, RH, and RCRC constitute the core carrier for transmission, implement sequence control and ACK/NACK feedback, and ensure the integrity of the PDU in transit. The network layer encapsulates the data link header (SUE AFH) for the SUE PDU and forwards packets by the destination GPU ID. The bottom of the protocol stack is the Ethernet-based physical layer, allowing SUE to reuse Ethernet's high-speed SerDes and other mature technologies.
Figure 9. Overview of the SUE stack []. (The SUE Lite stack does not include the transport layer.)
Besides the standard mode, SUE also offers a lite encapsulation mode called SUE Lite. Both SUE and SUE Lite share the same hierarchical stack architecture but differ in content composition, packet format, and header length. SUE Lite simplifies the protocol stack by removing the reliable transport layer, eliminating the reliability header to reduce hardware overhead and transmission latency, and retaining only the AI forwarding header for data forwarding. The two modes respectively target scenarios requiring high reliability and lightweight deployment.
The length and position of the AFH are compatible with the Ethernet MAC header, but the AFH redefines the destination and source MAC address fields by replacing them with GPU IDs and other content. SUE offers a variety of packet formats based on its AFH header, with SUE's AFH Gen2 header further divided into normal and compressed formats. As shown in Figure 6e, the hop count and entropy fields are removed in the compressed format, and the GPU ID is halved. This shortens the AFH header from 12 bytes (in the normal format) to 6 bytes, and the remaining fields are opaque to the data link layer during packet forwarding, which significantly reduces transmission overhead. In terms of PDU length, the maximum size of SUE's packed PDU is 4096 bytes, while that of SUE Lite's packed PDU is only 1024 bytes. SUE Lite's overall packet length is notably smaller than SUE's, making it more suitable for lightweight scenarios, such as edge-side GPU interconnection, where requirements for hardware resource consumption and transmission latency are more stringent.
SUE enables efficient memory semantics for scale-up GPU clusters through a unified shared memory model and enhanced Ethernet transmission. At the memory layer, SUE builds a unified virtual global memory, encapsulating distributed memory resources (including local host memory, GPU memory, remote node memory, etc.) into a single virtual address space. The unified virtual address is packed into the SUE packet along with the GPU IDs and the memory operations. Working with the scale-up network and the address translation units (ATUs) or memory management units (MMUs) in GPUs, GPUs within the same scale-up domain can directly access the memory of other devices without CPU-GPU data copying.
SUE features flexible deployment capabilities. In terms of topology, SUE supports single-layer switching topologies as shown in Figure 10(Ca), multi-layer switching topologies like Figure 10(Cc), and direct mesh connection topologies like Figure 10(Da), while accommodating varying port configuration rates to meet the needs of both large-scale clusters (up to 1024 or even 4096 GPUs) and small-scale low-latency scenarios. Based on 112G/224G SerDes, SUE can achieve single-port transmission rates from 100 Gbps up to 800 Gbps, and each SUE instance can support multiple port configurations. Additionally, to address requirements for data transmission isolation and resource security in multi-tenant scenarios, SUE reserves a 10-bit field in its RH specifically for multi-tenant isolation. Combined with VC priority partitioning and independent scheduling mechanisms, this ensures that traffic from different tenants or services does not interfere with each other. Moreover, SUE provides two command interfaces, an AXI4-based interface and a signal-based interface, to adapt to different GPU NoCs.
Figure 10. Topology overview (C: CPU, G: GPU, and S: Switch). (A) Tree-based topology: (a) pass through, (b) balance, (c) common, (d) cascade, and (e) hybrid [,]. (B) Cube-based topology: (a) cube-mesh []; (b) 2D-Torus []. (C) Switch-based topology: (a) 1-layer switch [,,,], (b) backbone [,], and (c) 2-layer switch [,]. (D) P2P-based directly fully connected topology: (a) directly connected [], (b) direct routing [], and (c) nD-FullMesh [].
Beyond the above designs, SUE further improves transmission efficiency and reliability through several other technical mechanisms. To improve bandwidth utilization and the payload ratio, SUE relies on an opportunistic transaction packing strategy in the mapping and packing layer. Different from UALink's TL flit packing rules, SUE only packs transactions of the same type targeting the same destination GPU into a single PDU within the length limit, and such packing is opportunistic rather than mandatory. At the link layer, SUE integrates IEEE 802.1Qbb-based PFC and the Ultra Ethernet Consortium (UEC) specification-based CBFC [], enabling fine-grained traffic control for different scenarios. Moreover, Link-Level Retry (LLR) is used to improve reliability beyond FEC by retransmitting dropped or corrupted packets. At the physical layer, SUE adopts a low-latency, lightweight FEC mechanism (such as RS-272) and supports non-interleaved or lightly interleaved FEC blocks across lanes, which achieves a balance between error correction capability and processing latency. Additionally, SUE uses a two-level load balancing mechanism to optimize traffic distribution. Based on congestion status, SUE can efficiently distribute transaction data across different SUE instances on the GPU and across different ports of these instances.
In summary, SUE enables efficient memory semantics and interconnection for GPU clusters based on Ethernet, integrating the universality of the Ethernet ecosystem with the performance advantages of NVLink or UALink and improving the coordination efficiency of heterogeneous computing resources while ensuring compatibility. Meanwhile, Broadcom’s ultra-high-performance and ultra-low-latency switch chip products, such as Tomahawk Ultra (51.2 Tb/s throughput; 250 ns forwarding latency) [], have further guaranteed the implementation and deployment of SUE while providing high-performance guarantees.

4.5. Others

In addition to the new protocols introduced earlier, many GPU or switch chip manufacturers are also actively exploring and making breakthroughs in intra-node GPU interconnection, such as AMD Infinity Fabric Link [,], MTLink [] from Moore Threads, and Blink [] from Biren Technology. These protocols feature low protocol overhead, high transmission efficiency, and high bandwidth. Taking the BR100 [] GPU product from Biren Technology as an example, in addition to implementing CPU-GPU interconnection with PCIe Gen5×16, it adopts Blink to interconnect GPUs, achieving a single-card interconnection bandwidth of up to 448 GB/s and supporting full interconnection of eight cards in a single node with the topology shown in Figure 10(Da). The MTT S4000 [] GPU product from Moore Threads achieves a single-card interconnection bandwidth of up to 240 GB/s through MTLink.
However, these protocol technologies have not been publicly released and have not had a significant large-scale impact on the industry. Moreover, due to the limited scalability of their interconnection topologies and the absence of dedicated protocol-supporting switch chips, most proprietary protocols are difficult to apply in larger-scale HBDs. Therefore, this article does not discuss them further.

5. Comparisons, Discussions, and Insights

In this chapter, we compare and discuss these interconnect technologies across key dimensions. Although the protocols differ in their defined packet formats and specific technical implementations, their core design ideas and overall directions are gradually converging.
Overall, we believe that no protocol design is inherently good or bad; each protocol and its internal mechanisms have their own applicable scenarios and considerations. Table 5 provides a multi-dimensional comparison of these emerging protocols. Since many GPU or switch chip products associated with the emerging protocols are still in the design or manufacturing stage, many protocols cannot yet be fully and quantitatively compared in specific aspects. Hence, we do not limit our analysis to a superficial comparison of specific parameters, protocol features, or the granular implementation details discussed earlier; instead, we conduct in-depth, expansive analyses of critical technologies such as packet design, topology, flow control, retransmission, and network-computing coordination. Through this comparison and discussion, we aim to explore the underlying design principles and philosophies while articulating our technical insights.
Table 5. Comparison of scale-up protocols.

5.1. Evaluation Parameters

Given the emerging nature of intra-node GPU interconnection in scale-up networks, there is no industry-unified evaluation standard yet. While different protocols or interconnection technologies vary in their design objectives, clarifying key assessment dimensions is crucial for technical analysis and scenario adaptation evaluation. From the perspective of requirements and applications, the key parameters of an interconnection technology or protocol should include the supported port bandwidth, end-to-end latency, scalability, protocol overhead, and transmission efficiency, as well as the deployment costs (such as chip area and power consumption) of the protocol engine. Among these, the former parameters may be related to specific demand scenarios, while the latter are more closely associated with hardware implementation and manufacturing processes.

5.2. Protocol Scope

Interconnect protocols are designed around specific scenarios and application requirements, leading to distinct differences in their connected objects, scope, and core pursuits, which are summarized in Table 6.
Table 6. Comparison of protocol scope.
PCIe is engineered for general-purpose interconnection, enabling communication between diverse devices within a server. It primarily targets CPU-peripheral connections, including CPU-GPU, CPU-NIC, CPU-Memory, etc., and it covers short-range domains from on-board to in-rack. Its core pursuit is versatility, ensuring compatibility with various peripheral types. In contrast, RoCE is tailored to the long-range network-level interconnection with RDMA-enabled devices, which is better suited for scale-out scenarios. To support scalability across hosts, racks, and data centers, it leverages Ethernet infrastructure to deliver flexible coverage.
The new scale-up technologies are tailored to high-performance intra-node GPU interconnection and collaboration within the scale-up domain. They optimize GPU-to-GPU data sharing for greater efficiency and directness, enabling a single GPU to directly access the high-bandwidth memory (HBM) of other GPUs in the same system. These GPU-to-GPU links relieve bandwidth pressure on each CPU's PCIe uplink and eliminate the need for data to be routed through system memory or cross-CPU links. With fully connected topologies, all GPU-to-GPU traffic is handled via direct links such as NVLink through P2P communication. While these new interconnect technologies provide high-speed interconnection channels between GPUs, PCIe-based heterogeneous connections remain essential: within the scale-up domain, the system currently still relies on PCIe to connect GPUs to CPUs, supporting critical operations such as configuration, task distribution, control, and maintenance.

5.3. Communication Semantics

Message semantics and memory semantics are two core paradigms for inter-device communication in HPC and AI systems. They represent two fundamentally different communication philosophies in parallel computing and have important applications in GPU clusters and data center interconnection. As illustrated in Table 7, both message and memory semantics have their own pros and cons, distinguished by their design logic and application scenarios. Although some GPUs adopted message semantics in earlier stages, the majority of GPUs and new protocols now mainly adopt the memory-semantics model.
Table 7. Comparison between message semantics and memory semantics.
Message semantics relies on explicit message passing, where GPUs exchange data by sending and receiving discrete packets of messages via RDMA or TCP encapsulation. It transfers data in complete message units, favoring large data blocks (typically a few KB to several MB), requiring pre-allocated send or receive buffers, and each message has clear start and end boundaries. This paradigm enables flexible scale-out expansion for distributed independent server nodes like Ethernet-based AI clusters but introduces additional protocol overhead and microsecond-level communication latency. It requires programmers to explicitly invoke send/receive functions to drive communication, and each computing unit maintains an independent memory address space. In addition, it supports non-blocking asynchronous communication, making it well-suited for loosely coupled tasks under distributed memory architectures, such as data parallelism in AI training.
In contrast, NVLink, OISA, UALink, and SUE adopt memory semantics, which abstracts inter-GPU communication as access to a global shared memory space, enabling multiple GPUs to exchange data as if they were accessing the same memory region. It supports a transparent memory access mechanism, where data transmission manifests as implicit communication and programmers do not need to concern themselves with the physical location of data. Whether the data resides in the memory of a local GPU or a remote GPU, it can be accessed directly through a unified memory address. By emulating native memory operations (e.g., load, store, and atomic operations) and eliminating redundant network protocol encapsulation, memory semantics reduces latency to the nanosecond level and boosts bandwidth to hundreds of GB/s or even TB/s. It also supports flexible data transfer ranging from the byte level and cache line level (32–128 bytes) to several GB of continuous memory blocks. This paradigm is optimized for scale-up scenarios, where GPUs within a server node are tightly interconnected to support fine-grained, low-latency memory access and seamless convergence of computing and communication, as in tensor parallelism or pipeline parallelism for LLMs. Aligned with computing-centric programming models (e.g., GPU load/store instructions) and shared-memory architectures, memory semantics excels in tightly coupled, high-bandwidth, low-latency scenarios.
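As a host-level analogy of the two paradigms (standard-library processes standing in for GPUs, so this is not GPU code), the sketch below contrasts explicit message passing through a queue with implicit communication through a shared buffer; sizes and values are illustrative only.

```python
# Host-level analogy only: multiprocessing.Queue mimics message semantics
# (explicit send/receive of discrete messages with their own buffers), while
# SharedMemory mimics memory semantics (peers simply load/store into one shared
# address space).
from multiprocessing import Process, Queue, shared_memory
import numpy as np

def message_style(q):
    q.put(np.arange(4, dtype=np.float32))       # explicit "send" of a whole message

def memory_style(shm_name):
    shm = shared_memory.SharedMemory(name=shm_name)
    view = np.ndarray((4,), dtype=np.float32, buffer=shm.buf)
    view[:] += 1.0                              # implicit communication: plain stores
    shm.close()

if __name__ == "__main__":
    q = Queue()
    p1 = Process(target=message_style, args=(q,)); p1.start()
    print("received message:", q.get())         # receiver must explicitly "receive"
    p1.join()

    shm = shared_memory.SharedMemory(create=True, size=16)
    buf = np.ndarray((4,), dtype=np.float32, buffer=shm.buf)
    buf[:] = 0.0
    p2 = Process(target=memory_style, args=(shm.name,)); p2.start(); p2.join()
    print("shared buffer after peer stores:", buf)
    shm.close(); shm.unlink()
```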

5.4. Packet Format

In the design of high-performance interconnection protocols, packet format is crucial. It is not only the carrier of various functional features of the protocol but also one of the key identifiers distinguishing different protocols. We have summarized the packet formats of different protocols in Figure 6, from which we can gain insights into their differences in design philosophies.
Firstly, a compact, lightweight packet header is a shared pursuit. Unlike scale-out interconnection protocols such as RoCEv2 (Figure 6g), which have a heavy protocol stack and large packet headers, all these emerging scale-up interconnection protocols distinguish themselves in packet design with a streamlined protocol layer and minimal header overhead. From the perspective of efficiency, GPU memory semantic transactions like load/store are tied to the cache line size and typically operate in 64-byte or 128-byte units, making excessive protocol overhead unacceptable. Meanwhile, network environmental factors further enable such simplification. Compared to scale-out or Ethernet networks, the scale-up domain features a smaller interconnection scale, shorter transmission distance, a more stable and deterministic environment, and lower networking complexity, all of which allow for more streamlined addressing, routing, and protection fields in packets. These new protocols underscore this trend. For example, the control half-flit of UALink is 32 bytes and accommodates several transaction operations, the total TLP header of OISA is approximately 20 bytes, and SUE's AI forwarding header is compressed to as little as 6 bytes.
Beyond this shared pursuit, different protocols diverge in other design philosophies, as reflected in whether their packet structures adopt a layered design and whether they are Ethernet-compatible. Traditional Internet protocols, rooted in the OSI (Open Systems Interconnection) layered architecture, feature protocol headers with clear inter-layer demarcations and underlying protocol type indications. While NVLink, UALink, and SUE clearly adhere to this strict layered paradigm by separating data link (DL) headers from transaction layer (TL) headers, some newer protocols (e.g., OISA) do not adopt such strict layering in their packet formats; their forwarding-related and transaction-related fields are consolidated, resulting in more compact packets. On the other hand, these protocols also diverge in their approaches to compatibility with Ethernet forwarding layers. Designs represented by SUE and OISA retain the protocol type field in bytes 13 and 14, which identifies packet types and indicates parsing rules. In contrast, flit header designs like NVLink and UALink lack this field. The former are compatible with most existing Ethernet-based switch chips, meaning new packet types can be forwarded by adding protocol-specific support without altering the overall architecture; the latter may require dedicated switch engine designs.
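As a rough illustration of this Ethernet-compatibility point, the sketch below shows how a forwarding pipeline can dispatch on a protocol type value kept at the standard EtherType offset; the 0x9999 value and the frame layout beyond the first 14 bytes are placeholders, not fields of any of the protocols discussed.

```python
# Illustrative parser dispatch only: keeping a protocol type field at the
# standard EtherType offset (bytes 13-14, i.e., byte offset 12) lets an
# Ethernet-style pipeline recognize a new scale-up packet type without changing
# its overall parsing architecture. 0x9999 is a placeholder, not an assigned value.

SCALE_UP_ETHERTYPE = 0x9999          # hypothetical protocol type value

def classify(frame: bytes) -> str:
    ethertype = int.from_bytes(frame[12:14], "big")
    if ethertype == SCALE_UP_ETHERTYPE:
        return "scale-up packet: forward by the GPU ID carried in the header"
    return "other Ethernet traffic: standard forwarding path"

# 6-byte destination field + 6-byte source field + 2-byte type field + payload.
frame = bytes(6) + bytes(6) + SCALE_UP_ETHERTYPE.to_bytes(2, "big") + b"payload"
print(classify(frame))
```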

5.5. Interconnect Topology

Interconnection topology is a core element in the HBD GPU interconnect design, whose selection directly determines key performance metrics such as communication efficiency, transmission latency, and system scalability. To make a comprehensive comparison and discussion about topology, we categorize and summarize major topology types in Figure 10.
During the early stages of GPU scale-up interconnection, PCIe emerged as a prevalent approach employing a tree-based topology. As illustrated in Figure 10A, the tree-based topology features a hierarchical structure with diverse configurations, scaling hierarchically around CPUs or PCIe switches. As discussed in Section 3.1, owing to inherent limitations in path flexibility, bandwidth aggregation at upper-layer nodes, and the PCIe lane count on CPUs or PCIe switches, this topology fails to support large-scale direct GPU-to-GPU communication.
Constrained by PCIe's performance and scalability, GPU manufacturers turned to direct inter-GPU interconnection topologies alongside proprietary protocols, such as the cube-based topology (Figure 10B) and the direct-connected topology (Figure 10D). These designs work for small-scale interconnections but face notable limitations when scaled further. Cube-based topologies offer structured connectivity but lack direct paths between all GPU pairs. For instance, in Figure 10B, the two orange-marked GPUs must relay through a CPU or other GPUs, introducing extra latency and performance overhead. Additionally, the P2P interconnection performance is unbalanced. In [,], researchers evaluated P100-based DGX-1 machines with NVLink-V1 and V100-based DGX-1 machines with NVLink-V2, and the results show a clear disparity in communication latency and bandwidth between P2P communication with neighboring GPUs and with remote GPUs. The directly fully connected topology (Figure 10(Da)) excels in point-to-point efficiency but suffers from severe scalability issues: with n GPUs, there are n(n − 1) directed paths. This quadratic growth in connections creates significant challenges for chip design, area, and power consumption as systems scale up further.
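A quick back-of-the-envelope calculation, sketched below under the simplifying assumption of one link per GPU pair, makes this scaling gap concrete.

```python
# Back-of-the-envelope wiring comparison: a direct full mesh needs one physical
# link per GPU pair, while a single-layer switch topology needs one uplink per
# GPU; switch radix and link width are ignored in this illustration.

def full_mesh_links(n_gpus):
    # Undirected GPU-GPU links; the number of directed paths is n(n - 1).
    return n_gpus * (n_gpus - 1) // 2

def switch_uplinks(n_gpus):
    return n_gpus                         # one GPU-to-switch link each

for n in (8, 16, 64, 256):
    print(f"{n:4d} GPUs: full mesh {full_mesh_links(n):6d} links, "
          f"switch-based {switch_uplinks(n):4d} uplinks")
```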
Given these circumstances, switch-based topologies have gained researchers' attention, and emerging technologies like NVLink, OISA, UALink, and SUE all construct their large scale-up domains based on switch chips. By leveraging switches for flexible, high-bandwidth links, these topologies effectively address the scalability and direct-communication shortcomings of tree-based or small-scale designs. NVIDIA first introduced its NVSwitch into NVLink-based interconnections, aiming to build a fully connected topology where each GPU can reach any other in a single hop. The introduction of switches has significantly expanded the scale of the HBD from 8 or 16 GPUs to 64, 128, or even up to 1024. Additionally, switch forwarding latency is only a few hundred nanoseconds, far lower than that of relaying through intermediate CPUs or GPUs. These factors have led to the switch-based topology being adopted by most new protocol designs.
However, switch chips vary across protocols and vendors in capacity, rate, port count, and other parameters, leading to ongoing debates about how to build large-scale switch-based HBDs. In most cases, single-layer switch-based fully connected topologies (e.g., Figure 10(Ca)) are prioritized for their network simplicity, forwarding determinism, and non-blocking interconnection, while two-layer designs (e.g., Figure 10(Cb),(Cc)) introduce more challenges in routing, addressing, load balancing, and ordering. Yet, constrained by switch capabilities and the space, power, and distance limitations of current infrastructure, two-layer networks are still considered in special scenarios. This divergence does not prevent the industry from moving toward a shared direction: pursuing switch-based topology and enhancing switch capabilities in bandwidth, latency, radix, and other key attributes. Moreover, a novel nD-FullMesh topology (Figure 10(Dc)) was proposed in []. It is constructed through recursive expansion with both low-radix and high-radix switches, where adjacent nodes on a single circle are directly connected to each other, forming a tightly coupled direct-connection domain across all network tiers.

5.6. Flow Control

Lossless transmission is the cornerstone of GPU collaborative computing in the scale-up domain. Flow control serves as a core safeguard for lossless data transmission, preventing network congestion and packet loss, and it has long been a hot topic in network research. Considering the specific network environment and the need for efficient implementation in the scale-up domain, flow control is usually implemented per hop and can generally be classified into Ethernet flow control (e.g., priority-based flow control (PFC)) and credit-based flow control (CBFC).
PFC is a per-hop, per-service-class flow control scheme proposed in IEEE 802.1Qbb. PFC supports quality of service (QoS) by defining a 3-bit Priority Code Point (PCP) field that provides eight priority levels, each of which defines a unique class of service (CoS) []. Each CoS has its own virtual channel (VC) and VC queue buffer in each port. When packets enter an input port, they are placed in different ingress queues based on their priorities, and each VC independently back-pressures the sender according to the utilization of its queue. As shown in Figure 11a, when the utilization of VC1 at the receiver surpasses the Xoff threshold, a PFC pause packet is sent to the sender and the sender pauses data transmission; when the utilization at the receiver (e.g., VC3) decreases to the Xon threshold, a PFC resume packet is generated and the sender restarts transmission. However, since it takes time to transmit the pause or resume packets, PFC needs to reserve extra queue buffer for the in-flight packets during this period, which leads to greater storage consumption. Additionally, PFC still suffers from problems such as unfairness [], head-of-line blocking, congestion spreading, and deadlock []. Therefore, apart from the PFC mechanism, new protocol designs generally also consider the CBFC mechanism.
Figure 11. Overview of (a) Priority-based Flow Control (PFC) and (b) Credit-based Flow Control (CBFC).
Credit-based flow control was first proposed for ATM (Asynchronous Transfer Mode) networks [,] and is also used in scale-out networks [,]. CBFC is an optimized alternative to PFC, which operates hop by hop in the network to provide finer-grained flow control than PFC. It achieves flow control by pre-allocating the queue buffer for each VC and representing the remaining buffer space with credit values. Both the receiver and the sender maintain credit counters to track the available buffer space for each VC. Before each data packet transmission, the sender first checks whether the receiver has sufficient space to receive the packet. Only when the receiver has sufficient buffer space will the port scheduler of the sender be allowed to schedule data packets for transmission from the VC queue. When the data is sent out, the sender deducts the credits consumed. After the receiver has processed the data and released the buffer space, it returns the credits to the sender for continued transmission. As shown in Figure 11b, when the remaining credits of VC3 are not enough to send the next packet, which means the remaining buffer of the receiver's VC3 queue is insufficient, the VC3 queue is paused and does not send the packet until enough credits are freed.
CBFC eliminates packet loss caused by congestion in the receiver buffer and realizes reliable transmission. Since there is no need to account for in-flight packets on the link, CBFC requires less buffer consumption than PFC. Moreover, CBFC is not sensitive to link latency, packet size, or the sender's response time: if the link latency is underestimated, PFC may result in buffer overflow and packet loss, whereas CBFC may only lead to insufficient link utilization when credits are not released in time. The price of these advantages is a more complex implementation. PFC only requires the receiver to provide feedback based on its own buffer status, while CBFC requires both the sender and receiver to track the receiver's buffer space, maintain multiple credit counters, and continuously synchronize their credit values to prevent credit leakage and ensure correct operation of the system.
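The contrast can be summarized in a minimal sketch, given below, of the two bookkeeping styles; the thresholds and credit amounts are illustrative values rather than figures from any specification.

```python
# Minimal, simplified sketches of the two per-hop flow-control styles compared
# above; queue depths, thresholds, and credit sizes are illustrative only.

class PfcReceiverQueue:
    """PFC-style: the receiver back-pressures a VC once its buffer crosses Xoff."""
    def __init__(self, xoff, xon):
        self.xoff, self.xon = xoff, xon
        self.used = 0          # bytes currently buffered
        self.paused = False

    def enqueue(self, pkt_bytes):
        self.used += pkt_bytes
        if not self.paused and self.used >= self.xoff:
            self.paused = True
            return "PAUSE"     # pause frame sent back to the sender

    def dequeue(self, pkt_bytes):
        self.used -= pkt_bytes
        if self.paused and self.used <= self.xon:
            self.paused = False
            return "RESUME"    # resume frame sent back to the sender


class CbfcSenderVc:
    """CBFC-style: the sender transmits only while it holds credits for the VC."""
    def __init__(self, credit_bytes):
        self.credits = credit_bytes   # pre-allocated receiver buffer for this VC

    def try_send(self, pkt_bytes):
        if self.credits < pkt_bytes:  # sending now could overflow the receiver
            return False              # the VC queue stays paused
        self.credits -= pkt_bytes
        return True

    def on_credit_return(self, freed_bytes):
        self.credits += freed_bytes   # receiver processed data and freed buffer


# PFC reacts only after the buffer fills past Xoff; CBFC never lets it overfill.
rx = PfcReceiverQueue(xoff=3072, xon=1024)
print(rx.enqueue(2048), rx.enqueue(1024))    # None, then "PAUSE"
tx = CbfcSenderVc(credit_bytes=2048)
print(tx.try_send(1500), tx.try_send(1500))  # True, then False until credits return
```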
Overall, both have their own advantages and disadvantages, and are suitable for different situations. In the introduced emerging protocols, almost all of them support CBFC or similar solutions (e.g., OISA’s BFC), and OISA and SUE also support the PFC mechanism, providing the most basic flow control technology guarantee.

5.7. Retransmission

While flow control mechanisms strive to prevent congestion and packet loss, data packets may still be lost or suffer from bit flips during transmission due to unstable link conditions. A retransmission mechanism is thus essential to ensure that any data loss or error is promptly detected and recovered through retransmission. Yet the introduction of retransmission inevitably adds overhead to both data packets and processing engines. On one hand, packets require additional fields, such as a sequence number, to enable localization of erroneous or lost packets. On the other hand, senders must allocate an extra retransmission buffer to cache transmitted packets until the acknowledgment signals are received, at which point the packets in the retransmission buffer can be released.
As Table 8 shows, the reliable retransmission methods could be divided into three categories, each adapted to different communication scenarios. Go back 0 (GB0) or stop-and-wait protocol is a simple mechanism, which sends one frame and must wait for an acknowledgment (ACK) before proceeding, and only retransmits that frame in case of timeout or frame loss []. Go back N (GBN) can transmit N frames consecutively without waiting for immediate ACKs; if one frame is lost, it retransmits the lost frame and all subsequent frames that have been sent but not yet acknowledged, starting from the lost frame [,]. Selective retransmission (SR) also supports consecutive transmission of N frames, but only retransmits the exact lost frame when frame loss is detected, without involving other frames that have been transmitted normally [,].
Table 8. Comparison of retransmission mechanisms.
In the scale-up domain, GBN and SR are the most widely adopted retransmission methods for efficiency, while retransmission functions are typically deployed in either the transaction layer or the data link layer. These two deployment locations, along with the choice of GBN or SR, involve distinct cost-benefit trade-offs. Both GBN and SR rely on a packet sequence number (PSN) to enable retransmission. The length of the PSN field is determined by factors such as the link rate, packet length, transmission distance, and RTT delay, and it usually introduces an overhead of several bytes in the packet header. Transaction-layer retransmission (TLR) operates at a higher level in the protocol stack and is primarily triggered by random, non-continuous errors, such as data payload corruption or out-of-order transmission, rather than link-level failures. To avoid excessive bandwidth waste from retransmission, TLR usually adopts SR. Data-layer retransmission (DLR), in contrast, targets link-induced issues, such as data loss or errors caused by physical link problems, rather than internal protocol stack failures. Because link errors tend to recur continuously, GBN is the preferred method for DLR. Moreover, GBN's simpler design is more efficient than SR's complexity for addressing such recurring issues.
Currently, the latest versions of OISA, UALink, and SUE all support data-layer retransmission, while transaction-layer retransmission is supported in earlier versions or in a few other protocol designs. Initially, many designers deemed both TLR and DLR necessary in the scale-up network. Yet as protocols have evolved, most now opt to retain DLR and make TLR optional, which is driven by three key characteristics of modern scale-up domain designs. First, the scale-up environment is relatively stable. Especially in current single-layer switch-based topologies, the deterministic interconnection and IO consistency inherently avoid network-induced out-of-order transmission, significantly reducing the need for TLR. Second, the transaction-layer fields and data payloads of most protocols stay unmodified during forwarding. The switch forwards packets based on the protocol header information in the data link layer, and the higher-layer information is transparent to switches. This immutability makes additional transaction-layer protection or retransmission redundant. Third, TLR or SR introduces significant chip area overhead and design complexity, yet its triggering probability is extremely low. Moreover, the PSN overhead for TLR is explicit in the packet header, unlike the DLR PSN, which can be hidden in the packet preamble field. This high-cost, low-benefit trait means TLR is no longer a mandatory requirement for most emerging scale-up protocols.
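For concreteness, the following sketch shows the retransmission bookkeeping difference between GBN and SR on an arbitrary loss pattern; it abstracts away timers, ACK/NACK signaling, and buffer management.

```python
# Simplified retransmission bookkeeping for the schemes compared in Table 8;
# the window contents and the loss pattern are arbitrary illustrations.

def go_back_n(unacked, lost_psn):
    """On a NACK or timeout for lost_psn, resend it and every later frame sent."""
    return [psn for psn in unacked if psn >= lost_psn]

def selective_retransmit(unacked, lost_psns):
    """Resend only the exact frames reported missing."""
    return [psn for psn in unacked if psn in lost_psns]

in_flight = list(range(10, 18))                               # PSNs 10..17 unacknowledged
print("GBN resends:", go_back_n(in_flight, lost_psn=13))      # [13, 14, 15, 16, 17]
print("SR resends: ", selective_retransmit(in_flight, {13}))  # [13]
```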

5.8. Packing Efficiency

Although scale-up protocols have been designed to streamline packet structures and minimize protocol overhead as much as possible, link bandwidth efficiency in scale-up networks remains a critical concern. While large-scale computing data transmission between GPUs in the scale-up domain can be achieved via DMA, there remain large numbers of high-frequency, small-sized packets (e.g., read requests, write completions, atomic operation acknowledgments, etc.). These small packets carry little payload or only control information, so even the already streamlined protocol headers account for a substantial portion of their overall length. Furthermore, the forwarding architectures of some switch chips are optimized for processing large packets and may fail to achieve wire-speed forwarding when handling high-frequency short packets. The combination of these two effects severely constrains the overall performance of systems relying on high-frequency, fine-grained data interactions.
In [,], the researchers proposed the FinePack design, which dynamically coalesces and compresses small writes into a larger I/O message to reduce link-level protocol overhead. This design concept is incorporated into most protocol designs through operations such as flit packing or packet aggregation for transactional data packets. For instance, UALink improves the overall payload rate by packing multiple TL flits together and placing multiple requests, responses, and flow control fields within a single control half-flit; SUE adopts opportunistic transaction packing; and OISA proposes TLP aggregation. The core of these mechanisms lies in merging multiple short packets into a single long packet for transmission. The switch forwards the aggregated packets, and the receiving GPU performs the reverse disaggregation, recovering and processing the individual transaction packets. These approaches increase the average length of physical data packets on the link, amortize the overhead of multiple independent packet headers, reduce the protocol header overhead in memory operations, and ultimately improve the effective bandwidth utilization and overall performance of the scale-up domain.
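A simple payload-efficiency estimate, sketched below with an assumed 20-byte per-packet header and 64-byte transactions (illustrative values, not taken from any of the specifications), shows how quickly aggregation amortizes the fixed header cost.

```python
# Illustrative payload-efficiency estimate for aggregating small, same-type
# transactions into one link packet; the 20-byte header and 64-byte transaction
# size are assumptions for the example, not values from any specification.

def link_efficiency(txn_payload, header, txns_per_packet):
    payload = txn_payload * txns_per_packet
    return payload / (payload + header)

HEADER = 20                                      # assumed per-packet header bytes
for k in (1, 2, 4, 8):
    eff = link_efficiency(txn_payload=64, header=HEADER, txns_per_packet=k)
    print(f"{k} x 64B transactions per packet -> {eff:.1%} payload efficiency")
# 1 -> ~76.2%, 8 -> ~96.2%: packing spreads the fixed header across payloads.
```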

5.9. Coordination of Network and Computing

While scale-up networks and switch chips are essentially just technical means to enhance fast memory access between GPUs and accelerate collaborative computing, their role in multi-GPU systems has become far more critical in practical implementations. This critical role is reflected in two aspects. On one hand, through more intelligent network awareness and scheduling mechanisms, they proactively avoid congestion and packet loss during data transmission, smoothing the way for efficient collaboration between GPUs. On the other hand, beyond enabling high-performance data forwarding and system scaling via switch chips, it is further necessary to drive in-depth collaboration between switch chips and GPUs to achieve higher-quality computing synergy.

5.9.1. Intelligent Network Sensing

In data center networks, a variety of methods are available to detect network state and implement congestion control, such as in-band network telemetry (INT) [,,], active queue management (AQM) [,], delay or RTT-based congestion control [,], etc. However, constrained by the requirements for streamlined packet design and transmission efficiency, these methods often introduce obvious protocol overhead or processing latency, which make them ill-suited for the scale-up network.
OISA proposes an intelligent sensing access mode that collects a variety of network state data along the path during packet transmission by inserting 2-byte sensing tags into the packets. Combined with its feedback mechanism, the sender GPU can obtain real-time network state information about the links, network nodes, and peer GPUs on the transmission path and accurately identify the location of congested nodes. After obtaining this information, the sender GPU can implement intelligent feedback by analyzing the network state, adjusting the subsequent data transmission rate or modifying the transmission path to avoid congestion. At present, the other protocols have not explicitly introduced such sensing or detection mechanisms.
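The idea can be illustrated with the conceptual sketch below; the tag fields, congestion threshold, and rate-adjustment rule are our own assumptions for illustration and do not reflect OISA's actual tag encoding or feedback algorithm.

```python
# Conceptual sketch of path-state sensing and feedback: nodes along the path
# stamp a small tag with their state, and the sender adapts from the feedback.
# All fields, thresholds, and the adjustment rule are illustrative assumptions.

def stamp_sensing_tag(tag, hop_id, queue_depth):
    # Keep only the worst (deepest) queue seen so far and where it was observed.
    if queue_depth > tag.get("max_queue", -1):
        return {"max_queue": queue_depth, "congested_hop": hop_id}
    return tag

def adjust_rate(rate_gbps, tag, threshold):
    # Simple multiplicative decrease when the worst queue exceeds the threshold.
    return rate_gbps * 0.5 if tag["max_queue"] > threshold else rate_gbps

tag = {}
for hop_id, depth in enumerate([3, 12, 5]):      # queue depths along the path
    tag = stamp_sensing_tag(tag, hop_id, depth)
print(tag)                                       # congestion located at hop 1
print(adjust_rate(400.0, tag, threshold=8))      # sender halves its rate to 200.0
```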

5.9.2. In-Network Compute

Beyond the collaborative computing between GPUs, many protocols are now pushing for synergy between GPUs and switch chips within the scale-up domain. This requires switch chips to possess computing capability, enabling computing tasks in collective communication to be offloaded so as to reduce both the computing burden on GPUs and the network load.
As shown in Figure 12, in the absence of a switch, traditional collective communication is typically implemented either via a parameter GPU (Figure 12a) or through Ring-Reduce between GPUs (Figure 12b). This approach not only consumes a significant portion of the GPUs' computing power for collective communication but also leads to frequent transmission of computing data and intermediate results between GPUs, resulting in substantial communication overhead. Introducing a switch chip without computing capability (Figure 12c) does not improve this situation.
Figure 12. Overview of (a) AllReduce via parameter GPU, (b) Ring AllReduce, (c) AllReduce in switch-based network, (d) AllReduce via in-network computation.
In contrast, in Figure 12d, a switch with computing capability is introduced to accelerate collective communication. By having switch chips perform data computations and return results directly, the communication traffic generated by the exchange of computing data and intermediate results between GPU nodes is reduced. This prevents network transmission from becoming a performance bottleneck, thereby shortening the processing latency of overall tasks, achieving computing acceleration, and ultimately enhancing the overall performance of GPU clusters.
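The following sketch quantifies this intuition with textbook cost models, comparing the per-GPU send volume of ring AllReduce against an idealized in-network reduction; it ignores latency terms, protocol overhead, and switch compute limits.

```python
# Rough per-GPU send volume for an AllReduce of S bytes across n GPUs, using the
# textbook ring-AllReduce cost model versus an idealized in-network reduction in
# which each GPU sends its data once toward the reducing switch.

def ring_allreduce_sent_bytes(s_bytes, n):
    return 2 * (n - 1) / n * s_bytes      # reduce-scatter + all-gather phases

def in_network_sent_bytes(s_bytes, n):
    return float(s_bytes)                 # one copy up to the reducing switch

S = 256 * 2**20                           # 256 MB of gradients
for n in (8, 64, 512):
    ring = ring_allreduce_sent_bytes(S, n) / 2**20
    inc = in_network_sent_bytes(S, n) / 2**20
    print(f"n={n:3d}: ring sends {ring:6.1f} MB/GPU, in-network sends {inc:6.1f} MB/GPU")
```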
However, as LLMs evolve and the all-to-all traffic during model training and inference surges, the gains brought by in-network computation (INC) need to be reevaluated. In current protocol designs, NVLink supports its SHArP technology with NVSwitch, and OISA proposes its collective communication acceleration technology, while the current versions of the UALink and SUE protocols have not introduced this mechanism yet.

6. Future Directions

From the analysis of requirements and challenges and the comparison and discussion of existing scale-up network solutions in the previous sections, we can observe that, in some aspects, different emerging solutions have reached consensus and are optimizing in the same direction. However, with the emergence of new demands and the expansion of the scale-up network, many technical and engineering issues still need to be explored or resolved. Notably, even though existing advanced solutions have proposed innovative ideas in protocol and technical design, their practical implementation and large-scale deployment are still deeply constrained by multiple practical factors. Specifically, the key challenges faced by current advanced scale-up network technologies are summarized as follows:
  • Physical constraints: Traditional topologies are limited by switch port counts, chip area, and power consumption, failing to meet ultra-large-scale GPU cluster full-interconnection needs. Multi-layer switching topologies may be an evolution, but many existing protocols have limitations in adapting to this structure. Additionally, copper-based interconnections are constrained by transmission distance and signal attenuation, blocking performance scaling.
  • Narrow protocol scope: Existing advanced protocols focus primarily on GPU-GPU or GPU-Switch interconnection, ignoring interconnection demands of other heterogeneous components required for integrated computing ecosystems.
  • System-level engineering bottlenecks: Practical deployment of scale-up interconnection technologies (e.g., SuperPod) involves coordinated operation of CPUs, GPUs, switch chips, storage, power supply, and thermal management systems, with some complex system-level engineering issues remaining unresolved.
  • Lack of dedicated tools: Dedicated simulation kits and test instruments for scale-up GPU interconnection are scarce, hindering rapid technology verification and iteration.
Based on these challenges, in this section, we outline several directions that we consider worthy of exploration or urgent addressing in the near future.

6.1. Topology Evolution

As the number of GPUs per node in scale-up systems surges and demands for higher bandwidth and lower latency grow, the currently prevalent single-layer switch-based fully connected topology may struggle to meet future needs. Constrained by the port density of switch chips, the single-layer topology restricts interconnection scale, creating bottlenecks when handling massive parallel data flows among hundreds of GPUs. Hence, in the future, the topology may shift to two-layer switch-based topology or more complex topology architectures, which can accommodate larger GPU clusters by distributing traffic across hierarchical layers and expanding port capacity [].
Yet these evolved topologies bring new challenges that require in-depth consideration. First, the increased inter-switch links create multiple data transmission paths, which demand advanced multi-path forwarding mechanisms to fully utilize network resources and avoid underutilization of redundant links. Second, the traffic patterns call for intelligent load balancing strategies that dynamically adjust routing based on real-time link congestion and traffic priorities to prevent localized bottlenecks. Third, multi-path transmission often causes data out-of-order issues that disrupt GPU communication consistency, requiring targeted solutions. Additionally, protocol design must adapt flexibly to changes in application demands and network topologies to ensure compatibility and maintain high performance. These efforts will help scale-up networks break through single-layer switch constraints and support larger GPU clusters efficiently in the future.

6.2. Protocol Scope Expansion

Current scale-up interconnection protocols are largely optimized for intra-node GPU interconnection. We think that future research on scale-up protocols should broaden the protocol scope to encompass GPU-to-CPU links, direct GPU access to host memory, etc., so that all major components within a server node collaborate under a unified set of communication and memory semantics. For GPU-CPU links, future protocols should move beyond basic PCIe-style transactions toward higher-bandwidth, lower-latency, and preferably cache-coherent links that expose a unified address space and predictable latency, thereby removing the CPU-GPU control and data-transfer bottlenecks that hinder large-model training [,]. For host memory, protocols should treat CPU DRAM as a managed extension of GPU memory, supporting fine-grained, zero-copy access, demand paging and prefetching, and QoS-aware placement to reduce redundant copies and mitigate HBM capacity pressure without sacrificing throughput. Expanding the protocol scope in this way enables seamless and close coordination among all components within the entire node, which is highly beneficial for enhancing the performance of the entire GPU cluster.

6.3. SuperPod

As the core computing unit in the AI era, the SuperPod connects multiple GPUs into a logical large GPU via a high-bandwidth and low-latency scale-up network. It is a pivotal development direction to break through the bottlenecks of computing power and interconnection and drive the transformation of intelligent computing servers toward a high-density computing and storage hardware form. Currently, representative products such as NVIDIA’s NVL72 SuperPod [] and Huawei’s CloudMatrix384 GPU SuperPod [] solution have demonstrated technical potential in scenarios like high-density computing power integration and support for large-scale model training. However, to achieve the integrated orchestration of computing power, switch, memory, energy efficiency, and reliability, a series of engineering challenges remain to be tackled [,,,]. These include the ultimate optimization of bandwidth and latency for high-density interconnection, the management and coordinated scheduling of multiple types of components (e.g., CPUs, GPUs, Switches, Memory, etc.), fault recovery, compatibility with the existing infrastructures, the large-scale engineering reliability of liquid cooling and power supply systems, etc. These aspects will also become the core research directions for the maturation and large-scale deployment of SuperPod technology in the future.

6.4. Chiplet

For GPUs and switching chips, when supporting new protocols, in addition to the traditional method of integrating the corresponding IP, we think that chiplet technology should be taken into account as well in the near future. As a pivotal direction for advancing scale-up networks, it integrates multi-GPU computing units and memory modules into heterogeneous chiplets, interconnected via high-speed on-package links, enabling fine-grained resource disaggregation, flexible recomposition, and low-latency, high-bandwidth inter-die communication [,,]. This facilitates a unified memory space across GPUs, breaking memory isolation between discrete chips and addressing the memory wall in large-scale multi-GPU collaboration. Moreover, compared to conventional IP-integrated designs, chiplet enables flexible technology evolution. Computing chiplets use advanced process nodes for peak performance, while IO chiplets balance cost and cycles, reducing die area and boosting yield for faster iterations.

6.5. Optical Interconnection

Current intra-node GPU interconnections in scale-up scenarios still mainly rely on electrical connections such as copper cables, which have obvious bottlenecks. Their bandwidth density is low, which makes it difficult to support high-concurrency data interaction in clusters with 64 or more GPUs, and their signal attenuation is severe: even over short distances, additional compensation circuits are required, which increases latency and power consumption [,]. Meanwhile, their transmission distance is limited, which severely restricts the layout flexibility of in-node devices and the expansion of interconnection scale.
Optical interconnection addresses these pain points with unique advantages. It leverages the high bandwidth density, low signal attenuation, and low latency of optical signals, and replacing traditional electrical interconnections with optical links in intra-node GPU connections can significantly raise the bandwidth ceiling, as single-link throughput can reach several Tbps []. It also eliminates the need for complex signal compensation, which further reduces transmission latency. More importantly, optical signals overcome the distance limitation of electrical signals, and their performance remains nearly undiminished over tens of meters. This effectively breaks the physical boundary constraints of traditional servers and provides a more scalable solution for interconnecting large-scale GPU clusters in intelligent computing scenarios.

6.6. Simulation Kits and Test Instruments

Another critical future direction is the development of dedicated simulation environment kits and testing instruments for the scale-up domain. In scale-out and data center network fields, mature tools already exist, such as the open-source NS-3 [] platform for network scenario simulation, and specialized testing instruments like traffic generators and WAN emulators, which are widely used for protocol validation, traffic modeling, and performance benchmarking. However, scale-up networks have unique characteristics, focusing on multi-GPU low-latency collaboration, heterogeneous device interconnection (GPU-CPU; GPU-Switch), and fine-grained traffic (e.g., synchronization signals; small parameter transfers), which make existing tools inadequate. Ordinary traffic generators also fail to generate packets of the new scale-up protocols, to simulate the dynamic traffic patterns of multi-GPU clusters, or to accurately measure sub-microsecond-level latency.
Thus, dedicated simulation kits and test instruments are essential: First, a scale-up-oriented simulation platform that integrates GPU interconnect topology models, protocol stacks, and heterogeneous device interaction logic. Second, a testing kit including fine-grained traffic generators and real-time latency and bandwidth monitors. Third, tools with scalability to simulate the clusters with hundreds of GPUs. These tools will accelerate protocol research and iteration, reduce physical test costs, and provide reliable validation for scale-up network optimization.

7. Conclusions

This review explores the status, challenges, and future directions of intra-node GPU interconnection in scale-up networks, a critical field driven by the exponential growth of AI applications such as LLM training and inference. Among the core challenges are the stringent demands for extremely high bandwidth, ultra-low latency, and large-scale interconnection, alongside issues including efficient protocol design and lossless network establishment, all of which directly constrain system performance in data-intensive, latency-sensitive scenarios.
We systematically review the design of traditional interconnect protocols such as PCIe, as well as various emerging protocols tailored for intra-node GPU interconnection, discussing their underlying design philosophies and key considerations while offering critical insights and perspectives. Future research should focus on the evolution of scale-up networks and on emerging technologies such as chiplet and optical interconnection, which are critical to breaking through the physical constraints that currently limit interconnection performance. As the first review article in this emerging domain, this work can serve as a foundational reference, guiding subsequent research on scale-up protocol optimization and the engineering implementation of scale-up GPU systems.

Author Contributions

Conceptualization, J.C. and K.L.; investigation, X.S. and X.Z. (Xuxia Zhong); writing—original draft preparation, X.S.; writing—review and editing, X.S., D.Z. and X.Z. (Xiaoguang Zhang); supervision, H.Z. and X.Z. (Xiaoguang Zhang). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Beijing Municipal Science & Technology Commission and the project "Research and Development of the Intra-node Scale-up Interconnect Protocol in SuperPod" (Project No. RZ241100004224022).

Conflicts of Interest

The authors are employed by China Mobile and declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors declare no conflicts of interest.

References

  1. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  2. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  3. Duan, J.; Zhang, S.; Wang, Z.; Jiang, L.; Qu, W.; Hu, Q.; Wang, G.; Weng, Q.; Yan, H.; Zhang, X.; et al. Efficient Training of Large Language Models on Distributed Infrastructures: A Survey. arXiv 2024, arXiv:2407.20018. [Google Scholar] [CrossRef]
  4. Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. Deepseek-v3 technical report. arXiv 2024, arXiv:2412.19437. [Google Scholar]
  5. Team, K.; Bai, Y.; Bao, Y.; Chen, G.; Chen, J.; Chen, N.; Chen, R.; Chen, Y.; Chen, Y.; Chen, Y.; et al. Kimi K2: Open Agentic Intelligence. arXiv 2025, arXiv:2507.20534. [Google Scholar] [CrossRef]
  6. Li, A.; Song, S.L.; Chen, J.; Liu, X.; Tallent, N.; Barker, K. Tartan: Evaluating Modern GPU Interconnect via a Multi-GPU Benchmark Suite. In Proceedings of the 2018 IEEE International Symposium on Workload Characterization (IISWC), Raleigh, NC, USA, 30 September–2 October 2018; pp. 191–202. [Google Scholar] [CrossRef]
  7. Li, A.; Song, S.L.; Chen, J.; Li, J.; Liu, X.; Tallent, N.R.; Barker, K.J. Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect. IEEE Trans. Parallel Distrib. Syst. 2020, 31, 94–110. [Google Scholar] [CrossRef]
  8. Huang, X.; Wang, J. Inter-Data Center RDMA: Challenges, Status, and Future Directions. Future Internet 2025, 17, 242. [Google Scholar] [CrossRef]
  9. Xu, L.; Anthony, Q.; Zhou, Q.; Alnaasan, N.; Gulhane, R.; Shafi, A.; Subramoni, H.; Panda, D.K.D. Accelerating Large Language Model Training with Hybrid GPU-based Compression. In Proceedings of the 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid), Philadelphia, PA, USA, 6–9 May 2024; pp. 196–205. [Google Scholar] [CrossRef]
  10. Liao, X.; Sun, Y.; Tian, H.; Wan, X.; Jin, Y.; Wang, Z.; Ren, Z.; Huang, X.; Li, W.; Tse, K.F.; et al. mFabric: An Efficient and Scalable Fabric for Mixture-of-Experts Training. arXiv 2025, arXiv:2501.03905. [Google Scholar]
  11. Lepikhin, D.; Lee, H.; Xu, Y.; Chen, D.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N.; Chen, Z. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv 2020, arXiv:2006.16668. [Google Scholar] [CrossRef]
  12. Zhang, S.; Zheng, N.; Lin, H.; Jiang, Z.; Bao, W.; Jiang, C.; Hou, Q.; Cui, W.; Zheng, S.; Chang, L.W.; et al. Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts. arXiv 2025, arXiv:2502.19811. [Google Scholar]
  13. Sharma, D.D. PCI-Express: Evolution of a Ubiquitous Load-Store Interconnect Over Two Decades and the Path Forward for the Next Two Decades. IEEE Circuits Syst. Mag. 2024, 24, 47–61. [Google Scholar] [CrossRef]
  14. Chen, C.; Di, W.; Yerong, T.; Yanli, Z. Research on the Development Status of High Speed Interconnection Technologies and Topologies of Multi-GPU Systems. Aero Weapon. 2024, 31, 23–31. [Google Scholar]
  15. Gouk, D.; Kang, S.; Lee, S.; Kim, J.; Nam, K.; Ryu, E.; Lee, S.; Kim, D.; Jang, J.; Bae, H.; et al. CXL-GPU: Pushing GPU Memory Boundaries with the Integration of CXL Technologies. IEEE Micro 2025, 1–8. [Google Scholar] [CrossRef]
  16. Sharma, D.D. Compute Express Link®: An open industry-standard interconnect enabling heterogeneous data-centric computing. In Proceedings of the 2022 IEEE Symposium on High-Performance Interconnects (HOTI), Virtual, 17–19 August 2022; pp. 5–12. [Google Scholar] [CrossRef]
  17. NVIDIA. NVIDIA NVLink. 2025. Available online: https://www.nvidia.com/en-us/data-center/nvlink/ (accessed on 6 October 2025).
  18. AMD. Infinity Fabric (IF)—AMD. 2025. Available online: https://en.wikichip.org/wiki/amd/infinity_fabric (accessed on 6 October 2025).
  19. Schieffer, G.; Shi, R.; Markidis, S.; Herten, A.; Faj, J.; Peng, I. Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric. In Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA, 17–22 November 2024; pp. 567–576. [Google Scholar] [CrossRef]
  20. ChinaMobile. OISA. 2025. Available online: https://www.oisa.org.cn/ (accessed on 6 October 2025).
  21. UALink Consortium™. 2025. Available online: https://ualinkconsortium.org/ (accessed on 15 August 2025).
  22. Broadcom. Scale-Up Ethernet Framework Specification. 2025. Available online: https://docs.broadcom.com/doc/scale-up-ethernet-framework (accessed on 15 August 2025).
  23. Ma, S.; Ma, T.; Chen, K.; Wu, Y. A Survey of Storage Systems in the RDMA Era. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 4395–4409. [Google Scholar] [CrossRef]
  24. Guo, Z.; Liu, S.; Zhang, Z.L. Traffic Control for RDMA-Enabled Data Center Networks: A Survey. IEEE Syst. J. 2020, 14, 677–688. [Google Scholar] [CrossRef]
  25. An, W.; Bi, X.; Chen, G.; Chen, S.; Deng, C.; Ding, H.; Dong, K.; Du, Q.; Gao, W.; Guan, K.; et al. Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning. arXiv 2024, arXiv:2408.14158. [Google Scholar]
  26. Wang, W.; Ghobadi, M.; Shakeri, K.; Zhang, Y.; Hasani, N. Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters. In Proceedings of the 2024 IEEE Symposium on High-Performance Interconnects (HOTI), Albuquerque, NM, USA, 21–23 August 2024; pp. 1–10. [Google Scholar] [CrossRef]
  27. Qian, K.; Xi, Y.; Cao, J.; Gao, J.; Xu, Y.; Guan, Y.; Fu, B.; Shi, X.; Zhu, F.; Miao, R.; et al. Alibaba HPN: A Data Center Network for Large Language Model Training. In Proceedings of the ACM SIGCOMM 2024 Conference, Sydney, Australia, 4–8 August 2024; pp. 691–706. [Google Scholar] [CrossRef]
  28. Das Sharma, D.; Blankenship, R.; Berger, D. An Introduction to the Compute Express Link (CXL) Interconnect. ACM Comput. Surv. 2024, 56, 1–37. [Google Scholar] [CrossRef]
  29. Chen, C.; Zhao, X.; Cheng, G.; Xu, Y.; Deng, S.; Yin, J. Next-Gen Computing Systems with Compute Express Link: A Comprehensive Survey. arXiv 2025, arXiv:2412.20249. [Google Scholar]
  30. Hu, J.; Shen, H.; Liu, X.; Wang, J. RDMA Transports in Datacenter Networks: Survey. IEEE Netw. 2024, 38, 380–387. [Google Scholar] [CrossRef]
  31. Arsid, R. Ultra Ethernet and UALink: Next-Generation Interconnects for AI Infrastructure. IJSAT Int. J. Sci. Technol. 2025, 16, IJSAT25023103. [Google Scholar]
  32. Liao, H.; Liu, B.; Chen, X.; Guo, Z.; Cheng, C.; Wang, J.; Chen, X.; Dong, P.; Meng, R.; Liu, W.; et al. UB-Mesh: A Hierarchically Localized nD-FullMesh Datacenter Network Architecture. IEEE Micro 2025, 45, 20–29. [Google Scholar] [CrossRef]
  33. NVIDIA. NVIDIA DGX SuperPOD. 2025. Available online: https://www.nvidia.com/en-us/data-center/dgx-superpod/ (accessed on 12 November 2025).
  34. Ahn, J.; Choi, S.; Shin, T.; Lee, J.; Yoon, J.; Kim, K.; Son, K.; Suh, H.; Kim, T.; Park, H.; et al. Design and Analysis of Ultra High Bandwidth (UHB) Interconnection-based GPU-Ring for the AI Superchip Module. In Proceedings of the 2024 IEEE 33rd Conference on Electrical Performance of Electronic Packaging and Systems (EPEPS), Toronto, ON, Canada, 6–9 October 2024; pp. 1–3. [Google Scholar] [CrossRef]
  35. Arfeen, D.; Mudigere, D.; More, A.; Gopireddy, B.; Inci, A.; Ganger, G.R. Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training. arXiv 2025, arXiv:2504.06095. [Google Scholar]
  36. AFL. AI Data Centers: Scaling UP and Scaling Out. 2025. Available online: https://www.aflhyperscale.com/wp-content/uploads/2024/12/AI-Data-Centers-Scaling-Up-and-Scaling-Out-White-Paper.pdf (accessed on 12 November 2025).
  37. Kalyanasundharam, N. Introducing UALink 200G 1.0 Specification. 2025. Available online: https://ualinkconsortium.org/wp-content/uploads/2025/04/UALink-1.0-White_Paper_FINAL.pdf (accessed on 15 August 2025).
  38. Tarraga-Moreno, J.; Escudero-Sahuquillo, J.; Garcia, P.J.; Quiles, F.J. Understanding intra-node communication in HPC systems and Datacenters. arXiv 2025, arXiv:2502.20965. [Google Scholar] [CrossRef]
  39. Qi, H.; Dai, L.; Chen, W.; Jia, Z.; Lu, X. Performance Characterization of Large Language Models on High-Speed Interconnects. In Proceedings of the 2023 IEEE Symposium on High-Performance Interconnects (HOTI), Virtual, 23–25 August 2023; pp. 53–60. [Google Scholar] [CrossRef]
  40. Si, M.; Balaji, P.; Chen, Y.; Chu, C.H.; Gangidi, A.; Hasan, S.; Iyengar, S.; Johnson, D.; Liu, B.; Ren, R.; et al. Collective Communication for 100k+ GPUs. arXiv 2025, arXiv:2510.20171. [Google Scholar]
  41. Zhao, C.; Deng, C.; Ruan, C.; Dai, D.; Gao, H.; Li, J.; Zhang, L.; Huang, P.; Zhou, S.; Ma, S.; et al. Insights into deepseek-v3: Scaling challenges and reflections on hardware for ai architectures. In Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA), Tokyo, Japan, 21–25 June 2025; pp. 1731–1745. [Google Scholar]
  42. Bittner, R.; Ruf, E. Direct GPU/FPGA Communication via PCI Express. In Proceedings of the 2012 41st International Conference on Parallel Processing Workshops (ICPPW), Pittsburgh, PA, USA, 10–13 September 2012; pp. 135–139. [Google Scholar] [CrossRef]
  43. Foley, D.; Danskin, J. Ultra-Performance Pascal GPU and NVLink Interconnect. IEEE Micro 2017, 37, 7–17. [Google Scholar] [CrossRef]
  44. Muthukrishnan, H.; Lustig, D.; Villa, O.; Wenisch, T.; Nellans, D. Finepack: Transparently improving the efficiency of fine-grained transfers in multi-gpu systems. In Proceedings of the 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Montreal, QC, Canada, 25 February–1 March 2023; pp. 516–529. [Google Scholar]
  45. Chu, C.H.; Hashmi, J.M.; Khorassani, K.S.; Subramoni, H.; Panda, D.K. High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems. In Proceedings of the 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC), Hyderabad, India, 17–20 December 2019; pp. 267–276. [Google Scholar] [CrossRef]
  46. De Sensi, D.; Pichetti, L.; Vella, F.; De Matteis, T.; Ren, Z.; Fusco, L.; Turisini, M.; Cesarini, D.; Lust, K.; Trivedi, A.; et al. Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, Atlanta, GA, USA, 17–22 November 2024. [Google Scholar] [CrossRef]
  47. Shou, C.; Liu, G.; Nie, H.; Meng, H.; Zhou, Y.; Jiang, Y.; Lv, W.; Xu, Y.; Lu, Y.; Chen, Z.; et al. InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers. In Proceedings of the ACM SIGCOMM 2025 Conference, Coimbra, Portugal, 8–11 September 2025; pp. 1–23. [Google Scholar] [CrossRef]
  48. Saber, M.G.; Jiang, Z. Physical Layer Standardization for AI Data Centers: Challenges, Progress and Perspectives. IEEE Netw. 2025. [Google Scholar] [CrossRef]
  49. Zuo, P.; Lin, H.; Deng, J.; Zou, N.; Yang, X.; Diao, Y.; Gao, W.; Xu, K.; Chen, Z.; Lu, S.; et al. Serving Large Language Models on Huawei CloudMatrix384. arXiv 2025, arXiv:2506.12708. [Google Scholar] [CrossRef]
  50. Sharma, D.D. Compute Express Link (CXL): Enabling Heterogeneous Data-Centric Computing With Heterogeneous Memory Hierarchy. IEEE Micro 2023, 43, 99–109. [Google Scholar] [CrossRef]
  51. Hong, M.; Xu, L. BR100 GPGPU: Accelerating Datacenter Scale AI Computing. In Proceedings of the 2022 IEEE Hot Chips 34 Symposium (HCS), Cupertino, CA, USA, 21–23 August 2022; pp. 1–22. [Google Scholar] [CrossRef]
  52. PCI-SIG. Available online: https://pcisig.com/ (accessed on 6 October 2025).
  53. NVIDIA. NVIDIA H100 Tensor Core GPU Architecture Whitepaper. 2022. Available online: https://www.nvidia.cn/lp/data-center/resources/download-hopper-arch-whitepaper/ (accessed on 6 October 2025).
  54. AMD. Data Sheet—AMD Instinct™ MI300X Accelerator. 2023. Available online: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300x-data-sheet.pdf (accessed on 6 October 2025).
  55. CXL. CXL-3.1-Specification. 2024. Available online: https://computeexpresslink.org/wp-content/uploads/2024/02/CXL-3.1-Specification.pdf (accessed on 15 August 2025).
  56. Wang, X.; Liu, J.; Wu, J.; Yang, S.; Ren, J.; Shankar, B.; Li, D. Performance Characterization of CXL Memory and Its Use Cases. In Proceedings of the 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Milano, Italy, 3–7 June 2025; pp. 1048–1061. [Google Scholar] [CrossRef]
  57. Wang, Z.; Luo, L.; Ning, Q.; Zeng, C.; Li, W.; Wan, X.; Xie, P.; Feng, T.; Cheng, K.; Geng, X.; et al. SRNIC: A Scalable Architecture for RDMA NICs. In Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), Boston, MA, USA, 17–19 April 2023; pp. 1–14. [Google Scholar]
  58. Sun, Z.; Guo, Z.; Ma, J.; Pan, Y. A High-Performance FPGA-Based RoCE v2 RDMA Packet Parser and Generator. Electronics 2024, 13, 4107. [Google Scholar] [CrossRef]
  59. Star Oceans Wiki. NVLink. 2022. Available online: http://www.staroceans.org/wiki/A/NVLink (accessed on 7 August 2025).
  60. NVIDIA. NVIDIA GB200 NVL72. 2025. Available online: https://www.nvidia.com/en-us/data-center/gb200-nvl72/ (accessed on 6 October 2025).
  61. NVIDIA. NVIDIA GB300 NVL72. 2025. Available online: https://www.nvidia.com/en-us/data-center/gb300-nvl72/ (accessed on 6 October 2025).
  62. Danskin, J.; Foley, D. Pascal GPU with NVLink. In Proceedings of the 2016 IEEE Hot Chips 28 Symposium (HCS), Cupertino, CA, USA, 21–23 August 2016. [Google Scholar]
  63. NVIDIA. NVIDIA Tesla P100. 2016. Available online: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf (accessed on 6 October 2025).
  64. NVIDIA. NVIDIA DGX-1 with Tesla V100 System Architecture. 2017. Available online: https://images.nvidia.com/content/pdf/dgx1-v100-system-architecture-whitepaper.pdf (accessed on 6 October 2025).
  65. Ishii, A.; Foley, D. NVSwitch and DGX-2—NVIDIA’s NVLink-Switching Chip and Scale-Up GPU-Compute Server. In Proceedings of the 2018 IEEE Hot Chips 30 Symposium (HCS), Cupertino, CA, USA, 19–21 August 2018. [Google Scholar]
  66. NVIDIA. Unified Memory for CUDA Beginners. 2017. Available online: https://developer.nvidia.com/blog/unified-memory-cuda-beginners/ (accessed on 6 October 2025).
  67. Graham, R.L.; Bureddy, D.; Lui, P.; Rosenstock, H.; Shainer, G.; Bloch, G.; Goldenberg, D.; Dubman, M.; Kotchubievsky, S.; Koushnir, V.; et al. Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction. In Proceedings of the 2016 First International Workshop on Communication Optimizations in HPC (COMHPC), Salt Lake City, UT, USA, 13–18 November 2016; pp. 1–10. [Google Scholar] [CrossRef]
  68. Ramesh, B.; Kuncham, G.K.R.; Suresh, K.K.; Vaidya, R.; Alnaasan, N.; Abduljabbar, M.; Shafi, A.; Subramoni, H.; Panda, D.K.D. Designing In-network Computing Aware Reduction Collectives in MPI. In Proceedings of the 2023 IEEE Symposium on High-Performance Interconnects (HOTI), Virtual, 23–25 August 2023; pp. 25–32. [Google Scholar] [CrossRef]
  69. Graham, R.L.; Levi, L.; Burredy, D.; Bloch, G.; Shainer, G.; Cho, D.; Elias, G.; Klein, D.; Ladd, J.; Maor, O.; et al. Scalable hierarchical aggregation and reduction protocol (sharp) tm streaming-aggregation hardware design and evaluation. In Proceedings of the International Conference on High Performance Computing, Frankfurt am Main, Germany, 22–25 June 2020; pp. 41–59. [Google Scholar]
  70. NVIDIA Corporation. NVIDIA NVLink Fusion. 2025. Available online: https://www.nvidia.cn/data-center/nvlink-fusion/ (accessed on 6 August 2025).
  71. Onufryk, P. UALink 200G 1.0 Specification Overview. 2025. Available online: https://staging.ualinkconsortium.org/wp-content/uploads/2025/04/UALink-1.0-Specification-Webinar_FINAL.pdf (accessed on 15 August 2025).
  72. Brown, D.; Lusted, K. UALink 200G 1.0 Specification Overview: Data Link Layer (DL) and Physical Layer (PL). 2025. Available online: https://www.ieee802.org/3/ad_hoc/E4AI/public/25_0624/lusted_e4ai_01_250624.pdf (accessed on 15 August 2025).
  73. Norrie, T.; Patil, N.; Yoon, D.H.; Kurian, G.; Li, S.; Laudon, J.; Young, C.; Jouppi, N.P.; Patterson, D. Google’s Training Chips Revealed: TPUv2 and TPUv3. In Proceedings of the 2020 IEEE Hot Chips 32 Symposium (HCS), Palo Alto, CA, USA, 16–18 August 2020; pp. 1–70. [Google Scholar] [CrossRef]
  74. NVIDIA. NVIDIA DGX A100 User Guide. 2025. Available online: https://docs.nvidia.com/dgx/dgxa100-user-guide/introduction-to-dgxa100.html (accessed on 10 November 2025).
  75. NVIDIA. NVIDIA DGX-2. 2025. Available online: https://www.nvidia.com/en-in/data-center/dgx-2/ (accessed on 10 November 2025).
  76. Intel. Intel Gaudi 3 AI Accelerator. 2025. Available online: https://www.intel.com/content/www/us/en/content-details/817486/intel-gaudi-3-ai-accelerator-white-paper.html (accessed on 10 November 2025).
  77. Ultra Ethernet Consortium ™. Ultra Ethernet ™ Specification v1.0. 2025. Available online: https://ultraethernet.org/wp-content/uploads/sites/20/2025/06/UE-Specification-6.11.25.pdf (accessed on 15 August 2025).
  78. BROADCOM. Tomahawk Ultra/BCM78920 Series 51.2Tb/s StrataXGS Tomahawk Ultra Low-Latency Ethernet Switch Series. 2025. Available online: https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm78920-series (accessed on 6 October 2025).
  79. MOORE THREADS. MTLink-S4000. 2025. Available online: https://www.mthreads.com/product/S4000 (accessed on 15 August 2025).
  80. Wang, S.Y.; Chen, Y.R.; Hsieh, H.C.; Lai, R.S.; Lin, Y.B. A Flow Control Scheme Based on Per Hop and Per Flow in Commodity Switches for Lossless Networks. IEEE Access 2021, 9, 156013–156029. [Google Scholar] [CrossRef]
  81. Bahnasy, M.; Elbiaze, H. Fair Congestion Control Protocol for Data Center Bridging. IEEE Syst. J. 2019, 13, 4134–4145. [Google Scholar] [CrossRef]
  82. Wu, X.C.; Eugene Ng, T.S. Detecting and Resolving PFC Deadlocks with ITSY Entirely in the Data Plane. In Proceedings of the IEEE INFOCOM 2022—IEEE Conference on Computer Communications, London, UK, 2–5 May 2022; pp. 1928–1937. [Google Scholar] [CrossRef]
  83. Kung, N.; Morris, R. Credit-based flow control for ATM networks. IEEE Netw. 1995, 9, 40–48. [Google Scholar] [CrossRef]
  84. Anderson, T.E.; Owicki, S.S.; Saxe, J.B.; Thacker, C.P. High-speed switch scheduling for local-area networks. ACM Trans. Comput. Syst. 1993, 11, 319–352. [Google Scholar] [CrossRef]
  85. Li, L.; Chen, Y.; Lu, H.; He, L.; Gao, L.; Wang, N. Credit-R: Enhancing Credit-Based Congestion Control in Cross-Data Center Networks. In Proceedings of the 2024 10th International Conference on Computer and Communications (ICCC), Chengdu, China, 13–16 December 2024; pp. 1458–1463. [Google Scholar] [CrossRef]
  86. Malhotra, A.; Chitre, K. Performance analysis of data link layer protocols with a special emphasis on improving the performance of Stop-and-Wait-ARQ. In Proceedings of the 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 16–18 March 2016; pp. 593–597. [Google Scholar]
  87. Hayashida, Y.; Sugimachi, N.; Komatsu, M.; Yoshida, Y. Go-back-N system with limited retransmissions. In Proceedings of the Eighth Annual International Phoenix Conference on Computers and Communications, Scottsdale, AZ, USA, 22–24 March 1989; pp. 183–187. [Google Scholar] [CrossRef]
  88. Lee, T.H. The throughput efficiency of go-back-N ARQ scheme for burst-error channels. In Proceedings of the IEEE INFOCOM '91, the Conference on Computer Communications, Tenth Annual Joint Conference of the IEEE Computer and Communications Societies, Bal Harbour, FL, USA, 7–11 April 1991; Volume 2, pp. 773–780. [Google Scholar] [CrossRef]
  89. Rati Preethi, S.; Kumar, P.; Anil, S.; Chandavarkar, B.R. Predictive Selective Repeat—An Optimized Selective Repeat for Noisy Channels. In Proceedings of the 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), Delhi, India, 6–8 July 2023; pp. 1–6. [Google Scholar] [CrossRef]
  90. Zhang, M.; Guo, Z.; Sun, S. RoSR: A Novel Selective Retransmission FPGA Architecture for RDMA NICs. IEEE Comput. Archit. Lett. 2025, 24, 269–272. [Google Scholar] [CrossRef]
  91. Muthukrishnan, H. Improving Multi-GPU Strong Scaling Through Optimization of Fine-Grained Transfers. Ph.D. Thesis, University of Michigan, Ann Arbor, MI, USA, 2022. [Google Scholar]
  92. Tan, L.; Su, W.; Miao, J.; Zhang, W. FindINT: Detect and Locate the Lost in-Band Network Telemetry Packet. IEEE Netw. Lett. 2022, 4, 20–24. [Google Scholar] [CrossRef]
  93. Wang, H.; Liu, Y.; Li, W.; Yang, Z. Multi-Agent Deep Reinforcement Learning-Based Fine-Grained Traffic Scheduling in Data Center Networks. Future Internet 2024, 16, 119. [Google Scholar] [CrossRef]
  94. Li, Y.; Miao, R.; Liu, H.H.; Zhuang, Y.; Feng, F.; Tang, L.; Cao, Z.; Zhang, M.; Kelly, F.; Alizadeh, M.; et al. HPCC: High precision congestion control. In Proceedings of the ACM Special Interest Group on Data Communication, Beijing, China, 19–24 August 2019; pp. 44–58. [Google Scholar] [CrossRef]
  95. Huang, H.; Xue, G.; Wang, Y.; Zhang, H. An adaptive active queue management algorithm. In Proceedings of the 2013 3rd International Conference on Consumer Electronics, Communications and Networks, Xianning, China, 20–22 November 2013; pp. 72–75. [Google Scholar] [CrossRef]
  96. Guo, Y.; Meng, Z.; Wang, B.; Xu, M. Inferring in-Network Queue Management from End Hosts in Real-Time Communications. In Proceedings of the ICC 2024—IEEE International Conference on Communications, Denver, CO, USA, 9–13 June 2024; pp. 3389–3395. [Google Scholar] [CrossRef]
  97. Mittal, R.; Lam, V.T.; Dukkipati, N.; Blem, E.; Wassel, H.; Ghobadi, M.; Vahdat, A.; Wang, Y.; Wetherall, D.; Zats, D. TIMELY: RTT-based Congestion Control for the Datacenter. SIGCOMM Comput. Commun. Rev. 2015, 45, 537–550. [Google Scholar] [CrossRef]
  98. Kumar, G.; Dukkipati, N.; Jang, K.; Wassel, H.M.G.; Wu, X.; Montazeri, B.; Wang, Y.; Springborn, K.; Alfeld, C.; Ryan, M.; et al. Swift: Delay is Simple and Effective for Congestion Control in the Datacenter. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication, Amsterdam, The Netherlands, 22–26 August 2020; pp. 514–528. [Google Scholar] [CrossRef]
  99. Tithi, J.J.; Wu, H.; Abuhatzera, A.; Petrini, F. Scaling Intelligence: Designing Data Centers for Next-Gen Language Models. arXiv 2025, arXiv:2506.15006. [Google Scholar] [CrossRef]
  100. Cui, S.; Patke, A.; Nguyen, H.; Ranjan, A.; Chen, Z.; Cao, P.; Bode, B.; Bauer, G.; Martino, C.D.; Jha, S.; et al. Characterizing GPU Resilience and Impact on AI/HPC Systems. arXiv 2025, arXiv:2503.11901. [Google Scholar]
  101. Gujar, V. Chiplet Technology: Revolutionizing Semiconductor Design-A Review. Saudi J. Eng. Technol. 2024, 9, 69–74. [Google Scholar] [CrossRef]
  102. Li, C.; Jiang, F.; Chen, S.; Li, X.; Liu, J.; Zhang, W.; Xu, J. Towards Scalable GPU System with Silicon Photonic Chiplet. In Proceedings of the 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), Valencia, Spain, 25–27 March 2024; pp. 1–6. [Google Scholar] [CrossRef]
  103. Li, C.; Jiang, F.; Chen, S.; Li, X.; Liu, Y.; Chen, L.; Li, X.; Xu, J. RONet: Scaling GPU System with Silicon Photonic Chiplet. In Proceedings of the 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), San Francisco, CA, USA, 28 October–2 November 2023; pp. 1–9. [Google Scholar] [CrossRef]
  104. Cheng, Q.; Glick, M.; Bergman, K. Chapter 18—Optical interconnection networks for high-performance systems. In Optical Fiber Telecommunications VII; Willner, A.E., Ed.; Academic Press: Cambridge, MA, USA, 2020; pp. 785–825. [Google Scholar] [CrossRef]
  105. Yang, C.; Hu, B.; Chen, P.; Liu, Y.; Zhang, W.; Xu, J. BEAM: A Multi-Channel Optical Interconnect for Multi-GPU Systems. In Proceedings of the 2025 Design, Automation & Test in Europe Conference (DATE), Lyon, France, 31 March–2 April 2025; pp. 1–7. [Google Scholar] [CrossRef]
  106. NS-3. NS-3 Network Simulator. Available online: https://www.nsnam.org/ (accessed on 15 August 2025).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
