Term

NVLink

Also known as: NVLink

Overview

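A high-bandwidth, low-latency communication protocol and interface developed by NVIDIA for connecting GPUs directly to one another. In large GPU clusters, it is a key technology for removing communication bottlenecks and improving MFU (Model FLOPs Utilization).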
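As a rough illustration of why interconnect bandwidth affects MFU, the sketch below computes MFU as useful model FLOPs divided by what the hardware could deliver at peak over the same step time. It assumes communication is not overlapped with compute, and every number in it (FLOP counts, byte counts, link speeds) is a hypothetical placeholder rather than a measured NVLink or PCIe figure.

```python
def step_time(compute_s: float, comm_bytes: float, link_bytes_per_s: float) -> float:
    """Time for one training step, assuming communication is NOT overlapped
    with compute (a simplifying, worst-case assumption)."""
    return compute_s + comm_bytes / link_bytes_per_s


def mfu(model_flops_per_step: float, step_time_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs Utilization: useful FLOPs per step divided by the FLOPs
    the hardware could deliver at peak during that step."""
    return model_flops_per_step / (step_time_s * peak_flops_per_s)


# Hypothetical placeholder values, chosen only to make the arithmetic easy.
MODEL_FLOPS = 4.0e14   # useful FLOPs per step
PEAK_FLOPS = 1.0e15    # hardware peak, FLOPs per second
COMPUTE_S = 0.5        # pure compute time per step, seconds
COMM_BYTES = 2.0e10    # bytes exchanged between GPUs per step

SLOW_LINK = 5.0e10     # bytes per second, hypothetical slower interconnect
FAST_LINK = 5.0e11     # bytes per second, hypothetical faster interconnect

slow = mfu(MODEL_FLOPS, step_time(COMPUTE_S, COMM_BYTES, SLOW_LINK), PEAK_FLOPS)
fast = mfu(MODEL_FLOPS, step_time(COMPUTE_S, COMM_BYTES, FAST_LINK), PEAK_FLOPS)

print(f"MFU with slower link: {slow:.3f}")
print(f"MFU with faster link: {fast:.3f}")
```

In this toy model the compute time is fixed, so any reduction in communication time shortens the step and raises MFU directly. Real systems overlap communication with compute, which makes the relationship less direct.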

Research Papers

5 papers
  • Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

    Ang Li, S. Song, Jieyang Chen, Jiajia Li, Xu Liu, Nathan R. Tallent, K. Barker

    2019 · 287 citations · Semantic Scholar

    High performance multi-GPU computing becomes an inevitable trend due to the ever-increasing demand on computation capability in emerging domains such as deep learning, big data and planet-scale simulations. However, the lack of deep understanding on how modern GPUs can be connected and the real impact of state-of-the-art interconnect technology on multi-GPU application performance become a hurdle. In this paper, we fill the gap by conducting a thorough evaluation on five latest types of modern GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI and NVSwitch, from six high-end servers and HPC platforms: NVIDIA P100-DGX-1, V100-DGX-1, DGX-2, OLCF's SummitDev and Summit supercomputers, as well as an SLI-linked system with two NVIDIA Turing RTX-2080 GPUs. Based on the empirical evaluation, we have observed four new types of GPU communication network NUMA effects: three are triggered by NVLink's topology, connectivity and routing, while one is caused by PCIe chipset design issue. These observations indicate that, for an application running in a multi-GPU node, choosing the right GPU combination can impose considerable impact on GPU communication efficiency, as well as the application's overall performance. Our evaluation can be leveraged in building practical multi-GPU performance models, which are vital for GPU task allocation, scheduling and migration in a shared environment (e.g., AI cloud and HPC centers), as well as communication-oriented performance tuning.

  • 9.3 NVLink-C2C: A Coherent Off Package Chip-to-Chip Interconnect with 40Gbps/pin Single-ended Signaling

    Yingcan Wei, Y. C. Huang, Haiming Tang, N. Sankaran, Ish Chadha, D. Dai, Olakanmi Oluwole, V. Balan, Edward Lee

    2023 · 35 citations · Semantic Scholar

    NVLink-C2C is the enabler for Nvidia's Grace-Hopper and Grace Superchip systems, with 900GB/s link between Grace and Hopper, or between two Grace chips. The connection provides a unified, cache-coherent memory address space that combines system and HBM GPU memories for simplified programmability. This coherent, high-bandwidth, low-power, low latency connection between CPU and GPUs is key to accelerating the most complex AI and HPC workloads.

  • The Nvlink-Network Switch: Nvidia’s Switch Chip for High Communication-Bandwidth Superpods

    A. Ishii, Ryan Wells

    2022 · 28 citations · Semantic Scholar

  • Efficient Multi-Path NVLink/PCIe-Aware UCX based Collective Communication for Deep Learning

    Y. H. Temuçin, A. Sojoodi, Pedram Alizadeh, A. Afsahi

    2021 · 14 citations · Semantic Scholar

    High-performance communication for very large messages on modern multi-GPU nodes has become increasingly important for Deep Learning workloads. These computing nodes are equipped with state-of-the-art interconnects, such as Nvidia's NVLink and PCIe, to facilitate communications between GPUs, and GPUs with the host processors. In this paper, we take on the challenge to design efficient intra-socket GPU-to-GPU communication using multiple NVLink channels at the UCX and MPI levels, and then utilise it to design an intra-node hierarchical NVLink/PCIe-aware GPU based MPI_Allreduce to enhance Horovod + TensorFlow with different models. UCX only utilises a small portion of the available NVLink bandwidth for intra-socket GPU-to-GPU communication. We propose a novel data transfer mechanism that stripes the message across multiple intra-socket communication channels and multiple memory regions using multiple GPU streams to utilise all available NVLink paths. Our approach achieves 1.69x and 1.84x higher bandwidth for UCX and Open MPI + UCX, respectively. We observe similar bandwidth improvements for large messages for MPI point-to-point communication when compared to other MPI implementations as they are also limited by data transfers by a single path. We then propose a 3-stage hierarchical, pipelined MPI_Allreduce design that incorporates the new multi-path NVLink data transfer mechanism for intra-socket communications in the first and third stages of the collective, and PCIe and X-bus channels for inter-socket GPU communication in the second stage with minimal interference. For large messages, our proposed algorithm achieves a high speedup when compared to Spectrum MPI, Open MPI + UCX, Open MPI + HPC-X, MVAPICH2-GDR, and NCCL. We also observe significant speedup for the proposed MPI_Allreduce for Horovod with TensorFlow with a variety of Deep Learning models.

  • Towards Memory Disaggregation via NVLink C2C: Benchmarking CPU-Requested GPU Memory Access

    Felix Werner, Marcel Weisgut, T. Rabl

    2025 · 8 citations · Semantic Scholar

    Memory disaggregation decouples compute and memory resources, enabling efficient use of resources. Several interconnect technologies provide cache-coherent access to remote memory regions, which eases the use of disaggregated memory. Recent NVIDIA-based systems use the NVLink C2C interconnect, which provides cache-coherent memory access between CPUs and GPUs and their memory. While GPUs and NVLink are widely used to accelerate complex workloads, NVLink’s viability for connecting memory-expansion devices to a CPU remains unexplored. In this work, we quantify the characteristics of NVIDIA’s Grace CPU when accessing GPU memory via NVLink to assess NVLink’s viability for memory expansion. We benchmark throughput and latency for memory accesses on an NVIDIA Grace-Hopper system. We evaluate memory expansion when the CPU accesses both CPU and GPU memory and quantify the performance of database index operations with data stored in GPU memory. Our experiments show a throughput of up to 168 GB/s and access latencies between about 800 ns and 1000 ns.

Mentioned Articles

16 articles

External Mentions

4 mentions