Term

NVLink

Also known as: NVLink

Overview

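A high-bandwidth, low-latency communication protocol and interface developed by NVIDIA for connecting GPUs directly to one another. In large GPU clusters, it is a key technology for relieving communication bottlenecks and improving MFU (Model FLOPs Utilization).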
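As a rough illustration of why interconnect bandwidth affects MFU, the sketch below treats MFU as the FLOPs a model needs per training step divided by what the hardware could deliver at peak during that step. It assumes, for simplicity, that communication is not overlapped with compute at all. Every number and function name here is hypothetical and chosen only to make the arithmetic easy to follow. None of it is a measurement of any real system.

```python
def step_time_s(compute_s: float, bytes_exchanged: float, link_bytes_per_s: float) -> float:
    """Step time under the (pessimistic) assumption of zero compute/communication overlap."""
    return compute_s + bytes_exchanged / link_bytes_per_s


def mfu(model_flops_per_step: float, step_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs Utilization: useful FLOPs divided by peak FLOPs available in the step."""
    return model_flops_per_step / (step_s * peak_flops_per_s)


# Hypothetical workload and hardware figures (assumptions, not vendor specs).
PEAK_FLOPS = 1e15        # peak FLOP/s of the device
MODEL_FLOPS = 6e14       # FLOPs the model needs per step
COMPUTE_S = 1.0          # time spent computing per step
BYTES_EXCHANGED = 9e10   # bytes each GPU exchanges per step

SLOW_LINK = 3e10         # 30 GB/s, a hypothetical slower link
FAST_LINK = 9e11         # 900 GB/s, a hypothetical faster link

mfu_slow = mfu(MODEL_FLOPS, step_time_s(COMPUTE_S, BYTES_EXCHANGED, SLOW_LINK), PEAK_FLOPS)
mfu_fast = mfu(MODEL_FLOPS, step_time_s(COMPUTE_S, BYTES_EXCHANGED, FAST_LINK), PEAK_FLOPS)

print(f"MFU with slow link: {mfu_slow:.3f}")
print(f"MFU with fast link: {mfu_fast:.3f}")
```

In practice, frameworks overlap communication with compute, so the real gap is usually smaller than this no-overlap model suggests. The direction of the effect is the same: less time spent waiting on the link leaves a larger share of each step for useful work.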

Research Papers

5 papers

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect
Ang Li, S. Song, Jieyang Chen, Jiajia Li, Xu Liu, Nathan R. Tallent, K. Barker

2019, 287 citations (Semantic Scholar)

High performance multi-GPU computing becomes an inevitable trend due to the ever-increasing demand on computation capability in emerging domains such as deep learning, big data and planet-scale simulations. However, the lack of deep understanding on how modern GPUs can be connected and the real impact of state-of-the-art interconnect technology on multi-GPU application performance become a hurdle. In this paper, we fill the gap by conducting a thorough evaluation on five latest types of modern GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI and NVSwitch, from six high-end servers and HPC platforms: NVIDIA P100-DGX-1, V100-DGX-1, DGX-2, OLCF's SummitDev and Summit supercomputers, as well as an SLI-linked system with two NVIDIA Turing RTX-2080 GPUs. Based on the empirical evaluation, we have observed four new types of GPU communication network NUMA effects: three are triggered by NVLink's topology, connectivity and routing, while one is caused by PCIe chipset design issue. These observations indicate that, for an application running in a multi-GPU node, choosing the right GPU combination can impose considerable impact on GPU communication efficiency, as well as the application's overall performance. Our evaluation can be leveraged in building practical multi-GPU performance models, which are vital for GPU task allocation, scheduling and migration in a shared environment (e.g., AI cloud and HPC centers), as well as communication-oriented performance tuning.
9.3 NVLink-C2C: A Coherent Off Package Chip-to-Chip Interconnect with 40Gbps/pin Single-ended Signaling
Yingcan Wei, Y. C. Huang, Haiming Tang, N. Sankaran, Ish Chadha, D. Dai, Olakanmi Oluwole, V. Balan, Edward Lee

2023, 35 citations (Semantic Scholar)

NVLink-C2C is the enabler for Nvidia's Grace-Hopper and Grace Superchip systems, with 900GB/s link between Grace and Hopper, or between two Grace chips. The connection provides a unified, cache-coherent memory address space that combines system and HBM GPU memories for simplified programmability. This coherent, high-bandwidth, low-power, low latency connection between CPU and GPUs is key to accelerating the most complex AI and HPC workloads.
The NVLink-Network Switch: NVIDIA’s Switch Chip for High Communication-Bandwidth SuperPODs
A. Ishii, Ryan Wells

2022, 28 citations (Semantic Scholar)
Efficient Multi-Path NVLink/PCIe-Aware UCX based Collective Communication for Deep Learning
Y. H. Temuçin, A. Sojoodi, Pedram Alizadeh, A. Afsahi

2021, 14 citations (Semantic Scholar)

High-performance communication for very large messages on modern multi-GPU nodes has become increasingly important for Deep Learning workloads. These computing nodes are equipped with state-of-the-art interconnects, such as Nvidia's NVLink and PCIe, to facilitate communications between GPUs, and GPUs with the host processors. In this paper, we take on the challenge to design efficient intra-socket GPU-to-GPU communication using multiple NVLink channels at the UCX and MPI levels, and then utilise it to design an intra-node hierarchical NVLink/PCIe-aware GPU based MPI_Allreduce to enhance Horovod + TensorFlow with different models. UCX only utilises a small portion of the available NVLink bandwidth for intra-socket GPU-to-GPU communication. We propose a novel data transfer mechanism that stripes the message across multiple intra-socket communication channels and multiple memory regions using multiple GPU streams to utilise all available NVLink paths. Our approach achieves 1.69x and 1.84x higher bandwidth for UCX and Open MPI + UCX, respectively. We observe similar bandwidth improvements for large messages for MPI point-to-point communication when compared to other MPI implementations as they are also limited by data transfers by a single path. We then propose a 3-stage hierarchical, pipelined MPI_Allreduce design that incorporates the new multi-path NVLink data transfer mechanism for intra-socket communications in the first and third stages of the collective, and PCIe and X-bus channels for inter-socket GPU communication in the second stage with minimal interference. For large messages, our proposed algorithm achieves a high speedup when compared to Spectrum MPI, Open MPI + UCX, Open MPI + HPC-X, MVAPICH2-GDR, and NCCL. We also observe significant speedup for the proposed MPI_Allreduce for Horovod with TensorFlow with a variety of Deep Learning models.
Towards Memory Disaggregation via NVLink C2C: Benchmarking CPU-Requested GPU Memory Access
Felix Werner, Marcel Weisgut, T. Rabl

2025, 8 citations (Semantic Scholar)

Memory disaggregation decouples compute and memory resources, enabling efficient use of resources. Several interconnect technologies provide cache-coherent access to remote memory regions, which eases the use of disaggregated memory. Recent NVIDIA-based systems use the NVLink C2C interconnect, which provides cache-coherent memory access between CPUs and GPUs and their memory. While GPUs and NVLink are widely used to accelerate complex workloads, NVLink’s viability for connecting memory-expansion devices to a CPU remains unexplored. In this work, we quantify the characteristics of NVIDIA’s Grace CPU when accessing GPU memory via NVLink to assess NVLink’s viability for memory expansion. We benchmark throughput and latency for memory accesses on an NVIDIA Grace-Hopper system. We evaluate memory expansion when the CPU accesses both CPU and GPU memory and quantify the performance of database index operations with data stored in GPU memory. Our experiments show a throughput of up to 168 GB/s and access latencies between about 800 ns and 1000 ns.

Mentioned Articles

16 articles

External Mentions

4 mentions

Hacker News: 25L Portable NV-linked Dual 3090 LLM Rig
▲ 139, tensorlibb, September 19, 2025
Hacker News: Show HN: We made glhf.chat – run almost any open-source LLM, including 405B
▲ 161, reissbaker, July 24, 2024
Hacker News: Nvidia to Share New Details on Grace CPU, Hopper GPU, NVLink Switch, Jetson Orin
▲ 100, bcaulfield, August 19, 2022
Hacker News: NVIDIA Develops NVLink Switch: NVSwitch, 18 Ports For DGX-2
▲ 118, jsheard, March 27, 2018

Mentioned Articles

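xAI, owner of the world's largest fleet of 550,000 GPUs, turns out to be able to use only the equivalent of 60,000

NVIDIA's proof of an irreversible shift to "agentic AI" and the completion of infrastructure monopolization by big capital: the real structural change revealed by its FY2026 Q4 earnings

NVIDIA invests an additional $2 billion in CoreWeave: a strategic move toward standalone supply of the new "Vera" CPU and the construction of a "5-gigawatt" AI fortress

Microsoft "Maia 200" deep dive: a strategic turning point aimed at breaking away from NVIDIA dependence and dramatically cutting "inference costs"

The "inconvenient truth" about cloud GPUs: performance differences among AWS, GCP, and Azure, and the iron rules of monitoring, learned from operating 20,000 GPUs

AI semiconductor power map set for upheaval: Samsung and AMD team up around OpenAI, and the full picture of the HBM4 strategy challenging the NVIDIA and SK hynix stronghold

PCI Express 8.0 specification work begins: pushing physical limits toward up to 1 TB/s, with a possible future move to optical connectors

NVIDIA reportedly giving serious consideration to CoWoP packaging for its next-generation Rubin GPUs: a deep look at a new foundational technology for the AI era

NVIDIA to launch a low-cost Blackwell AI chip, the "B40," for China under US restrictions? Less than half the price of the H20, with GDDR7 and considerably restrained performance

NVIDIA plans to invest hundreds of billions of dollars in US-made chips

Nine companies including AMD and Intel formally launch a consortium to standardize "UALink," an interconnect standard for next-generation AI

Meta reports that half of the failures during Llama 3 training were caused by frequent NVIDIA H100 GPU faults

AMD, Intel, and others form an industry group to develop the "UALink" standard to counter NVIDIA's interconnect technology

NVIDIA announces that adoption of the Grace Hopper platform is advancing across nine AI supercomputers

Legendary chip designer Jim Keller claims NVIDIA could have dramatically cut Blackwell's development costs by adopting Ethernet

NVIDIA announces its next-generation AI GPU architecture "Blackwell" and the new "B200" GPU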