Term

CUDA

Aliases: CUDA, CUDA cores

Overview

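A platform developed by NVIDIA for using GPUs for general-purpose computing (GPGPU). It has become the de facto standard in AI development and scientific simulation, and its vast body of libraries and large community give NVIDIA a strong competitive advantage (an ecosystem moat).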
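To make the programming model behind the overview concrete, here is a minimal sketch of how a CUDA-style kernel maps threads to data. It is a CPU-side emulation in plain Python, not CUDA code: no GPU or CUDA toolkit is involved, and the names `vector_add_kernel`, `launch`, and `vector_add` are hypothetical helpers introduced only for illustration. The part that mirrors real CUDA is the indexing rule, where each thread derives a global index from its block index, the block size, and its thread index, and skips work when that index falls past the end of the data.

```python
# CPU-side emulation of the CUDA indexing scheme; no GPU or CUDA toolkit is used.
# All function names here are hypothetical and exist only for this illustration.


def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Body run by one emulated thread: it handles a single element."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < len(out):  # guard: the last block may contain spare threads
        out[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Stand-in for a kernel launch: runs every emulated thread serially."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def vector_add(a, b, block_dim=4):
    """Add two equal-length lists using the emulated grid of threads."""
    n = len(a)
    out = [0.0] * n
    grid_dim = (n + block_dim - 1) // block_dim  # ceil(n / block_dim) blocks
    launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
    return out
```

On a real GPU the two loops in `launch` would not exist: the hardware runs the threads in parallel, which is where the speedup for AI and simulation workloads comes from.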

Research Papers

5 items
  • Analyzing CUDA workloads using a detailed GPU simulator

    A. Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, T. Aamodt

    2009 · 1,687 citations · Semantic Scholar
  • Kevin: Multi-Turn RL for Generating CUDA Kernels

    Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, Silas Alberti

    2025 · 51 citations · Semantic Scholar

    Writing GPU kernels is a challenging task and critical for AI systems' efficiency. It is also highly iterative: domain experts write code and improve performance through execution feedback. Moreover, it presents verifiable rewards like correctness and speedup, making it a natural environment to apply Reinforcement Learning (RL). To explicitly incorporate the iterative nature of this process into training, we develop a flexible multi-turn RL recipe that addresses unique challenges encountered in real-world settings, such as learning from long trajectories and effective reward attribution across turns. We present Kevin - K(ernel D)evin, the first model trained with multi-turn RL for CUDA kernel generation and optimization. In our evaluation setup, Kevin shows significant gains over its base model (QwQ-32B), improving correctness of generated kernels (in pure CUDA) from 56% to 82% and mean speedup from 0.53x to 1.10x of baseline (PyTorch Eager), and surpassing frontier models like o4-mini (0.78x). Finally, we study its behavior across test-time scaling axes: we found scaling serial refinement more beneficial than parallel sampling. In particular, when given more refinement turns, Kevin shows a higher rate of improvement.

  • CUDA: Curriculum of Data Augmentation for Long-Tailed Recognition

    Sumyeong Ahn, Jongwoo Ko, Se-Young Yun

    2023 · 51 citations · Semantic Scholar

    Class imbalance problems frequently occur in real-world tasks, and conventional deep learning algorithms are well known for performance degradation on imbalanced training datasets. To mitigate this problem, many approaches have aimed to balance among given classes by re-weighting or re-sampling training samples. These re-balancing methods increase the impact of minority classes and reduce the influence of majority classes on the output of models. However, the extracted representations may be of poor quality owing to the limited number of minority samples. To handle this restriction, several methods have been developed that increase the representations of minority samples by leveraging the features of the majority samples. Despite extensive recent studies, no deep analysis has been conducted on determining which classes to augment and how strong the augmentation should be. In this study, we first investigate the correlation between the degree of augmentation and class-wise performance, and find that the proper degree of augmentation must be allocated for each class to mitigate class imbalance problems. Motivated by this finding, we propose a simple and efficient novel curriculum, which is designed to find the appropriate per-class strength of data augmentation, called CUDA: CUrriculum of Data Augmentation for long-tailed recognition. CUDA can simply be integrated into existing long-tailed recognition methods. We present the results of experiments showing that CUDA effectively achieves better generalization performance compared to the state-of-the-art method on various imbalanced datasets such as CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018.

  • CUDA Quantum: The Platform for Integrated Quantum-Classical Computing

    Jin-Sung Kim, Alex McCaskey, Bettina Heim, Manish Modani, Sam Stanwyck, Timothy B. Costa

    2023 · 43 citations · Semantic Scholar

    A critical challenge to making quantum computers work in practice is effectively combining them with classical computing resources. From the classical side of hybrid algorithms and integrated application workflows to decoding syndromes for quantum error correction, tightly coupled high performance classical computing will be important for many of the functions required to realize useful quantum computing. A key tool for enabling research and application development is a programming model and software toolchain which allow researchers to straightforwardly co-program classical and quantum computers and leverage the best tools available for each. NVIDIA CUDA Quantum is a single-source programming model in C++ and Python for heterogeneous quantum-classical computing. The CUDA Quantum platform provides several advantages and new capabilities that enable users to get more out of quantum processors. Here, we present CUDA Quantum and demonstrate several use cases including Variational Quantum Eigensolver (VQE) where it provides a significant (287x) performance and capability benefit over existing quantum programming.

  • CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

    Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, Chris Shum

    2025 · 41 citations · Semantic Scholar

    The exponential growth in demand for GPU computing resources has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization that employs a novel contrastive RL algorithm. CUDA-L1 achieves significant performance improvements on the CUDA optimization task: trained on A100, it delivers an average speedup of x3.12 with a median speedup of x1.42 against default baselines across all 250 CUDA kernels of KernelBench, with peak speedups reaching x120. In addition to the default baseline provided by KernelBench, CUDA-L1 demonstrates x2.77 over Torch Compile, x2.88 over Torch Compile with reduce overhead, x2.81 over CUDA Graph implementations, and x7.72 over cuDNN libraries. Furthermore, the model also demonstrates portability across different GPU architectures. Beyond these benchmark results, CUDA-L1 demonstrates several properties: it 1) discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) uncovers fundamental principles of CUDA optimization, such as the multiplicative nature of optimizations; 3) identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that actually harm performance. These capabilities demonstrate that RL can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources. Project: deepreinforce-ai.github.io/cudal1_blog

Mentioned Articles

20 items

External Mentions

6 items