HelloCuda Series, Chapter 1: CUDA Overview

The main components of the CUDA programming model include a host (the CPU) and a device (the GPU). The host is responsible for executing sequential parts of the code and managing overall application logic, while the device executes parallel parts of the application, harnessing its massive computational power.

Basic Example

#include <cstdio>   // device-side printf

__global__ void kernel() {
   int tid = threadIdx.x;
   printf("Thread %d: Hello from GPU!\n", tid);
}

int main() {
   kernel<<<1, 3>>>();  // 1 block, 3 threads
   cudaDeviceSynchronize();
   return 0;
}
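
With one block of three threads, every thread executes the same kernel body, so the program prints one line per thread. The ordering shown below is typical but not guaranteed:

Thread 0: Hello from GPU!
Thread 1: Hello from GPU!
Thread 2: Hello from GPU!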

NVIDIA GPUs are built from Streaming Multiprocessors (SMs). Each SM can manage many threads concurrently, making GPUs exceptionally suitable for data-parallel computations where the same operation is applied to many data items.
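
To see this structure on your own machine, the following small program (our own sketch, not part of the original article) uses the CUDA runtime API to query device 0 and print its SM count:

// sm_info.cu
#include <cuda_runtime.h>
#include <cstdio>

int main() {
   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, 0);        // properties of GPU 0
   printf("GPU: %s\n", prop.name);
   printf("SM count: %d\n", prop.multiProcessorCount);
   printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
   return 0;
}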

A typical CUDA-enabled GPU contains many SMs, each of which can execute thousands of threads. These threads are organized into blocks, which themselves are organized into grids. Each block runs on a single SM, allowing threads within the block to share data through fast shared memory and synchronize their execution.
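
A minimal sketch (our own example; the kernel name whoAmI and the 2-block-by-4-thread launch shape are arbitrary choices) showing how blockIdx, blockDim, and threadIdx combine into a global thread index:

// grid_layout.cu
#include <cuda_runtime.h>
#include <cstdio>

__global__ void whoAmI() {
   // Global index = offset of this block + offset of this thread within the block
   int globalId = blockIdx.x * blockDim.x + threadIdx.x;
   printf("block %d, thread %d -> global id %d\n", blockIdx.x, threadIdx.x, globalId);
}

int main() {
   dim3 grid(2);     // 2 blocks in the grid
   dim3 block(4);    // 4 threads per block
   whoAmI<<<grid, block>>>();
   cudaDeviceSynchronize();
   return 0;
}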

Think of an NVIDIA GPU as a large factory that mass-produces toys (i.e., processes data in bulk).

The factory analogy for GPU components:

GPU Component | Factory analogy
Streaming Multiprocessor (SM) | An assembly workshop; each workshop can assemble many toys at the same time (processes threads in parallel).
Thread | A worker in the workshop; each worker assembles one toy (processes one data item).
Block | A work group; workers in the same group share one workshop (SM), so they can share tools (shared memory) and communicate.
Grid | The whole production order; it contains many work groups (Blocks) that are dispatched to different workshops (SMs) and run in parallel.
Warp | The hardware scheduling unit (a fixed group of 32 threads); an assembly line inside a work group (always filled out to 32 workers, the actual execution unit).
dim3 blockSize(128);  // one block with 128 threads
// The GPU hardware automatically splits this block into warps:
// 128 threads / 32 threads per warp = 4 warps

  • Avoid warp divergence:
    if (threadIdx.x < 16) {
      // the first 16 threads of the warp take this branch
    } else {
      // the remaining 16 threads take this branch → the warp executes both branches serially!
    }

// Optimization: make the whole warp take the same branch whenever possible
// (e.g., branch on the warp index threadIdx.x / 32 instead of comparing threadIdx.x directly), as in the sketch below.
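
Below is a self-contained sketch (our own example; the kernel names and launch sizes are assumptions, not from the article) contrasting a branch that splits a warp with one that branches per warp:

// divergence.cu
#include <cuda_runtime.h>
#include <cstdio>

__global__ void divergentBranch(int *out) {
   int tid = threadIdx.x;
   // Threads 0-15 and 16-31 of the same warp take different paths,
   // so that warp executes both branches one after the other.
   if (tid < 16) out[tid] = tid * 2;
   else          out[tid] = tid * 3;
}

__global__ void warpAlignedBranch(int *out) {
   int tid = threadIdx.x;
   int warpId = tid / 32;           // identical for all 32 threads of a warp
   // Every thread of a given warp takes the same path, so no divergence occurs.
   if (warpId == 0) out[tid] = tid * 2;
   else             out[tid] = tid * 3;
}

int main() {
   int *d_out;
   cudaMalloc(&d_out, 64 * sizeof(int));
   divergentBranch<<<1, 64>>>(d_out);    // 64 threads = 2 warps
   warpAlignedBranch<<<1, 64>>>(d_out);
   cudaDeviceSynchronize();
   cudaFree(d_out);
   return 0;
}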

CUDA Memory Types

Memory Type | Description | Access Scope | Latency | Caching
Global Memory | Accessible by all threads; high latency. | All threads in the application | High | Uncached
Shared Memory | On-chip memory shared among threads in the same block; much faster than global memory. | Threads within the same block | Low | N/A
Registers | Each thread has its own private registers; fastest but very limited in capacity. | Private to each thread | Very Low | N/A
Constant Memory | Cached memory optimized for read-only, uniform access across threads. | All threads, optimized for uniform access | Low (if hit) | Cached
Texture Memory | Cached memory optimized for spatial locality; suitable for image and signal processing. | All threads, optimized for spatial access | Low (if hit) | Cached
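
To make the table concrete, here is a small sketch (our own example; the names scale, tile, and scaleAndSum are ours, not from the article) in which one kernel touches global, shared, constant, and register memory:

// memory_types.cu
#include <cuda_runtime.h>
#include <cstdio>

__constant__ float scale;                    // constant memory: read-only, uniform across threads

__global__ void scaleAndSum(const float *in, float *out, int n) {
   __shared__ float tile[256];               // shared memory: visible to all threads in this block
   int tid = threadIdx.x;
   int gid = blockIdx.x * blockDim.x + tid;

   float v = (gid < n) ? in[gid] * scale : 0.0f;   // in[] is global memory; v lives in a register
   tile[tid] = v;
   __syncthreads();                          // make every thread's shared-memory write visible

   if (tid == 0) {
      float blockSum = 0.0f;                 // register-resident accumulator
      for (int i = 0; i < blockDim.x; ++i) blockSum += tile[i];
      out[blockIdx.x] = blockSum;            // one result per block back to global memory
   }
}

int main() {
   const int n = 256;
   float h_in[n], h_out[1];
   for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

   float *d_in, *d_out;
   cudaMalloc(&d_in, n * sizeof(float));
   cudaMalloc(&d_out, sizeof(float));
   cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

   float s = 0.5f;
   cudaMemcpyToSymbol(scale, &s, sizeof(float));   // initialize constant memory from the host

   scaleAndSum<<<1, 256>>>(d_in, d_out, n);
   cudaMemcpy(h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
   printf("block sum = %f\n", h_out[0]);           // expect 128.0 (256 * 1.0 * 0.5)

   cudaFree(d_in); cudaFree(d_out);
   return 0;
}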

Extensive Ecosystem

Tool/Library | Purpose | Key Features
nvcc | NVIDIA CUDA Compiler | Compiles CUDA C/C++ code to PTX and machine code; handles CPU/GPU code separation
cuda-gdb | CUDA Debugger | Debug GPU kernels with breakpoints, memory inspection, and thread-level analysis
nvprof (legacy) | Command-line profiler | Collects GPU metrics (kernel timing, memory ops) with low overhead
Nsight Systems | System-wide performance analyzer | Visualizes GPU/CPU utilization, memory transfers, and kernel dependencies
cuBLAS | CUDA Basic Linear Algebra Subprograms | Optimized matrix ops (GEMM, etc.) with 10x+ speedup over CPU BLAS
cuDNN | CUDA Deep Neural Network library | Accelerated DNN primitives (convolution, LSTM) for frameworks like PyTorch/TensorFlow
NVIDIA Nsight Compute | Kernel-level profiler | Detailed analysis of warp execution, stall reasons, and memory efficiency
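
As an example of using one of these libraries, here is a minimal cuBLAS sketch (our own code; the file name, matrix size, and variable names are arbitrary choices, not from the article) that multiplies two small matrices with cublasSgemm:

// cublas_gemm.cu
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
   const int n = 4;                                   // multiply two n x n matrices
   std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

   float *dA, *dB, *dC;
   cudaMalloc(&dA, n * n * sizeof(float));
   cudaMalloc(&dB, n * n * sizeof(float));
   cudaMalloc(&dC, n * n * sizeof(float));
   cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
   cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

   cublasHandle_t handle;
   cublasCreate(&handle);
   const float alpha = 1.0f, beta = 0.0f;
   // C = alpha * A * B + beta * C; cuBLAS assumes column-major storage
   cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
               &alpha, dA, n, dB, n, &beta, dC, n);

   cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
   printf("C[0] = %f\n", hC[0]);                      // 8.0 for these inputs (4 terms of 1*2)

   cublasDestroy(handle);
   cudaFree(dA); cudaFree(dB); cudaFree(dC);
   return 0;
}

Link against cuBLAS when compiling: nvcc cublas_gemm.cu -lcublas -o cublas_gemm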

CUDA Hello World Program

// hello_world.cu
#include <iostream>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void helloFromGPU() { 
   printf("Hello World from GPU!\n"); 
} 

int main() { 
   helloFromGPU<<<1, 1>>>();                // launch one block with one thread
   cudaError_t err = cudaGetLastError();    // catch launch-time errors
   if (err != cudaSuccess) {
      std::cerr << "CUDA Error: " << cudaGetErrorString(err) << std::endl;
      return -1;
   }
   err = cudaDeviceSynchronize();           // wait for the kernel; catch execution errors
   if (err != cudaSuccess) { 
      std::cerr << "CUDA Error: " << cudaGetErrorString(err) << std::endl; 
      return -1; 
   } 
   std::cout << "Hello World from CPU!" << std::endl; 
   return 0; 
}

Compile and run

nvcc hello_world.cu -o hello_world
./hello_world
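
On a machine with a CUDA-capable GPU and a working driver, the program should print:

Hello World from GPU!
Hello World from CPU!

The GPU line appears first because cudaDeviceSynchronize() flushes the device-side printf buffer before the host prints its own message.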

THE END