HelloCuda Series, Chapter 1: CUDA Overview

The main components of the CUDA programming model include a host (the CPU) and a device (the GPU). The host is responsible for executing sequential parts of the code and managing overall application logic, while the device executes parallel parts of the application, harnessing its massive computational power.

Basic Example

#include <cstdio>   // device-side printf

__global__ void kernel() {
   int tid = threadIdx.x;
   printf("Thread %d: Hello from GPU!\n", tid);
}

int main() {
   kernel<<<1, 3>>>();  // 1 block, 3 threads
   cudaDeviceSynchronize();
   return 0;
}
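
With one block of three threads, every thread executes the same kernel body, so the program prints one line per thread. The ordering shown below is typical but not guaranteed:

Thread 0: Hello from GPU!
Thread 1: Hello from GPU!
Thread 2: Hello from GPU!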

NVIDIA GPUs are built from Streaming Multiprocessors (SMs). Each SM can manage many threads concurrently, making GPUs exceptionally suitable for data-parallel computations where the same operation is applied to many data items.
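
To see this structure on your own machine, the following small program (our own sketch, not part of the original article) uses the CUDA runtime API to query device 0 and print its SM count:

// sm_info.cu
#include <cuda_runtime.h>
#include <cstdio>

int main() {
   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, 0);        // properties of GPU 0
   printf("GPU: %s\n", prop.name);
   printf("SM count: %d\n", prop.multiProcessorCount);
   printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
   return 0;
}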

A typical CUDA-enabled GPU contains many SMs, each of which can execute thousands of threads. These threads are organized into blocks, which themselves are organized into grids. Each block runs on a single SM, allowing threads within the block to share data through fast shared memory and synchronize their execution.
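
A minimal sketch (our own example; the kernel name whoAmI and the 2-block-by-4-thread launch shape are arbitrary choices) showing how blockIdx, blockDim, and threadIdx combine into a global thread index:

// grid_layout.cu
#include <cuda_runtime.h>
#include <cstdio>

__global__ void whoAmI() {
   // Global index = offset of this block + offset of this thread within the block
   int globalId = blockIdx.x * blockDim.x + threadIdx.x;
   printf("block %d, thread %d -> global id %d\n", blockIdx.x, threadIdx.x, globalId);
}

int main() {
   dim3 grid(2);     // 2 blocks in the grid
   dim3 block(4);    // 4 threads per block
   whoAmI<<<grid, block>>>();
   cudaDeviceSynchronize();
   return 0;
}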

Think of an NVIDIA GPU as a large factory that mass-produces toys (i.e., processes data in bulk).

The factory analogy for GPU components:

GPU Component | Factory analogy
Streaming Multiprocessor (SM) | An assembly workshop; each workshop can assemble many toys at the same time (processes threads in parallel).
Thread | A worker in the workshop; each worker assembles one toy (processes one data item).
Block | A work group; workers in the same group share one workshop (SM), so they can share tools (shared memory) and communicate.
Grid | The whole production order; it contains many work groups (Blocks) that are dispatched to different workshops (SMs) and run in parallel.
Warp | The hardware scheduling unit (a fixed group of 32 threads); an assembly line inside a work group (always filled out to 32 workers, the actual execution unit).
dim3 blockSize(128);  // one block with 128 threads
// The GPU hardware automatically splits this block into warps:
// 128 threads / 32 threads per warp = 4 warps

  • Avoid warp divergence:
    if (threadIdx.x < 16) {
      // the first 16 threads of the warp take this branch
    } else {
      // the remaining 16 threads take this branch → the warp executes both branches serially!
    }

// Optimization: make the whole warp take the same branch whenever possible
// (e.g., branch on the warp index threadIdx.x / 32 instead of comparing threadIdx.x directly), as in the sketch below.
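
Below is a self-contained sketch (our own example; the kernel names and launch sizes are assumptions, not from the article) contrasting a branch that splits a warp with one that branches per warp:

// divergence.cu
#include <cuda_runtime.h>
#include <cstdio>

__global__ void divergentBranch(int *out) {
   int tid = threadIdx.x;
   // Threads 0-15 and 16-31 of the same warp take different paths,
   // so that warp executes both branches one after the other.
   if (tid < 16) out[tid] = tid * 2;
   else          out[tid] = tid * 3;
}

__global__ void warpAlignedBranch(int *out) {
   int tid = threadIdx.x;
   int warpId = tid / 32;           // identical for all 32 threads of a warp
   // Every thread of a given warp takes the same path, so no divergence occurs.
   if (warpId == 0) out[tid] = tid * 2;
   else             out[tid] = tid * 3;
}

int main() {
   int *d_out;
   cudaMalloc(&d_out, 64 * sizeof(int));
   divergentBranch<<<1, 64>>>(d_out);    // 64 threads = 2 warps
   warpAlignedBranch<<<1, 64>>>(d_out);
   cudaDeviceSynchronize();
   cudaFree(d_out);
   return 0;
}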

CUDA Memory Types

Memory Type | Description | Access Scope | Latency | Caching
Global Memory | Accessible by all threads; high latency. | All threads in the application | High | Uncached
Shared Memory | On-chip memory shared among threads in the same block; much faster than global memory. | Threads within the same block | Low | N/A
Registers | Each thread has its own private registers; fastest but very limited in capacity. | Private to each thread | Very Low | N/A
Constant Memory | Cached memory optimized for read-only, uniform access across threads. | All threads, optimized for uniform access | Low (if hit) | Cached
Texture Memory | Cached memory optimized for spatial locality; suitable for image and signal processing. | All threads, optimized for spatial access | Low (if hit) | Cached
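
To make the table concrete, here is a small sketch (our own example; the names scale, tile, and scaleAndSum are ours, not from the article) in which one kernel touches global, shared, constant, and register memory:

// memory_types.cu
#include <cuda_runtime.h>
#include <cstdio>

__constant__ float scale;                    // constant memory: read-only, uniform across threads

__global__ void scaleAndSum(const float *in, float *out, int n) {
   __shared__ float tile[256];               // shared memory: visible to all threads in this block
   int tid = threadIdx.x;
   int gid = blockIdx.x * blockDim.x + tid;

   float v = (gid < n) ? in[gid] * scale : 0.0f;   // in[] is global memory; v lives in a register
   tile[tid] = v;
   __syncthreads();                          // make every thread's shared-memory write visible

   if (tid == 0) {
      float blockSum = 0.0f;                 // register-resident accumulator
      for (int i = 0; i < blockDim.x; ++i) blockSum += tile[i];
      out[blockIdx.x] = blockSum;            // one result per block back to global memory
   }
}

int main() {
   const int n = 256;
   float h_in[n], h_out[1];
   for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

   float *d_in, *d_out;
   cudaMalloc(&d_in, n * sizeof(float));
   cudaMalloc(&d_out, sizeof(float));
   cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

   float s = 0.5f;
   cudaMemcpyToSymbol(scale, &s, sizeof(float));   // initialize constant memory from the host

   scaleAndSum<<<1, 256>>>(d_in, d_out, n);
   cudaMemcpy(h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
   printf("block sum = %f\n", h_out[0]);           // expect 128.0 (256 * 1.0 * 0.5)

   cudaFree(d_in); cudaFree(d_out);
   return 0;
}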

Extensive Ecosystem

Tool/Library | Purpose | Key Features
nvcc | NVIDIA CUDA Compiler | Compiles CUDA C/C++ code to PTX and machine code; handles CPU/GPU code separation
cuda-gdb | CUDA Debugger | Debug GPU kernels with breakpoints, memory inspection, and thread-level analysis
nvprof (legacy) | Command-line profiler | Collects GPU metrics (kernel timing, memory ops) with low overhead
Nsight Systems | System-wide performance analyzer | Visualizes GPU/CPU utilization, memory transfers, and kernel dependencies
cuBLAS | CUDA Basic Linear Algebra Subprograms | Optimized matrix ops (GEMM, etc.) with 10x+ speedup over CPU BLAS
cuDNN | CUDA Deep Neural Network library | Accelerated DNN primitives (convolution, LSTM) for frameworks like PyTorch/TensorFlow
NVIDIA Nsight Compute | Kernel-level profiler | Detailed analysis of warp execution, stall reasons, and memory efficiency
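
As an example of using one of these libraries, here is a minimal cuBLAS sketch (our own code; the file name, matrix size, and variable names are arbitrary choices, not from the article) that multiplies two small matrices with cublasSgemm:

// cublas_gemm.cu
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
   const int n = 4;                                   // multiply two n x n matrices
   std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

   float *dA, *dB, *dC;
   cudaMalloc(&dA, n * n * sizeof(float));
   cudaMalloc(&dB, n * n * sizeof(float));
   cudaMalloc(&dC, n * n * sizeof(float));
   cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
   cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

   cublasHandle_t handle;
   cublasCreate(&handle);
   const float alpha = 1.0f, beta = 0.0f;
   // C = alpha * A * B + beta * C; cuBLAS assumes column-major storage
   cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
               &alpha, dA, n, dB, n, &beta, dC, n);

   cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
   printf("C[0] = %f\n", hC[0]);                      // 8.0 for these inputs (4 terms of 1*2)

   cublasDestroy(handle);
   cudaFree(dA); cudaFree(dB); cudaFree(dC);
   return 0;
}

Link against cuBLAS when compiling: nvcc cublas_gemm.cu -lcublas -o cublas_gemm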

CUDA Hello World Program

// hello_world.cu
#include <iostream>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void helloFromGPU() { 
   printf("Hello World from GPU!\n"); 
} 

int main() { 
   helloFromGPU<<<1, 1>>>();                // launch one block with one thread
   cudaError_t err = cudaGetLastError();    // catch launch-time errors
   if (err != cudaSuccess) {
      std::cerr << "CUDA Error: " << cudaGetErrorString(err) << std::endl;
      return -1;
   }
   err = cudaDeviceSynchronize();           // wait for the kernel; catch execution errors
   if (err != cudaSuccess) { 
      std::cerr << "CUDA Error: " << cudaGetErrorString(err) << std::endl; 
      return -1; 
   } 
   std::cout << "Hello World from CPU!" << std::endl; 
   return 0; 
}

Compile and run

nvcc hello_world.cu -o hello_world
./hello_world
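
On a machine with a CUDA-capable GPU and a working driver, the program should print:

Hello World from GPU!
Hello World from CPU!

The GPU line appears first because cudaDeviceSynchronize() flushes the device-side printf buffer before the host prints its own message.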

THE END