HelloCuda Series, Chapter 1: CUDA Overview
The CUDA programming model has two main components: a host (the CPU) and a device (the GPU). The host executes the sequential parts of the code and manages overall application logic, while the device executes the parallel parts, harnessing the GPU's massive computational power.
Basic Example
#include <cstdio>

// Kernel: runs on the GPU; each thread prints its own thread index.
__global__ void kernel() {
    int tid = threadIdx.x;
    printf("Thread %d: Hello from GPU!\n", tid);
}

int main() {
    kernel<<<1, 3>>>();        // launch 1 block of 3 threads
    cudaDeviceSynchronize();   // wait for the GPU to finish before main returns
    return 0;
}
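Compiled with nvcc and run on a machine with a CUDA-capable GPU, this should print the following (the ordering of the lines is not guaranteed):

```
Thread 0: Hello from GPU!
Thread 1: Hello from GPU!
Thread 2: Hello from GPU!
```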
NVIDIA GPUs are composed of Streaming Multiprocessors (SMs). Each SM can manage many threads concurrently, making GPUs exceptionally well suited to data-parallel computations where the same operation is applied to many data items.
A typical CUDA-enabled GPU contains many SMs, each of which can execute thousands of threads. These threads are organized into blocks, which themselves are organized into grids. Each block runs on a single SM, allowing threads within the block to share data through fast shared memory and synchronize their execution.
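As a minimal sketch of how this hierarchy is used in practice (the vectorAdd kernel and its launch parameters below are illustrative, not from the original post), each thread derives a unique global index from its block and thread coordinates and processes one element:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative vector-add kernel: each thread handles one element,
// identified by its position in the grid/block hierarchy.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard against extra threads
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));  // unified memory, for brevity
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int blockSize = 256;                              // threads per block
    int gridSize  = (n + blockSize - 1) / blockSize;  // blocks per grid
    vectorAdd<<<gridSize, blockSize>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```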
Imagine an NVIDIA GPU as a large factory that mass-produces toys (i.e., processes data):
GPU Component | Factory Analogy |
---|---|
Streaming Multiprocessor (SM) | An assembly workshop in the factory; each workshop can assemble many toys at the same time (processes threads in parallel). |
Thread | A worker in a workshop; each worker assembles one toy (processes one data item). |
Block | A work group; workers in the same group stay in the same workshop (SM), sharing tools (shared memory) and coordinating with each other. |
Grid | The entire production order, made up of many work groups (blocks) that are distributed across workshops (SMs) and executed in parallel. |
Warp | The hardware scheduling unit (a fixed group of 32 threads); the assembly line inside a work group (always padded to 32 workers, and the actual unit of execution). |
dim3 blockSize(128);  // one block with 128 threads
// The GPU hardware automatically splits this into:
// 128 threads / 32 threads per warp = 4 warps
- Avoid warp divergence (branch divergence); a warp-aligned alternative is sketched after the snippet below:
if (threadIdx.x < 16) {
    // the first 16 threads of the warp take this path
} else {
    // the other 16 threads take this path → the warp executes both branches serially!
}
// Optimization: make the whole warp take the same branch
// (e.g., branch on threadIdx.x / 32 rather than comparing threadIdx.x directly)
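A minimal sketch of the warp-aligned idea (the kernel name warpAlignedBranch and the arithmetic it performs are illustrative, not from the original post): by branching on the warp index rather than on a thread's position within a warp, every thread of a warp follows the same path, so neither branch is serialized inside a warp.

```cpp
__global__ void warpAlignedBranch(float* data) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / 32;  // warp index within the block

    if (warp % 2 == 0) {
        // every thread of an even-numbered warp takes this path together
        data[tid] *= 2.0f;
    } else {
        // every thread of an odd-numbered warp takes this path together
        data[tid] += 1.0f;
    }
}
```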
CUDA Memory Types
Memory Type | Description | Access Scope | Latency | Caching |
---|---|---|---|---|
Global Memory | Accessible by all threads; high latency. | All threads in the application | High | Cached in L2 (and in L1 on recent architectures) |
Shared Memory | On-chip memory shared among threads in the same block; much faster than global memory. | Threads within the same block | Low | N/A |
Registers | Each thread has its private registers; fastest but limited memory. | Private to each thread | Very Low | N/A |
Constant Memory | Cached memory optimized for read-only, uniform access across threads. | All threads, optimized for uniform access | Low (if hit) | Cached |
Texture Memory | Cached memory optimized for spatial locality; suitable for image and signal processing. | All threads, optimized for spatial access | Low (if hit) | Cached |
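To make the shared-memory row concrete, here is a minimal sketch of a block-level sum using __shared__ memory and __syncthreads() (the kernel name blockSum and the fixed 256-thread block size are assumptions for illustration; the launch must use 256 threads per block for the tile size and reduction loop to be correct):

```cpp
__global__ void blockSum(const float* in, float* blockSums) {
    __shared__ float tile[256];  // on-chip memory, visible to the whole block

    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];  // stage data in shared memory
    __syncthreads();                                // wait until every thread has written

    // tree reduction within the block (assumes blockDim.x is a power of two)
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0) blockSums[blockIdx.x] = tile[0];  // one partial sum per block
}
```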
Extensive Ecosystem
Tool/Library | Purpose | Key Features |
---|---|---|
nvcc | NVIDIA CUDA Compiler | Compiles CUDA C/C++ code to PTX and machine code, handles CPU/GPU code separation |
cuda-gdb | CUDA Debugger | Debug GPU kernels with breakpoints, memory inspection, and thread-level analysis |
nvprof (legacy) | Command-line profiler | Collects GPU metrics (kernel timing, memory ops) with low overhead |
Nsight Systems | System-wide performance analyzer | Visualizes GPU/CPU utilization, memory transfers, and kernel dependencies |
cuBLAS | CUDA Basic Linear Algebra Subprograms | Highly optimized matrix operations (GEMM, etc.); typically much faster than CPU BLAS on large matrices |
cuDNN | CUDA Deep Neural Network library | Accelerated DNN primitives (convolution, LSTM) for frameworks like PyTorch/TensorFlow |
NVIDIA Nsight Compute | Kernel-level profiler | Detailed analysis of warp execution, stall reasons, and memory efficiency |
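As a quick illustration of how one of these libraries is called (a minimal sketch; the matrix size and variable names are my own, not from the post), a single-precision matrix multiply with cuBLAS looks roughly like this; it would be compiled with something like nvcc gemm_example.cu -lcublas:

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 4;  // small n x n matrices, just for illustration
    float *A, *B, *C;
    cudaMallocManaged(&A, n * n * sizeof(float));
    cudaMallocManaged(&B, n * n * sizeof(float));
    cudaMallocManaged(&C, n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C  (cuBLAS assumes column-major storage)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cudaDeviceSynchronize();
    printf("C[0] = %f\n", C[0]);  // expect 8.0 (1.0 * 2.0 summed over 4 terms)
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```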
CUDA Hello World Program
// hello_world.cu
#include <iostream>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void helloFromGPU() {
    printf("Hello World from GPU!\n");
}

int main() {
    helloFromGPU<<<1, 1>>>();

    // Check for launch errors (e.g., an invalid launch configuration)
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::cerr << "CUDA Error: " << cudaGetErrorString(err) << std::endl;
        return -1;
    }

    // Wait for the kernel to finish and check for execution errors
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::cerr << "CUDA Error: " << cudaGetErrorString(err) << std::endl;
        return -1;
    }

    std::cout << "Hello World from CPU!" << std::endl;
    return 0;
}
Compile and Run
nvcc hello_world.cu -o hello_world
./hello_world
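Assuming a CUDA-capable GPU and a matching driver/toolkit are installed, the expected output is:

```
Hello World from GPU!
Hello World from CPU!
```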
THE END