HelloCuda Series, Chapter 3: CUDA Parallel Programming

The fundamental concept behind parallel programming models is the decomposition of computational tasks into sub-tasks that can be executed concurrently. This decomposition is generally classified into three primary types:

  • task parallelism,
  • data parallelism, and
  • domain decomposition.

In CUDA, task parallelism is typically implemented by launching multiple kernels, each executing a distinct task. Data parallelism, in contrast, is achieved by dividing the data into chunks that are processed in parallel by different threads, as in the sketch below. Domain decomposition is a strategy where a computational domain is divided into smaller regions, each processed independently.
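A minimal data-parallel sketch, assuming unified memory is available via cudaMallocManaged (the kernel name vecAdd and the sizes are illustrative): the data is split one element per thread, and each thread adds a single pair of elements.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Data parallelism: each thread processes one element of the input arrays.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard threads past the end of the data
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);    // unified memory keeps the example short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);

    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;   // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);     // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```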

SIMT

Single Instruction, Multiple Threads (SIMT) is a cornerstone of CUDA's execution model, fundamentally shaping how parallelism is managed and leveraged on NVIDIA GPUs. It is closely related to, but distinct from, SIMD (Single Instruction, Multiple Data).

SIMT (Single Instruction, Multiple Threads) is the core execution model of the NVIDIA GPU architecture. It combines the strengths of SIMD (Single Instruction, Multiple Data) and multithreading, achieving efficient parallel computation through the Streaming Multiprocessor (SM).
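As a sketch of how SIMT groups threads (the kernel name warpInfo and the launch configuration are illustrative assumptions), the code below derives each thread's warp index and lane index from threadIdx.x using the built-in warpSize (32 on current NVIDIA GPUs); only lane 0 of each warp prints, so each output line corresponds to one warp scheduled on an SM.

```cuda
#include <cstdio>

// Threads in a block are grouped into warps of warpSize (32) threads
// that execute the same instruction in lockstep on an SM.
__global__ void warpInfo()
{
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId   = threadIdx.x / warpSize;   // warp index within the block
    int laneId   = threadIdx.x % warpSize;   // lane index within the warp
    if (laneId == 0)                         // one printf per warp
        printf("block %d, warp %d starts at global thread %d\n",
               blockIdx.x, warpId, globalId);
}

int main()
{
    warpInfo<<<2, 128>>>();                  // 2 blocks x 4 warps each
    cudaDeviceSynchronize();
    return 0;
}
```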

Warp divergence occurs when threads within the same warp take different execution paths, typically due to conditional statements (e.g., if-else); the warp then executes each branch serially, with the threads not on that branch masked off, which reduces effective parallelism.
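A sketch of divergence (the kernel names are illustrative): in the first kernel, even and odd lanes of the same warp take different branches, so the warp runs both branches one after the other; in the second, the condition depends only on the warp index, so every thread in a warp follows the same path and no divergence occurs.

```cuda
// Divergent: even and odd lanes of the same warp take different branches,
// so the warp executes both branches serially with inactive lanes masked off.
__global__ void divergentKernel(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        out[i] = i * 2;
    else
        out[i] = i * 3;
}

// Uniform: the condition depends only on the warp index, so all 32 lanes
// of a warp take the same path and no divergence occurs.
__global__ void uniformKernel(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / warpSize) % 2 == 0)
        out[i] = i * 2;
    else
        out[i] = i * 3;
}
```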

Domain decomposition is a critical technique in parallel computing that refers to dividing a large computational problem into smaller subproblems that can be solved concurrently on multiple processors or threads.
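As a sketch of domain decomposition in CUDA terms (the kernel name scaleTile and the 16x16 tile size are illustrative assumptions): a 2D domain of W x H cells is split into tiles, one tile per thread block, and each thread updates a single cell of its block's tile.

```cuda
// Domain decomposition: the W x H domain is split into tiles, one tile per
// thread block; each thread updates a single cell of its block's tile.
__global__ void scaleTile(float *grid, int W, int H, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < W && y < H)              // skip threads outside the domain
        grid[y * W + x] *= s;
}

// Launch: decompose the domain into 16x16 tiles.
// dim3 block(16, 16);
// dim3 tiles((W + block.x - 1) / block.x, (H + block.y - 1) / block.y);
// scaleTile<<<tiles, block>>>(d_grid, W, H, 2.0f);
```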

THE END