HelloCuda Series, Chapter 2: CUDA Architecture
A fundamental building block of a GPU is the Streaming Multiprocessor (SM). Each SM contains multiple CUDA cores, also referred to as streaming processors (SPs), which are capable of executing instructions.
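To see these hardware details on your own machine, the CUDA runtime exposes them through cudaGetDeviceProperties. A minimal sketch (querying device 0; the printed fields are only a small subset of cudaDeviceProp):
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                            // query properties of device 0
    printf("GPU: %s\n", prop.name);
    printf("Streaming Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}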
- threadIdx: Each thread within a block has a unique identifier, accessible via the built-in threadIdx variable.
int idx = threadIdx.x;
- blockIdx: Each block within a grid has a unique identifier, accessible via the built-in blockIdx variable. Each block can contain up to 1024 threads, though the exact limit depends on the GPU's compute capability. Combined with blockDim (the number of threads per block), blockIdx lets each thread compute a globally unique index:
int tid = threadIdx.x + blockIdx.x * blockDim.x;
- gridDim: The gridDim variable gives the dimensions of the grid (the number of blocks in each dimension), so a thread can determine the total size of the grid, for example to implement a grid-stride loop as in the sketch below.
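These built-ins are often combined in a grid-stride loop so that a fixed-size grid can cover an arbitrarily large array. A minimal sketch (the kernel name and parameters are illustrative):
__global__ void scaleArray(float* data, int n, float factor) {
    int stride = gridDim.x * blockDim.x;                 // total number of threads in the grid
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += stride) {
        data[i] *= factor;                               // each thread handles multiple elements
    }
}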
Kernel
A CUDA kernel is a function written in C++, augmented with CUDA-specific extensions, that runs on the GPU rather than the CPU. Generally, a kernel is defined with the __global__ qualifier, indicating that it is called from the host (CPU) and executed on the device (GPU).
// kernel_function<<<num_blocks, num_threads_per_block, shared_mem_bytes, stream>>>
__global__ void myKernel() {
    int idx = threadIdx.x + blockIdx.x * blockDim.x; // global thread index
    // perform some work with idx
}
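On the host side, the kernel is launched with the <<<...>>> execution configuration shown in the comment above; the block and grid sizes here are arbitrary choices for illustration:
int threadsPerBlock = 256;
int numBlocks = 4;
myKernel<<<numBlocks, threadsPerBlock>>>();   // launch 4 blocks of 256 threads each
cudaDeviceSynchronize();                      // wait for the kernel to finish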
Threads within a block can be synchronized with the __syncthreads() function. This function acts as a barrier at which all threads in a block must wait until every thread has reached that point in the code.
__global__ void helloFromGPU() {
    int threadId = threadIdx.x + blockIdx.x * blockDim.x;
    printf("[%d] - Hello World from GPU!\n", threadId);
    __syncthreads(); // ensure every thread in the block has reached this point
    if (threadIdx.x == 0) {
        printf("All threads in block %d have completed execution.\n", blockIdx.x);
    }
}
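A host-side driver for this kernel might look like the following sketch (two blocks of four threads, chosen arbitrarily):
int main() {
    helloFromGPU<<<2, 4>>>();   // 2 blocks x 4 threads = 8 greetings
    cudaDeviceSynchronize();    // flush device-side printf output before the program exits
    return 0;
}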
CUDA provides each thread block with a small, programmable space called shared memory. This memory is on-chip, so it is much faster than global memory, and it is visible to all threads in the same block.
__global__ void myKernel() {
    __shared__ int sharedData[256]; // shared by all threads in the block (must be declared in device code)
    int tid = threadIdx.x;
    sharedData[tid] = tid * tid;    // each thread stores its own square in shared memory
    __syncthreads();                // make sure every thread has finished writing
    // the data in shared memory can now be used for further computation
}
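As one example of such further computation, shared memory is commonly used for a block-level reduction. The sketch below sums the squares written above; it assumes the block size is a power of two no larger than 256, and blockSums is a device array with one slot per block:
__global__ void squareSum(int* blockSums) {
    __shared__ int sharedData[256];
    int tid = threadIdx.x;
    sharedData[tid] = tid * tid;
    __syncthreads();                                   // all writes to shared memory are now visible

    // tree reduction: halve the number of active threads at each step
    for (int offset = blockDim.x / 2; offset > 0; offset /= 2) {
        if (tid < offset) {
            sharedData[tid] += sharedData[tid + offset];
        }
        __syncthreads();                               // wait before the next step reads the partial sums
    }
    if (tid == 0) {
        blockSums[blockIdx.x] = sharedData[0];         // one partial sum per block
    }
}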
Best Practices for Designing CUDA Kernels
Memory Access Patterns
- Ensure coalesced memory access to maximize memory bandwidth. (This means that consecutive threads in a warp should access consecutive memory locations.)
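For illustration, a minimal sketch contrasting a coalesced access pattern with a strided one (the kernel names are made up):
// Coalesced: thread i touches element i, so neighboring threads touch neighboring addresses
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i touches element i * stride, so accesses are scattered and poorly coalesced
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i * stride < n) out[i * stride] = in[i * stride];
}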
Shared Memory Usage
- Avoid bank conflicts. (Bank conflicts occur when multiple threads in a warp access different addresses that map to the same shared-memory bank, causing the accesses to be serialized; see the sketch below.)
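A common mitigation is to pad shared-memory arrays so that threads in a warp fall into different banks. The sketch below shows the usual padding trick for a tiled transpose; it assumes a square matrix whose side is a multiple of 32 and a 32x32 thread block:
#define TILE 32

__global__ void transposeTile(const float* in, float* out, int width) {
    __shared__ float tile[TILE][TILE + 1];   // "+1" padding: column reads fall into different banks

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // load a 32x32 tile
    __syncthreads();

    // reading tile[threadIdx.x][threadIdx.y] is a column access,
    // which would cause bank conflicts without the padding
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}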
Thread Organization
- Use appropriate block and grid sizes to maximize occupancy.
- Occupancy: Consider the number of threads per block and the number of blocks per grid to ensure that the GPU is fully utilized. A common pattern is to round the grid size up so that it covers the whole array:
dim3 threadsPerBlock(256);
dim3 numBlocks((arraySize + threadsPerBlock.x - 1) / threadsPerBlock.x); // round up
- Avoid branch divergence. (Threads in the same warp that take different control-flow paths are executed serially, so keep divergent branches to a minimum.)
- Use asynchronous memory transfers (cudaMemcpyAsync) together with streams to overlap data transfer with computation, as in the sketch after this list.
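A sketch of how cudaMemcpyAsync and a stream can overlap transfers with kernel execution; d_data, h_data, bytes, and the launch configuration are assumed to be set up elsewhere, and h_data should be pinned (cudaMallocHost) for the copy to actually overlap:
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream); // enqueue host -> device copy
myKernel<<<numBlocks, threadsPerBlock, 0, stream>>>();                  // kernel runs in the same stream
cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream); // enqueue device -> host copy

cudaStreamSynchronize(stream);   // wait for copy + kernel + copy to finish
cudaStreamDestroy(stream);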
Memory Management
- Allocate and Free Memory
cudaMalloc((void**)&d_array, size * sizeof(float)); // allocate device memory
cudaFree(d_array); // free device memory
- Copy Data Between Host and Device
cudaMemcpy(d_array, h_array, size * sizeof(float), cudaMemcpyHostToDevice); // copy from host to device
cudaMemcpy(h_array, d_array, size * sizeof(float), cudaMemcpyDeviceToHost); // copy from device to host
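Put together, a typical host-side flow looks like the following sketch; h_array is assumed to be a host buffer of size floats, and processArray is a hypothetical kernel that operates on the device array:
float* d_array = nullptr;
cudaMalloc((void**)&d_array, size * sizeof(float));                          // allocate device memory
cudaMemcpy(d_array, h_array, size * sizeof(float), cudaMemcpyHostToDevice);  // host -> device

int threadsPerBlock = 256;
int numBlocks = (size + threadsPerBlock - 1) / threadsPerBlock;
processArray<<<numBlocks, threadsPerBlock>>>(d_array, size);                 // hypothetical kernel

cudaMemcpy(h_array, d_array, size * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host
cudaFree(d_array);                                                           // free device memory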
- Memory declaration qualifiers
__device__ int d_var;    // device variable (global memory)
__constant__ int c_var;  // constant-memory variable
__managed__ int m_var;   // managed variable (unified memory)
// surface declarations provide surface memory (used for image processing)

__global__ void kernel() {
    __shared__ int s_var;  // shared-memory variable; __shared__ must be declared inside device code
    // use these variables in device code (illustrative only: all threads write the same variable)
    d_var = threadIdx.x;
    s_var = threadIdx.x;
}
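Host code cannot take the address of __device__ or __constant__ variables directly; instead, the runtime provides symbol-based copies. A short sketch (the value is arbitrary):
int value = 42;
cudaMemcpyToSymbol(c_var, &value, sizeof(int));   // write the constant-memory variable from the host
cudaMemcpyToSymbol(d_var, &value, sizeof(int));   // __device__ variables can be set the same way
cudaMemcpyFromSymbol(&value, d_var, sizeof(int)); // read it back after a kernel has run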
- Error Handling
cudaError_t err = cudaGetLastError(); // retrieve the last error
if (err != cudaSuccess) {
    printf("CUDA Error: %s\n", cudaGetErrorString(err));
}
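A common convention is to wrap every runtime call in a checking macro. This is a widely used pattern rather than part of the CUDA API itself; a minimal sketch:
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA Error %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage:
CUDA_CHECK(cudaMalloc((void**)&d_array, size * sizeof(float)));
CUDA_CHECK(cudaMemcpy(d_array, h_array, size * sizeof(float), cudaMemcpyHostToDevice));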
THE END