
HelloCuda Series, Chapter 2: CUDA Architecture

A fundamental building block of a GPU is the Streaming Multiprocessor (SM). Each SM contains multiple CUDA cores, also referred to as streaming processors (SPs), which are capable of executing instructions.

  • threadIdx Each thread within a block has a unique identifier, accessible via the built-in threadIdx variable.
int idx = threadIdx.x;
  • blockIdx Each block within a grid has a unique identifier, accessible via the built-in blockIdx variable. Each block can contain up to 1024 threads, with the exact limit depending on the compute capability of the GPU. Combined with blockDim, which holds the number of threads per block, a thread can compute its global index:
int tid = threadIdx.x + blockIdx.x * blockDim.x;
  • gridDim The gridDim variable provides the dimensions of the grid, i.e. the number of blocks in each dimension, so a thread can work out its position within the entire grid (see the sketch after this list).
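
Putting the built-in variables together: a thread's global index comes from threadIdx, blockIdx, and blockDim, while gridDim gives the total grid size for a grid-stride loop when the data is larger than the grid. A minimal sketch, assuming a hypothetical kernel scaleArray that scales an array of n floats:

__global__ void scaleArray(float* data, int n, float factor) {
    // global index of this thread across the whole grid
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // grid-stride loop: advance by the total number of threads in the grid
    int stride = blockDim.x * gridDim.x;
    for (int i = idx; i < n; i += stride) {
        data[i] *= factor;
    }
}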

Kernel

A CUDA kernel is a function written in C++, augmented with CUDA-specific extensions, that runs on the GPU rather than the CPU. Generally, a kernel is defined with the __global__ qualifier, indicating that it can be called from the host (CPU) and executed on the device (GPU).

// kernel_function<<<num_blocks, num_threads_per_block, shared_mem_bytes, stream>>>

__global__ void myKernel() {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // do some work here
}
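
On the host side, the kernel is launched with the execution configuration shown in the comment above. A minimal sketch, assuming an illustrative problem size numElements and 256 threads per block:

int numElements = 1 << 20;                                              // illustrative problem size
int threadsPerBlock = 256;                                              // threads per block
int numBlocks = (numElements + threadsPerBlock - 1) / threadsPerBlock;  // round up

myKernel<<<numBlocks, threadsPerBlock>>>();  // launch on the default stream
cudaDeviceSynchronize();                     // wait for the kernel to finish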

Threads within a block can synchronize using the __syncthreads() function. This function acts as a barrier at which all threads in a block must wait until every thread reaches that point in the code.

__global__ void helloFromGPU() {
   int threadId = threadIdx.x + blockIdx.x * blockDim.x;
   
   printf("【%d】 - Hello World from GPU!\n", threadId); 

   __syncthreads(); // wait until every thread in the block has reached this point

   if (threadId == 0) {
      printf("All threads in block %d have completed execution.\n", blockIdx.x);
   }
   
} 
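
Note that __syncthreads() in this example is placed outside any conditional: every thread in the block must reach the barrier, and calling it inside a branch that only some threads take leads to undefined behavior.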

CUDA provides each thread block with a small, programmable space called shared memory. This memory is on-chip, so it is much faster than global memory and is shared by all threads in the block.

__global__ void myKernel() {
    __shared__ int sharedData[256]; // memory shared by all threads in the block (declared inside the kernel)
    int tid = threadIdx.x;
    sharedData[tid] = tid * tid;    // each thread stores its own square in shared memory
    __syncthreads();                // make sure every thread has finished writing

    // the data in shared memory can now be used for further computation
}
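
As a slightly larger sketch of the same pattern, shared memory is commonly used for a block-level reduction. The kernel below is illustrative (the names blockSum, input, and blockSums are not from the original text) and assumes it is launched with 256 threads per block:

__global__ void blockSum(const int* input, int* blockSums) {
    __shared__ int partial[256];                          // one slot per thread in the block
    int tid = threadIdx.x;

    partial[tid] = input[blockIdx.x * blockDim.x + tid];  // each thread loads one element
    __syncthreads();                                      // wait until every element is loaded

    // tree reduction inside the block
    for (int offset = blockDim.x / 2; offset > 0; offset /= 2) {
        if (tid < offset) {
            partial[tid] += partial[tid + offset];
        }
        __syncthreads();                                  // wait for this reduction step
    }

    if (tid == 0) {
        blockSums[blockIdx.x] = partial[0];               // thread 0 writes the block's sum
    }
}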

Best Practices for Designing CUDA Kernels

  • Memory Access Patterns

    • Ensure coalesced memory access to maximize memory bandwidth. (Consecutive threads in a warp should access consecutive memory locations.)
    • Shared Memory Usage: stage data that is reused within a block in shared memory.
    • Avoid bank conflicts. (Bank conflicts occur when multiple threads in a warp access different addresses in the same memory bank, causing the accesses to be serialized.)
  • Thread Organization

    • Use appropriate block and grid sizes to maximize occupancy.
    • Occupancy: consider the number of threads per block and the number of blocks per grid to ensure that the GPU is fully utilized, e.g. rounding the block count up:
    dim3 numBlocks((arraySize + threadsPerBlock.x - 1) / threadsPerBlock.x);

    • Avoid branch divergence. (Threads in the same warp that take different branches execute serially.)
    • Use asynchronous memory transfers (cudaMemcpyAsync) to overlap data movement with computation; see the sketch after this list.
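
A minimal sketch that combines the sizing formula above with an asynchronous transfer on a stream. It assumes d_array was allocated with cudaMalloc, h_array is pinned host memory (cudaMallocHost), and myKernel is the kernel from earlier; error checking is omitted:

int arraySize = 1 << 20;
dim3 threadsPerBlock(256);
dim3 numBlocks((arraySize + threadsPerBlock.x - 1) / threadsPerBlock.x);  // round up

cudaStream_t stream;
cudaStreamCreate(&stream);

// asynchronous copy: overlaps with other work as long as the host buffer is pinned
cudaMemcpyAsync(d_array, h_array, arraySize * sizeof(float),
                cudaMemcpyHostToDevice, stream);

// launching on the same stream guarantees the kernel runs after the copy completes
myKernel<<<numBlocks, threadsPerBlock, 0, stream>>>();

cudaStreamSynchronize(stream);  // wait for both the copy and the kernel
cudaStreamDestroy(stream);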

Memory Management

  • Allocate and Free Memory
cudaMalloc((void**)&d_array, size * sizeof(float)); // allocate device memory
cudaFree(d_array); // free device memory
  • Copy Data Between Host and Device
cudaMemcpy(d_array, h_array, size * sizeof(float), cudaMemcpyHostToDevice); // copy from host to device
cudaMemcpy(h_array, d_array, size * sizeof(float), cudaMemcpyDeviceToHost); // copy from device to host
  • Memory Declarations
__device__ int d_var; // device variable (global memory)
__shared__ int s_var; // shared memory variable (only valid inside a kernel or device function)
__constant__ int c_var; // constant memory variable
__managed__ int m_var; // managed memory variable (unified memory)

surface // surface memory (used for image processing)

__global__ void kernel() {
    __shared__ int s_var;   // shared variables are declared inside device code
    // use the declared variables in device code
    d_var = threadIdx.x;    // each thread writes its ID to the device variable
    s_var = threadIdx.x;    // each thread writes its ID to shared memory
}
  • Error Handling
cudaError_t err = cudaGetLastError(); // retrieve the last error
if (err != cudaSuccess) {
    printf("CUDA Error: %s\n", cudaGetErrorString(err));
}
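
Putting these calls together, a minimal end-to-end sketch allocates device memory, copies data in, launches a kernel, checks for errors, copies the result back, and frees the memory. The kernel incrementKernel and the sizes are illustrative:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void incrementKernel(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        data[idx] += 1.0f;  // each thread increments one element
    }
}

int main() {
    const int size = 1024;
    float h_array[size];
    for (int i = 0; i < size; ++i) h_array[i] = static_cast<float>(i);

    float* d_array = nullptr;
    cudaMalloc((void**)&d_array, size * sizeof(float));                         // allocate device memory
    cudaMemcpy(d_array, h_array, size * sizeof(float), cudaMemcpyHostToDevice); // host -> device

    int threadsPerBlock = 256;
    int numBlocks = (size + threadsPerBlock - 1) / threadsPerBlock;
    incrementKernel<<<numBlocks, threadsPerBlock>>>(d_array, size);

    cudaError_t err = cudaGetLastError();                                       // check the launch
    if (err != cudaSuccess) {
        printf("CUDA Error: %s\n", cudaGetErrorString(err));
    }

    cudaMemcpy(h_array, d_array, size * sizeof(float), cudaMemcpyDeviceToHost); // device -> host
    cudaFree(d_array);                                                          // free device memory

    printf("h_array[0] = %f, h_array[%d] = %f\n", h_array[0], size - 1, h_array[size - 1]);
    return 0;
}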

THE END