
HelloCuda Series, Chapter 2: CUDA Architecture

A fundamental building block of a GPU is the Streaming Multiprocessor (SM). Each SM contains multiple CUDA cores, also referred to as streaming processors (SPs), which are capable of executing instructions.

  • threadIdx Each thread within a block has a unique identifier, accessible via the built-in threadIdx variable.
int idx = threadIdx.x;
  • blockIdx Each block within a grid has a unique identifier, accessible via the built-in blockIdx variable. Each block can contain up to 1024 threads, with the exact limit depending on the compute capability of the GPU. Combined with blockDim, which holds the number of threads per block, a thread can compute its global index:
int tid = threadIdx.x + blockIdx.x * blockDim.x;
  • gridDim The gridDim variable provides the dimensions of the grid, i.e. the number of blocks in each dimension, so a thread can work out its position within the entire grid (see the sketch after this list).
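
Putting the built-in variables together: a thread's global index comes from threadIdx, blockIdx, and blockDim, while gridDim gives the total grid size for a grid-stride loop when the data is larger than the grid. A minimal sketch, assuming a hypothetical kernel scaleArray that scales an array of n floats:

__global__ void scaleArray(float* data, int n, float factor) {
    // global index of this thread across the whole grid
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // grid-stride loop: advance by the total number of threads in the grid
    int stride = blockDim.x * gridDim.x;
    for (int i = idx; i < n; i += stride) {
        data[i] *= factor;
    }
}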

Kernel

A CUDA kernel is a function written in C++, augmented with CUDA-specific extensions, that runs on the GPU rather than the CPU. Generally, a kernel is defined with the __global__ qualifier, indicating that it can be called from the host (CPU) and executed on the device (GPU).

// kernel_function<<<num_blocks, num_threads_per_block, shared_mem_bytes, stream>>>

__global__ void myKernel() {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // do some work here
}
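
On the host side, the kernel is launched with the execution configuration shown in the comment above. A minimal sketch, assuming an illustrative problem size numElements and 256 threads per block:

int numElements = 1 << 20;                                              // illustrative problem size
int threadsPerBlock = 256;                                              // threads per block
int numBlocks = (numElements + threadsPerBlock - 1) / threadsPerBlock;  // round up

myKernel<<<numBlocks, threadsPerBlock>>>();  // launch on the default stream
cudaDeviceSynchronize();                     // wait for the kernel to finish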

Threads within a block can synchronize using the __syncthreads() function. This function acts as a barrier at which all threads in a block must wait until every thread reaches that point in the code.

__global__ void helloFromGPU() {
   int threadId = threadIdx.x + blockIdx.x * blockDim.x;
   
   printf("【%d】 - Hello World from GPU!\n", threadId); 

   __syncthreads(); // wait until every thread in the block has reached this point

   if (threadId == 0) {
      printf("All threads in block %d have completed execution.\n", blockIdx.x);
   }
   
} 
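
Note that __syncthreads() in this example is placed outside any conditional: every thread in the block must reach the barrier, and calling it inside a branch that only some threads take leads to undefined behavior.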

CUDA provides each thread block with a small, programmable space called shared memory. This memory is on-chip, so it is much faster than global memory and is shared by all threads in the block.

__global__ void myKernel() {
    __shared__ int sharedData[256]; // memory shared by all threads in the block (declared inside the kernel)
    int tid = threadIdx.x;
    sharedData[tid] = tid * tid;    // each thread stores its own square in shared memory
    __syncthreads();                // make sure every thread has finished writing

    // the data in shared memory can now be used for further computation
}
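
As a slightly larger sketch of the same pattern, shared memory is commonly used for a block-level reduction. The kernel below is illustrative (the names blockSum, input, and blockSums are not from the original text) and assumes it is launched with 256 threads per block:

__global__ void blockSum(const int* input, int* blockSums) {
    __shared__ int partial[256];                          // one slot per thread in the block
    int tid = threadIdx.x;

    partial[tid] = input[blockIdx.x * blockDim.x + tid];  // each thread loads one element
    __syncthreads();                                      // wait until every element is loaded

    // tree reduction inside the block
    for (int offset = blockDim.x / 2; offset > 0; offset /= 2) {
        if (tid < offset) {
            partial[tid] += partial[tid + offset];
        }
        __syncthreads();                                  // wait for this reduction step
    }

    if (tid == 0) {
        blockSums[blockIdx.x] = partial[0];               // thread 0 writes the block's sum
    }
}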

Best Practices for Designing CUDA Kernels

  • Memory Access Patterns

    • Ensure coalesced memory access to maximize memory bandwidth. (Consecutive threads in a warp should access consecutive memory locations.)
    • Shared Memory Usage: stage data that is reused within a block in shared memory.
    • Avoid bank conflicts. (Bank conflicts occur when multiple threads in a warp access different addresses in the same memory bank, causing the accesses to be serialized.)
  • Thread Organization

    • Use appropriate block and grid sizes to maximize occupancy.
    • Occupancy: consider the number of threads per block and the number of blocks per grid to ensure that the GPU is fully utilized, e.g. rounding the block count up:
    dim3 numBlocks((arraySize + threadsPerBlock.x - 1) / threadsPerBlock.x);

    • Avoid branch divergence. (Threads in the same warp that take different branches execute serially.)
    • Use asynchronous memory transfers (cudaMemcpyAsync) to overlap data movement with computation; see the sketch after this list.
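
A minimal sketch that combines the sizing formula above with an asynchronous transfer on a stream. It assumes d_array was allocated with cudaMalloc, h_array is pinned host memory (cudaMallocHost), and myKernel is the kernel from earlier; error checking is omitted:

int arraySize = 1 << 20;
dim3 threadsPerBlock(256);
dim3 numBlocks((arraySize + threadsPerBlock.x - 1) / threadsPerBlock.x);  // round up

cudaStream_t stream;
cudaStreamCreate(&stream);

// asynchronous copy: overlaps with other work as long as the host buffer is pinned
cudaMemcpyAsync(d_array, h_array, arraySize * sizeof(float),
                cudaMemcpyHostToDevice, stream);

// launching on the same stream guarantees the kernel runs after the copy completes
myKernel<<<numBlocks, threadsPerBlock, 0, stream>>>();

cudaStreamSynchronize(stream);  // wait for both the copy and the kernel
cudaStreamDestroy(stream);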

Memory Management

  • Allocate and Free Memory
cudaMalloc((void**)&d_array, size * sizeof(float)); // allocate device memory
cudaFree(d_array); // free device memory
  • Copy Data Between Host and Device
cudaMemcpy(d_array, h_array, size * sizeof(float), cudaMemcpyHostToDevice); // copy from host to device
cudaMemcpy(h_array, d_array, size * sizeof(float), cudaMemcpyDeviceToHost); // copy from device to host
  • Memory Declarations
__device__ int d_var; // device variable (global memory)
__shared__ int s_var; // shared memory variable (only valid inside a kernel or device function)
__constant__ int c_var; // constant memory variable
__managed__ int m_var; // managed memory variable (unified memory)

surface // surface memory (used for image processing)

__global__ void kernel() {
    __shared__ int s_var;   // shared variables are declared inside device code
    // use the declared variables in device code
    d_var = threadIdx.x;    // each thread writes its ID to the device variable
    s_var = threadIdx.x;    // each thread writes its ID to shared memory
}
  • Error Handling
cudaError_t err = cudaGetLastError(); // retrieve the last error
if (err != cudaSuccess) {
    printf("CUDA Error: %s\n", cudaGetErrorString(err));
}
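
Putting these calls together, a minimal end-to-end sketch allocates device memory, copies data in, launches a kernel, checks for errors, copies the result back, and frees the memory. The kernel incrementKernel and the sizes are illustrative:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void incrementKernel(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        data[idx] += 1.0f;  // each thread increments one element
    }
}

int main() {
    const int size = 1024;
    float h_array[size];
    for (int i = 0; i < size; ++i) h_array[i] = static_cast<float>(i);

    float* d_array = nullptr;
    cudaMalloc((void**)&d_array, size * sizeof(float));                         // allocate device memory
    cudaMemcpy(d_array, h_array, size * sizeof(float), cudaMemcpyHostToDevice); // host -> device

    int threadsPerBlock = 256;
    int numBlocks = (size + threadsPerBlock - 1) / threadsPerBlock;
    incrementKernel<<<numBlocks, threadsPerBlock>>>(d_array, size);

    cudaError_t err = cudaGetLastError();                                       // check the launch
    if (err != cudaSuccess) {
        printf("CUDA Error: %s\n", cudaGetErrorString(err));
    }

    cudaMemcpy(h_array, d_array, size * sizeof(float), cudaMemcpyDeviceToHost); // device -> host
    cudaFree(d_array);                                                          // free device memory

    printf("h_array[0] = %f, h_array[%d] = %f\n", h_array[0], size - 1, h_array[size - 1]);
    return 0;
}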

THE END