
Estimating the GPU Memory Needed to Run Large Models

I came across a formula for estimating the GPU memory needed to run a large language model, looked into it a bit, and here is my summary.

The formula used to calculate the GPU memory requirement is:

M(\text{GB}) = \left( \frac{P \times B}{32/Q} \right) \times \text{Overhead}

  • P: The number of parameters in the model, usually given in millions (M) or billions (B).
  • B: The size of each parameter at full 32-bit precision, i.e. 4 bytes.
  • Q: The quantization bit width. For F16 this is 16 bits; for Q4_0 or Q4_K_M it is 4 bits. The factor 32/Q is how much quantization shrinks each parameter relative to the 4-byte full-precision size.
  • Overhead: Additional memory needed beyond the weights themselves (e.g. activations and runtime buffers), expressed as a percentage converted to a multiplier (e.g. 20% becomes 1.2).
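To make the formula easier to experiment with, here is a minimal Python sketch of it. The function name estimate_gpu_memory_gb and its arguments are made up for illustration; it simply encodes the formula above.

```python
def estimate_gpu_memory_gb(num_params: float, q_bits: int, overhead: float = 1.2) -> float:
    """Estimate the GPU memory (in GB) needed to load a model.

    num_params: total number of parameters, e.g. 13e9 for a 13B model (P).
    q_bits:     quantization bit width Q (16 for F16, 8 for Q8_0, 4 for Q4_0).
    overhead:   multiplier for memory beyond the weights, e.g. 1.2 for 20%.
    """
    bytes_per_param_fp32 = 4              # B: each full-precision parameter is 4 bytes (32 bits)
    compression = 32 / q_bits             # 32/Q: how much quantization shrinks each parameter
    weight_bytes = num_params * bytes_per_param_fp32 / compression
    return weight_bytes / 1e9 * overhead  # convert bytes to GB, then apply the overhead
```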

Example: Calculating GPU Memory for CodeLlama

Suppose we have a CodeLlama model (a large language model for generating and discussing code) with:

  • Parameters (P): 13B = 13 × 10⁹
  • Quantization (Q4_0): Q = 4 bits, so 32/Q = 8 and each parameter effectively takes 4 / 8 = 0.5 bytes
  • Overhead: 20% → 1.2 multiplier

Calculation:

  1. Base Memory (Bytes):

    13 × 10⁹ × 4 / (32 / 4) = 13 × 10⁹ × 0.5 = 6.5 × 10⁹ bytes ≈ 6.5 GB
    
  2. Adjusted Memory (GB):

    6.5 GB × 1.2 = 7.8 GB
    

Result: The CodeLlama-13B model with Q4_0 quantization requires approximately 7.8 GB of GPU memory.
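For reference, plugging other Q values into the sketch above reproduces this result and gives estimates for other quantization levels of the same 13B model; these numbers follow purely from the formula, not from measurements.

```python
# Hypothetical comparison of quantization levels for a 13B-parameter model
for name, q in [("Q4_0", 4), ("Q8_0", 8), ("F16", 16)]:
    print(f"{name}: {estimate_gpu_memory_gb(13e9, q_bits=q):.1f} GB")
# Q4_0: 7.8 GB, Q8_0: 15.6 GB, F16: 31.2 GB
```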

