Estimating the GPU memory needed to run a large language model
I came across a formula for estimating the GPU memory needed to run a large language model, looked into it a bit, and here is my summary.
The formula used to estimate the GPU memory requirement is:

M = (P × 4B) / (32 / Q) × Overhead

where M is the required memory in bytes and the remaining terms are defined below; a short Python sketch after the definitions makes the arithmetic concrete.
( P ): The number of parameters in the model, often in millions (M) or billions (B).
( 4B ): 4 bytes, the size of each parameter at full (FP32) precision. For comparison, an F16 parameter occupies 2 bytes and a 4-bit parameter 0.5 bytes.
( Q ): The quantization bit width: 16 bits for F16, 4 bits for Q4_0 or Q4_K_M. The factor 32 / Q is how many times smaller a quantized parameter is than its 32-bit (4-byte) FP32 counterpart.
Overhead: Additional memory needed beyond the weights themselves (for example activations and runtime buffers). It is given as a percentage and converted to a multiplier (e.g. 20% becomes 1.2).
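To make the arithmetic concrete, here is a minimal Python sketch of the formula above. The function name estimate_gpu_memory_gb and its arguments are my own, not part of any library.

```python
def estimate_gpu_memory_gb(params: float, q_bits: int, overhead: float = 0.20) -> float:
    """Estimate the GPU memory (in GB) needed to load a model.

    params   -- number of parameters, e.g. 13e9 for a 13B model (P)
    q_bits   -- quantization bit width, e.g. 16 for F16, 4 for Q4_0 (Q)
    overhead -- extra memory as a fraction of the weights, e.g. 0.20 for 20%
    """
    bytes_per_param_fp32 = 4                      # the "4B" term: 4 bytes at full precision
    base_bytes = params * bytes_per_param_fp32 / (32 / q_bits)
    return base_bytes * (1 + overhead) / 1e9      # convert bytes to (decimal) GB
```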
Example: Calculating GPU Memory for CodeLlama
Suppose we have a CodeLlama model (a large language model for generating and discussing code) with:
- Parameters (P): 13B = 13 × 10⁹
- Quantization (Q4_0): 4 bits → 0.5 bytes per parameter
- Overhead: 20% → 1.2 multiplier
Calculation:
Base Memory (Bytes):
13 × 10⁹ × 4 / (32 / 4) = 13 × 10⁹ × 0.5 = 6.5 × 10⁹ bytes
Adjusted Memory (GB):
6.5 GB × 1.2 = 7.8 GB
Result: The CodeLlama-13B model with Q4_0 quantization requires approximately 7.8 GB of GPU memory.
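Plugging the same numbers into the sketch above reproduces this result; the calls below are illustrative, not from the original formula's source.

```python
# CodeLlama-13B, Q4_0 (4-bit), 20% overhead
print(round(estimate_gpu_memory_gb(13e9, 4, 0.20), 1))   # -> 7.8 GB
# For comparison, the same model loaded in F16
print(round(estimate_gpu_memory_gb(13e9, 16, 0.20), 1))  # -> 31.2 GB
```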
THE END