
Estimating the GPU Memory Needed to Run Large Models

I came across a formula for estimating the GPU memory needed to run a large language model, looked into it a bit, and here is my summary.

The formula used to calculate the GPU memory requirement is:

M(\text{GB}) = \left( \frac{P \times B}{32/Q} \right) \times \text{Overhead}

  • P: The number of parameters in the model, usually given in millions (M) or billions (B).
  • B: The size of each parameter at full 32-bit precision, i.e. 4 bytes.
  • Q: The quantization bit width. For F16 this is 16 bits; for Q4_0 or Q4_K_M it is 4 bits. The factor 32/Q is how much quantization shrinks each parameter relative to the 4-byte full-precision size.
  • Overhead: Additional memory needed beyond the weights themselves (e.g. activations and runtime buffers), expressed as a percentage converted to a multiplier (e.g. 20% becomes 1.2).
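To make the formula easier to experiment with, here is a minimal Python sketch of it. The function name estimate_gpu_memory_gb and its arguments are made up for illustration; it simply encodes the formula above.

```python
def estimate_gpu_memory_gb(num_params: float, q_bits: int, overhead: float = 1.2) -> float:
    """Estimate the GPU memory (in GB) needed to load a model.

    num_params: total number of parameters, e.g. 13e9 for a 13B model (P).
    q_bits:     quantization bit width Q (16 for F16, 8 for Q8_0, 4 for Q4_0).
    overhead:   multiplier for memory beyond the weights, e.g. 1.2 for 20%.
    """
    bytes_per_param_fp32 = 4              # B: each full-precision parameter is 4 bytes (32 bits)
    compression = 32 / q_bits             # 32/Q: how much quantization shrinks each parameter
    weight_bytes = num_params * bytes_per_param_fp32 / compression
    return weight_bytes / 1e9 * overhead  # convert bytes to GB, then apply the overhead
```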

Example: Calculating GPU Memory for CodeLlama

Suppose we have a CodeLlama model (a large language model for generating and discussing code) with:

  • Parameters (P): 13B = 13 × 10⁹
  • Quantization (Q4_0): Q = 4 bits, so 32/Q = 8 and each parameter effectively takes 4 / 8 = 0.5 bytes
  • Overhead: 20% → 1.2 multiplier

Calculation:

  1. Base Memory (Bytes):

    13 × 10⁹ × 4 / (32 / 4) = 13 × 10⁹ × 0.5 = 6.5 × 10⁹ bytes ≈ 6.5 GB
    
  2. Adjusted Memory (GB):

    6.5 GB × 1.2 = 7.8 GB
    

Result: The CodeLlama-13B model with Q4_0 quantization requires approximately 7.8 GB of GPU memory.
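For reference, plugging other Q values into the sketch above reproduces this result and gives estimates for other quantization levels of the same 13B model; these numbers follow purely from the formula, not from measurements.

```python
# Hypothetical comparison of quantization levels for a 13B-parameter model
for name, q in [("Q4_0", 4), ("Q8_0", 8), ("F16", 16)]:
    print(f"{name}: {estimate_gpu_memory_gb(13e9, q_bits=q):.1f} GB")
# Q4_0: 7.8 GB, Q8_0: 15.6 GB, F16: 31.2 GB
```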

