From Input Embeddings to Context Vectors
1. Input Embeddings
The given input matrix inputs is a 6×3 tensor representing 6 words (tokens), each with an embedding dimension of 3:
import torch

inputs = torch.tensor([
[0.43, 0.15, 0.89], # Your (x^1)
[0.55, 0.87, 0.66], # journey (x^2)
[0.57, 0.85, 0.64], # starts (x^3)
[0.22, 0.58, 0.33], # with (x^4)
[0.77, 0.25, 0.10], # one (x^5)
[0.05, 0.80, 0.55] # step (x^6)
])
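For this walkthrough the embeddings are hard-coded. In a real pipeline they would normally come from a learned embedding layer indexed by token IDs. A minimal sketch, assuming hypothetical token IDs and a made-up vocabulary size (both are illustrative, not part of the example above):
import torch
import torch.nn as nn

# Hypothetical sketch: the token IDs and vocabulary size are made up for illustration.
token_ids = torch.tensor([0, 1, 2, 3, 4, 5])        # "Your journey starts with one step"
embedding_layer = nn.Embedding(num_embeddings=50, embedding_dim=3)
embedded = embedding_layer(token_ids)               # shape (6, 3), analogous to `inputs` above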
2. Defining the Weight Matrices (Key, Query, Value)
In self-attention, we compute a Key (K), Query (Q), and Value (V) for each input. Since the embedding dimension is 3, we use the same dimension (d_k = d_q = d_v = 3) for the Keys, Queries, and Values. We randomly initialize the weight matrices W_K, W_Q, and W_V (in practice these are learned; here they are random for demonstration):
import torch
# Randomly initialize the weight matrices (in practice these are learned)
torch.manual_seed(42)  # fix the random seed for reproducibility
W_K = torch.rand(3, 3)
W_Q = torch.rand(3, 3)
W_V = torch.rand(3, 3)
print("W_K:\n", W_K)
print("W_Q:\n", W_Q)
print("W_V:\n", W_V)
Suppose the initialization yields:
W_K:
tensor([[0.8823, 0.9150, 0.3829],
[0.9593, 0.3904, 0.6009],
[0.2566, 0.7936, 0.9408]])
W_Q:
tensor([[0.1332, 0.9346, 0.5936],
[0.8694, 0.5677, 0.7411],
[0.4294, 0.8854, 0.5739]])
W_V:
tensor([[0.2666, 0.6274, 0.2696],
[0.4414, 0.2969, 0.8317],
[0.1053, 0.2695, 0.3588]])
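In practice these projection matrices are not created with torch.rand; they are trainable parameters, typically wrapped in linear layers. A minimal sketch of the same projections using nn.Linear, with bias disabled so each layer is a pure matrix multiplication (the variable names are illustrative):
import torch.nn as nn

# Trainable projections; the weights start random and are updated during training.
W_query = nn.Linear(3, 3, bias=False)
W_key = nn.Linear(3, 3, bias=False)
W_value = nn.Linear(3, 3, bias=False)

queries = W_query(inputs)   # shape (6, 3)
keys = W_key(inputs)
values = W_value(inputs)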
3. Computing Key (K), Query (Q), and Value (V)
K, Q, and V are computed via matrix multiplication:
K = inputs @ W_K
Q = inputs @ W_Q
V = inputs @ W_V
print("K:\n", K)
print("Q:\n", Q)
print("V:\n", V)
Computed result:
K:
tensor([[0.6580, 0.9346, 1.3078],
[1.2217, 1.4293, 1.6173],
[1.1742, 1.4033, 1.5649],
[0.7099, 0.8848, 1.0107],
[0.8054, 0.8243, 0.6506],
[0.9061, 1.0764, 1.2073]])
Q:
tensor([[0.5355, 1.1458, 1.0127],
[1.1845, 1.7425, 1.5684],
[1.1383, 1.6947, 1.5195],
[0.6973, 1.0593, 0.9203],
[0.4435, 0.8677, 0.6613],
[0.9113, 1.3365, 1.1805]])
V:
tensor([[0.3207, 0.4867, 0.8328],
[0.5989, 0.6759, 1.2090],
[0.5789, 0.6646, 1.1796],
[0.3330, 0.4404, 0.6515],
[0.2878, 0.3060, 0.3385],
[0.4545, 0.5177, 0.8602]])
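The projections are applied row by row, so K, Q, and V all keep the shape (6, 3), and the i-th row depends only on the i-th input embedding. A quick check using the tensors above:
# Shapes are unchanged by the projections; each row is one token's key/query/value.
print(K.shape, Q.shape, V.shape)              # three tensors of shape torch.Size([6, 3])
print(torch.allclose(K[0], inputs[0] @ W_K))  # True: row 0 of K comes only from input row 0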
4. Computing the Attention Scores
The attention scores are the dot products between the Queries and Keys:
attention_scores = Q @ K.T
print("Attention Scores:\n", attention_scores)
Computed result:
Attention Scores:
tensor([[2.5471, 4.4315, 4.3147, 2.4612, 2.1296, 3.7771],
[4.4315, 7.9356, 7.7206, 4.3582, 3.6438, 6.6280],
[4.3147, 7.7206, 7.5126, 4.2413, 3.5407, 6.4518],
[2.4612, 4.3582, 4.2413, 2.4569, 2.0554, 3.6265],
[2.1296, 3.6438, 3.5407, 2.0554, 1.8309, 3.0732],
[3.7771, 6.6280, 6.4518, 3.6265, 3.0732, 5.5348]])
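Each entry of this matrix is a single dot product between one Query row and one Key row; for instance, the entry in row 1, column 2 is Q[0] · K[1]. A quick check:
# One entry of the score matrix: similarity between token 1's query and token 2's key.
score_01 = torch.dot(Q[0], K[1])
print(torch.allclose(score_01, attention_scores[0, 1]))  # True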
5. Scaling the Attention Scores
To keep the dot products from growing too large, which pushes softmax into regions with vanishing gradients, the attention scores are scaled by dividing by sqrt(d_k), where d_k is the Key dimension (here 3):
d_k = K.size(-1)
scaled_attention_scores = attention_scores / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
print("Scaled Attention Scores:\n", scaled_attention_scores)
Computed result:
Scaled Attention Scores:
tensor([[1.4706, 2.5587, 2.4910, 1.4210, 1.2296, 2.1807],
[2.5587, 4.5817, 4.4575, 2.5163, 2.1038, 3.8268],
[2.4910, 4.4575, 4.3372, 2.4487, 2.0443, 3.7248],
[1.4210, 2.5163, 2.4487, 1.4185, 1.1867, 2.0936],
[1.2296, 2.1038, 2.0443, 1.1867, 1.0570, 1.7743],
[2.1807, 3.8268, 3.7248, 2.0936, 1.7743, 3.1958]])
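The reason for dividing by sqrt(d_k): for vectors with roughly unit-variance components, the dot product has variance on the order of d_k, so the raw scores grow with the dimension. A small illustration (the dimension 512 and sample count are arbitrary, chosen only to make the effect visible):
import torch

# Dot products of unit-variance random vectors have standard deviation ~ sqrt(d),
# so dividing by sqrt(d) keeps the scores at a roughly constant scale.
d = 512
a = torch.randn(10000, d)
b = torch.randn(10000, d)
raw = (a * b).sum(dim=-1)
scaled = raw / d ** 0.5
print(raw.std())     # roughly sqrt(512) ≈ 22.6
print(scaled.std())  # roughly 1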
6. Computing the Attention Weights
Applying the softmax function to the scaled attention scores yields the attention weights:
attention_weights = torch.softmax(scaled_attention_scores, dim=-1)
print("Attention Weights:\n", attention_weights)
Computed result:
Attention Weights:
tensor([[0.0969, 0.2279, 0.2174, 0.0924, 0.0788, 0.1866],
[0.1076, 0.3543, 0.3365, 0.1033, 0.0767, 0.2216],
[0.1059, 0.3499, 0.3329, 0.1019, 0.0756, 0.2138],
[0.0977, 0.2335, 0.2226, 0.0935, 0.0799, 0.1928],
[0.0938, 0.2153, 0.2050, 0.0902, 0.0822, 0.1715],
[0.1026, 0.3025, 0.2862, 0.0993, 0.0771, 0.2323]])
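Because softmax normalizes along the last dimension, each row of the attention weights forms a probability distribution over the 6 tokens. This can be checked directly:
# Each row should sum to 1.0, up to floating-point rounding.
print(attention_weights.sum(dim=-1))  # six values, each approximately 1.0
print(attention_weights.shape)        # torch.Size([6, 6])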
7. Computing the Context Vectors
The context vectors are the weighted sums of the Values, using the attention weights:
context_vector = attention_weights @ V
print("Context Vector:\n", context_vector)
Computed result:
Context Vector:
tensor([[0.4540, 0.5596, 0.9423],
[0.5171, 0.6268, 1.0693],
[0.5135, 0.6229, 1.0629],
[0.4563, 0.5623, 0.9474],
[0.4485, 0.5527, 0.9313],
[0.5063, 0.6149, 1.0508]])
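Each context vector is simply that token's attention weights applied to all Value rows. For example, the first row can be reproduced as an explicit weighted sum:
# Reproduce the first context vector as a weighted sum over the Value rows.
manual = sum(attention_weights[0, j] * V[j] for j in range(V.shape[0]))
print(torch.allclose(manual, context_vector[0]))  # True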
8. Interpretation
- Key (K): used to measure similarity with the other tokens.
- Query (Q): represents the current token's "query" intent.
- Value (V): carries the actual information that gets weighted and summed.
- Attention scores: the raw pairwise relevance between tokens.
- Attention weights: the relevance scores normalized with softmax.
- Context vectors: the weighted sums, i.e., each token's new representation in context.
Summary
The above is the complete computation from input embeddings to context vectors. In real implementations these steps are optimized and batched, but working through them by hand helps build intuition for how self-attention works.
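As a recap, the whole procedure can be packaged into a single module. Below is a minimal sketch, assuming the same 3-dimensional embeddings and single-head attention with no masking or dropout (the class and variable names are illustrative, not a reference implementation):
import torch
import torch.nn as nn

class SimpleSelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention, following steps 3-7 above."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        Q = self.W_query(x)                      # step 3: project to queries
        K = self.W_key(x)                        # step 3: project to keys
        V = self.W_value(x)                      # step 3: project to values
        scores = Q @ K.transpose(-2, -1)         # step 4: pairwise dot products
        scores = scores / K.shape[-1] ** 0.5     # step 5: scale by sqrt(d_k)
        weights = torch.softmax(scores, dim=-1)  # step 6: normalize to attention weights
        return weights @ V                       # step 7: weighted sum of values

sa = SimpleSelfAttention(d_in=3, d_out=3)
print(sa(inputs).shape)   # torch.Size([6, 3])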