From Input Embeddings to Context Vectors
1. Input Embeddings
The given input matrix inputs is a 6×3 tensor representing 6 words (tokens), each with an embedding dimension of 3:
import torch

inputs = torch.tensor([
[0.43, 0.15, 0.89], # Your (x^1)
[0.55, 0.87, 0.66], # journey (x^2)
[0.57, 0.85, 0.64], # starts (x^3)
[0.22, 0.58, 0.33], # with (x^4)
[0.77, 0.25, 0.10], # one (x^5)
[0.05, 0.80, 0.55] # step (x^6)
])
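For this walkthrough the embeddings are hard-coded. In a real pipeline they would normally come from a learned embedding layer indexed by token IDs. A minimal sketch, assuming hypothetical token IDs and a made-up vocabulary size (both are illustrative, not part of the example above):
import torch
import torch.nn as nn

# Hypothetical sketch: the token IDs and vocabulary size are made up for illustration.
token_ids = torch.tensor([0, 1, 2, 3, 4, 5])        # "Your journey starts with one step"
embedding_layer = nn.Embedding(num_embeddings=50, embedding_dim=3)
embedded = embedding_layer(token_ids)               # shape (6, 3), analogous to `inputs` above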
2. Defining the Weight Matrices (Key, Query, Value)
In self-attention, we compute a Key (K), Query (Q), and Value (V) for each input. Since the embedding dimension is 3, we use the same dimension (d_k = d_q = d_v = 3) for the Keys, Queries, and Values. We randomly initialize the weight matrices W_K, W_Q, and W_V (in practice these are learned; here they are random for demonstration):
import torch
# Randomly initialize the weight matrices (in practice these are learned)
torch.manual_seed(42)  # fix the random seed for reproducibility
W_K = torch.rand(3, 3)
W_Q = torch.rand(3, 3)
W_V = torch.rand(3, 3)
print("W_K:\n", W_K)
print("W_Q:\n", W_Q)
print("W_V:\n", W_V)
Suppose the initialization yields:
W_K:
tensor([[0.8823, 0.9150, 0.3829],
[0.9593, 0.3904, 0.6009],
[0.2566, 0.7936, 0.9408]])
W_Q:
tensor([[0.1332, 0.9346, 0.5936],
[0.8694, 0.5677, 0.7411],
[0.4294, 0.8854, 0.5739]])
W_V:
tensor([[0.2666, 0.6274, 0.2696],
[0.4414, 0.2969, 0.8317],
[0.1053, 0.2695, 0.3588]])
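In practice these projection matrices are not created with torch.rand; they are trainable parameters, typically wrapped in linear layers. A minimal sketch of the same projections using nn.Linear, with bias disabled so each layer is a pure matrix multiplication (the variable names are illustrative):
import torch.nn as nn

# Trainable projections; the weights start random and are updated during training.
W_query = nn.Linear(3, 3, bias=False)
W_key = nn.Linear(3, 3, bias=False)
W_value = nn.Linear(3, 3, bias=False)

queries = W_query(inputs)   # shape (6, 3)
keys = W_key(inputs)
values = W_value(inputs)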
3. Computing Key (K), Query (Q), and Value (V)
K, Q, and V are computed via matrix multiplication:
K = inputs @ W_K
Q = inputs @ W_Q
V = inputs @ W_V
print("K:\n", K)
print("Q:\n", Q)
print("V:\n", V)
Computed result:
K:
tensor([[0.6580, 0.9346, 1.3078],
[1.2217, 1.4293, 1.6173],
[1.1742, 1.4033, 1.5649],
[0.7099, 0.8848, 1.0107],
[0.8054, 0.8243, 0.6506],
[0.9061, 1.0764, 1.2073]])
Q:
tensor([[0.5355, 1.1458, 1.0127],
[1.1845, 1.7425, 1.5684],
[1.1383, 1.6947, 1.5195],
[0.6973, 1.0593, 0.9203],
[0.4435, 0.8677, 0.6613],
[0.9113, 1.3365, 1.1805]])
V:
tensor([[0.3207, 0.4867, 0.8328],
[0.5989, 0.6759, 1.2090],
[0.5789, 0.6646, 1.1796],
[0.3330, 0.4404, 0.6515],
[0.2878, 0.3060, 0.3385],
[0.4545, 0.5177, 0.8602]])
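The projections are applied row by row, so K, Q, and V all keep the shape (6, 3), and the i-th row depends only on the i-th input embedding. A quick check using the tensors above:
# Shapes are unchanged by the projections; each row is one token's key/query/value.
print(K.shape, Q.shape, V.shape)              # three tensors of shape torch.Size([6, 3])
print(torch.allclose(K[0], inputs[0] @ W_K))  # True: row 0 of K comes only from input row 0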
4. Computing the Attention Scores
The attention scores are the dot products between the Queries and Keys:
attention_scores = Q @ K.T
print("Attention Scores:\n", attention_scores)
Computed result:
Attention Scores:
tensor([[2.5471, 4.4315, 4.3147, 2.4612, 2.1296, 3.7771],
[4.4315, 7.9356, 7.7206, 4.3582, 3.6438, 6.6280],
[4.3147, 7.7206, 7.5126, 4.2413, 3.5407, 6.4518],
[2.4612, 4.3582, 4.2413, 2.4569, 2.0554, 3.6265],
[2.1296, 3.6438, 3.5407, 2.0554, 1.8309, 3.0732],
[3.7771, 6.6280, 6.4518, 3.6265, 3.0732, 5.5348]])
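Each entry of this matrix is a single dot product between one Query row and one Key row; for instance, the entry in row 1, column 2 is Q[0] · K[1]. A quick check:
# One entry of the score matrix: similarity between token 1's query and token 2's key.
score_01 = torch.dot(Q[0], K[1])
print(torch.allclose(score_01, attention_scores[0, 1]))  # True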
5. Scaling the Attention Scores
To keep the dot products from growing too large, which pushes softmax into regions with vanishing gradients, the attention scores are scaled by dividing by sqrt(d_k), where d_k is the Key dimension (here 3):
d_k = K.size(-1)
scaled_attention_scores = attention_scores / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
print("Scaled Attention Scores:\n", scaled_attention_scores)
Computed result:
Scaled Attention Scores:
tensor([[1.4706, 2.5587, 2.4910, 1.4210, 1.2296, 2.1807],
[2.5587, 4.5817, 4.4575, 2.5163, 2.1038, 3.8268],
[2.4910, 4.4575, 4.3372, 2.4487, 2.0443, 3.7248],
[1.4210, 2.5163, 2.4487, 1.4185, 1.1867, 2.0936],
[1.2296, 2.1038, 2.0443, 1.1867, 1.0570, 1.7743],
[2.1807, 3.8268, 3.7248, 2.0936, 1.7743, 3.1958]])
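The reason for dividing by sqrt(d_k): for vectors with roughly unit-variance components, the dot product has variance on the order of d_k, so the raw scores grow with the dimension. A small illustration (the dimension 512 and sample count are arbitrary, chosen only to make the effect visible):
import torch

# Dot products of unit-variance random vectors have standard deviation ~ sqrt(d),
# so dividing by sqrt(d) keeps the scores at a roughly constant scale.
d = 512
a = torch.randn(10000, d)
b = torch.randn(10000, d)
raw = (a * b).sum(dim=-1)
scaled = raw / d ** 0.5
print(raw.std())     # roughly sqrt(512) ≈ 22.6
print(scaled.std())  # roughly 1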
6. Computing the Attention Weights
Applying the softmax function to the scaled attention scores yields the attention weights:
attention_weights = torch.softmax(scaled_attention_scores, dim=-1)
print("Attention Weights:\n", attention_weights)
Computed result:
Attention Weights:
tensor([[0.0969, 0.2279, 0.2174, 0.0924, 0.0788, 0.1866],
[0.1076, 0.3543, 0.3365, 0.1033, 0.0767, 0.2216],
[0.1059, 0.3499, 0.3329, 0.1019, 0.0756, 0.2138],
[0.0977, 0.2335, 0.2226, 0.0935, 0.0799, 0.1928],
[0.0938, 0.2153, 0.2050, 0.0902, 0.0822, 0.1715],
[0.1026, 0.3025, 0.2862, 0.0993, 0.0771, 0.2323]])
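Because softmax normalizes along the last dimension, each row of the attention weights forms a probability distribution over the 6 tokens. This can be checked directly:
# Each row should sum to 1.0, up to floating-point rounding.
print(attention_weights.sum(dim=-1))  # six values, each approximately 1.0
print(attention_weights.shape)        # torch.Size([6, 6])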
7. Computing the Context Vectors
The context vectors are the weighted sums of the Values, using the attention weights:
context_vector = attention_weights @ V
print("Context Vector:\n", context_vector)
Computed result:
Context Vector:
tensor([[0.4540, 0.5596, 0.9423],
[0.5171, 0.6268, 1.0693],
[0.5135, 0.6229, 1.0629],
[0.4563, 0.5623, 0.9474],
[0.4485, 0.5527, 0.9313],
[0.5063, 0.6149, 1.0508]])
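Each context vector is simply that token's attention weights applied to all Value rows. For example, the first row can be reproduced as an explicit weighted sum:
# Reproduce the first context vector as a weighted sum over the Value rows.
manual = sum(attention_weights[0, j] * V[j] for j in range(V.shape[0]))
print(torch.allclose(manual, context_vector[0]))  # True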
8. Interpretation
- Key (K): used to measure similarity with the other tokens.
- Query (Q): represents the current token's "query" intent.
- Value (V): carries the actual information that gets weighted and summed.
- Attention scores: the raw pairwise relevance between tokens.
- Attention weights: the relevance scores normalized with softmax.
- Context vectors: the weighted sums, i.e., each token's new representation in context.
Summary
The above is the complete computation from input embeddings to context vectors. In real implementations these steps are optimized and batched, but working through them by hand helps build intuition for how self-attention works.
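As a recap, the whole procedure can be packaged into a single module. Below is a minimal sketch, assuming the same 3-dimensional embeddings and single-head attention with no masking or dropout (the class and variable names are illustrative, not a reference implementation):
import torch
import torch.nn as nn

class SimpleSelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention, following steps 3-7 above."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        Q = self.W_query(x)                      # step 3: project to queries
        K = self.W_key(x)                        # step 3: project to keys
        V = self.W_value(x)                      # step 3: project to values
        scores = Q @ K.transpose(-2, -1)         # step 4: pairwise dot products
        scores = scores / K.shape[-1] ** 0.5     # step 5: scale by sqrt(d_k)
        weights = torch.softmax(scores, dim=-1)  # step 6: normalize to attention weights
        return weights @ V                       # step 7: weighted sum of values

sa = SimpleSelfAttention(d_in=3, d_out=3)
print(sa(inputs).shape)   # torch.Size([6, 3])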