Understanding Attention Mechanisms
Introduction
The attention mechanism mimics human cognitive attention: it lets a model focus on the most relevant parts of the input sequence when producing each element of the output.
Scaled Dot-Product Attention
The core formula for scaled dot-product attention is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

Where:
- Q is the Query matrix
- K is the Key matrix
- V is the Value matrix
- d_k is the dimension of the key (and query) vectors, used to scale the dot products before the softmax
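With these definitions, a quick shape check makes the formula concrete (the sequence lengths n and m and the dimensions d_k, d_v below are illustrative, not from the original post):

$$Q \in \mathbb{R}^{n \times d_k}, \qquad K \in \mathbb{R}^{m \times d_k}, \qquad V \in \mathbb{R}^{m \times d_v}$$

$$QK^{\top} \in \mathbb{R}^{n \times m}, \qquad \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) \in \mathbb{R}^{n \times m}, \qquad \mathrm{Attention}(Q, K, V) \in \mathbb{R}^{n \times d_v}$$

Each row of the softmax output sums to 1, so each output row is a weighted average of the rows of V, with weights given by how strongly that query matches each key.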
Implementation in PyTorch
attention.py
import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(query, key, value, mask=None):
    # Scale the dot products by sqrt(d_k) to keep the softmax in a well-behaved range
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    # Mask out positions that should not be attended to (e.g. padding or future tokens)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Normalize the scores into attention weights, then take the weighted sum of values
    p_attn = F.softmax(scores, dim=-1)
    return torch.matmul(p_attn, value), p_attn
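As a quick sanity check, the function can be exercised with random tensors. This is a minimal usage sketch; the batch, head, and sequence sizes are arbitrary choices for illustration, not values from the original post.

    # Illustrative shapes: batch of 2, 4 heads, sequence length 10, key dimension 64
    batch, heads, seq_len, d_k = 2, 4, 10, 64
    q = torch.randn(batch, heads, seq_len, d_k)
    k = torch.randn(batch, heads, seq_len, d_k)
    v = torch.randn(batch, heads, seq_len, d_k)

    out, weights = scaled_dot_product_attention(q, k, v)
    print(out.shape)      # torch.Size([2, 4, 10, 64])
    print(weights.shape)  # torch.Size([2, 4, 10, 10])

The attention weights have one row per query position and one column per key position, while the output keeps the query sequence length and the value dimension, matching the shape check above.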