Understanding Attention Mechanisms


Introduction

The attention mechanism is loosely inspired by human cognitive attention: it lets the model focus on the most relevant parts of the input sequence when producing each element of the output.

The Scaled Dot-Product Attention

The core formula for attention is:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Where:

- Q is the Query matrix

- K is the Key matrix

- V is the Value matrix

- d_k is the dimensionality of the key (and query) vectors; dividing by its square root keeps the dot products from growing too large before the softmax
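To see the formula in action, here is a minimal sketch (the shapes and values are illustrative, not from the original post) that evaluates the expression directly on small random tensors:

import math

import torch

torch.manual_seed(0)

# One sequence of 3 tokens, each represented by a 4-dimensional vector (illustrative sizes)
Q = torch.randn(3, 4)
K = torch.randn(3, 4)
V = torch.randn(3, 4)

d_k = Q.size(-1)
scores = Q @ K.T / math.sqrt(d_k)          # (3, 3) matrix of query-key similarities
weights = torch.softmax(scores, dim=-1)    # each row sums to 1
output = weights @ V                       # (3, 4) weighted mix of the value vectors

print(weights.sum(dim=-1))  # tensor([1., 1., 1.])

Each row of weights tells us how strongly the corresponding query attends to every key, and the matching output row is simply that convex combination of the value vectors.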

Implementation in PyTorch

attention.py

import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(query, key, value, mask=None):
    # Dimensionality of the key/query vectors, used for scaling
    d_k = query.size(-1)
    # Raw attention scores: QK^T / sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    # Mask out disallowed positions (e.g. padding or future tokens) before the softmax
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Normalize the scores into attention weights and apply them to the values
    p_attn = F.softmax(scores, dim=-1)
    return torch.matmul(p_attn, value), p_attn
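Continuing from attention.py above, a quick usage sketch (batch size, sequence length, and embedding dimension are made up for illustration) with a causal mask that stops each position from attending to later ones:

# Batch of 2 sequences, 5 tokens each, 16-dimensional embeddings (illustrative sizes)
q = torch.randn(2, 5, 16)
k = torch.randn(2, 5, 16)
v = torch.randn(2, 5, 16)

# Lower-triangular (causal) mask, shape (1, 5, 5), broadcast over the batch
mask = torch.tril(torch.ones(5, 5)).unsqueeze(0)

out, attn = scaled_dot_product_attention(q, k, v, mask=mask)
print(out.shape)   # torch.Size([2, 5, 16])
print(attn.shape)  # torch.Size([2, 5, 5])

Returning the attention weights alongside the output is a convenient design choice for visualizing which tokens each position attends to.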