
12. Transformers

Introduction

In recent years, the Transformer architecture has emerged as a game-changer in the field of natural language processing (NLP), transforming the landscape of machine translation, text generation, and a myriad of other tasks. Originally introduced by Vaswani et al. in their seminal paper "Attention Is All You Need" in 2017, the Transformer has rapidly become the de facto standard for state-of-the-art NLP models.

The Transformer architecture comprises several key components, including the encoder, the decoder, the self-attention mechanism, multi-head attention, positional encoding, and feed-forward neural networks. Let's delve deeper into each of these components.

Dot Product Self Attention

A standard neural network layer \(f[x]\) takes a D × 1 input \(x\) and applies a linear transformation followed by an activation function such as a ReLU: $$ f[x] = ReLU[\beta + \Omega x] $$ where \(\beta\) contains the biases and \(\Omega\) contains the weights.
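As a quick illustration, here is a minimal NumPy sketch of such a layer; the dimensions and random parameters are placeholders chosen only for the example.

```python
import numpy as np

D = 4                          # input dimension (placeholder)
x = np.random.randn(D, 1)      # D x 1 input
beta = np.random.randn(D, 1)   # biases
Omega = np.random.randn(D, D)  # weights

def layer(x, beta, Omega):
    """Standard fully connected layer: f[x] = ReLU[beta + Omega x]."""
    return np.maximum(0.0, beta + Omega @ x)

print(layer(x, beta, Omega).shape)  # (4, 1)
```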

The self-attention block \(sa[\cdot]\) takes N inputs \(x_1, x_2, ..., x_N\), each of dimension D × 1, and returns N output vectors of the same size. In the context of NLP, each input represents a word or word fragment. A set of values is computed for each input: $$ v_m = \beta_v + \Omega_v x_m $$ Then the \(n^{th}\) output is a weighted sum of all the values \(v_1, v_2, ..., v_N\):
$$ sa_n[x_1, ..., x_N]= \sum\limits_{m=1}^{N} a[x_m, x_n]v_m $$ The scalar weight \(a[x_m, x_n]\) is the attention that the \(n^{th}\) output pays to input \(x_m\). The N weights \(a[\cdot, x_n]\) are non-negative and sum to 1. Hence, self-attention can be thought of as routing the values in different proportions to create each output.

Computing and Weighting Values

The same weights \(\Omega_v\) and biases \(\beta_v\) are applied to every input, so the value computation scales linearly with the sequence length N, unlike a fully connected layer that maps all D × N inputs at once. The value computation can be viewed as a sparse matrix operation with shared parameters.

The attention weights \(a[x_m, x_n]\) combine the values from different inputs. It follows that the number of attention weights has a quadratic dependence on the sequence length N, but is independent of the dimension D of each input \(x_n\).

Computing Attention Weights

We saw that the outputs result from two chained linear transformations. Value vectors \(\beta_v + \Omega_vx_m\) are computed independently for each input \(x_m\). These vectors are linearly combined by the attention weights \(a[x_m, x_n]\).

To compute the attention weights, we apply two more linear transformations to the inputs: $$ \begin{align} q_n &= \beta_q + \Omega_q x_n \\ k_m &= \beta_k + \Omega_k x_m \end{align} $$

where \(q_n\) and \(k_m\) are called queries and keys, respectively. We compute the dot product of the queries and keys and pass the result through a softmax function:

\[ \begin{align} a[x_m, x_n] &= softmax_m[k_\cdot^T q_n] \\ &= \frac{\exp[k_m^T q_n]}{\sum\limits_{m'=1}^{N} \exp[k_{m'}^T q_n]} \end{align} \]

For each \(x_n\), the attention scores are positive and sum to 1. This is known as dot-product self-attention.

The dot product measures the similarity of its inputs, so the attention scores \(a[x_\cdot, x_n]\) depend on the relative similarity between the \(n^{th}\) query and all of the keys. The softmax operation means the keys compete to contribute to the final result. The queries and keys must have the same dimension, but this dimension can differ from that of the values.
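To make this concrete, here is a minimal NumPy sketch of dot-product self-attention for one sequence, following the equations above; the sequence length, dimensions, and random parameters are placeholders.

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

N, D = 6, 8                      # sequence length, input dimension
X = np.random.randn(D, N)        # inputs x_1 ... x_N stored as columns

# Parameters for values, queries, and keys
beta_v, Omega_v = np.random.randn(D, 1), np.random.randn(D, D)
beta_q, Omega_q = np.random.randn(D, 1), np.random.randn(D, D)
beta_k, Omega_k = np.random.randn(D, 1), np.random.randn(D, D)

V = beta_v + Omega_v @ X         # values  v_m
Q = beta_q + Omega_q @ X         # queries q_n
K = beta_k + Omega_k @ X         # keys    k_m

A = softmax(K.T @ Q, axis=0)     # A[m, n] = a[x_m, x_n]; each column sums to 1
Out = V @ A                      # n-th column is sum_m a[x_m, x_n] * v_m

print(Out.shape)                 # (8, 6): N outputs of dimension D
```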

Extensions to dot-product self-attention

In dot-product self-attention, the computation is the same regardless of the order of the inputs \(x_n\): permuting the inputs simply permutes the outputs. However, order is important when the inputs correspond to the words in a sentence.

Positional encoding

Positional encoding is a technique used to inject information about the position of tokens in a sequence into the input embeddings. The positional encoding is added to the input embeddings before feeding them into the Transformer encoder or decoder.

One commonly used method for positional encoding is the sine and cosine functions. The positional encoding matrix P for a sequence of length L and embedding dimension d is computed as follows:

\[ \begin{align*} P_{(pos, 2i)} = \sin\left(\frac{{pos}}{{10000^{2i/d}}}\right) \\ P_{(pos, 2i+1)} = \cos\left(\frac{{pos}}{{10000^{2i/d}}}\right) \end{align*} \]

Where:

  • pos is the position of the token in the sequence.
  • i indexes the embedding dimensions in pairs, so dimension 2i receives the sine term and dimension 2i+1 the cosine term.
  • d is the embedding dimension.

Using positional encoding, Transformers can effectively capture the sequential order of tokens in the input sequence, enabling them to process sequential data such as natural language text more effectively.
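Below is a minimal NumPy sketch of this sinusoidal positional encoding; the sequence length, embedding dimension, and placeholder embeddings are illustrative values only.

```python
import numpy as np

def positional_encoding(L, d):
    """Return an L x d matrix of sinusoidal positional encodings (d even)."""
    pos = np.arange(L)[:, None]              # token positions 0 .. L-1
    i = np.arange(d // 2)[None, :]           # dimension-pair index
    angles = pos / (10000 ** (2 * i / d))    # L x (d/2)
    P = np.zeros((L, d))
    P[:, 0::2] = np.sin(angles)              # even dimensions: sine
    P[:, 1::2] = np.cos(angles)              # odd dimensions: cosine
    return P

P = positional_encoding(L=50, d=16)
embeddings = np.random.randn(50, 16)         # placeholder token embeddings
X = embeddings + P                           # positional information is added to the embeddings
print(X.shape)                               # (50, 16)
```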

Scaled dot product self-attention

The dot products in the attention computation can have large magnitudes and move the arguments to the softmax function into a region where the largest value completely dominates. Small changes to the inputs of the softmax function now have little effect on the output, making the model difficult to train. To prevent this, the dot products are divided by the square root of the dimension \(D_q\) of the queries and keys.
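Written out, the scaled attention weights become

\[ a[x_m, x_n] = softmax_m\left[\frac{k_\cdot^T q_n}{\sqrt{D_q}}\right] \]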

Multiple heads

Multiple self-attention mechanisms are usually applied in parallel, and this is known as multi-head self-attention. Now H different sets of values, keys, and queries are computed:

\[ \begin{align} V_h &= \beta_{vh}1^T + \Omega_{vh}X \\ Q_h &= \beta_{qh}1^T + \Omega_{qh}X \\ K_h &= \beta_{kh}1^T + \Omega_{kh}X \\ \end{align} \]
\[ sa_h[X] = V_h \, Softmax\left[\frac{K_h^T Q_h}{\sqrt{D_q}}\right] \]

where each head h has its own parameters for the values \((\beta_{vh}, \Omega_{vh})\), queries \((\beta_{qh}, \Omega_{qh})\), and keys \((\beta_{kh}, \Omega_{kh})\).

Typically, if the dimension of the inputs \(x_m\) is D and there are H heads, the values, queries, and keys will all be of size D/H. The outputs of these self-attention mechanisms are vertically concatenated, and another linear transform \(\Omega_c\) is applied to combine them.

Multiple heads seem to be necessary to make self-attention work well. It has been speculated that they make the self-attention network more robust to bad initializations.
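The following is a minimal NumPy sketch of multi-head self-attention with the heads concatenated and recombined by \(\Omega_c\); all shapes and random parameters are placeholders for illustration.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def head(X, beta_v, Omega_v, beta_q, Omega_q, beta_k, Omega_k):
    """Scaled dot-product self-attention for a single head."""
    V = beta_v + Omega_v @ X
    Q = beta_q + Omega_q @ X
    K = beta_k + Omega_k @ X
    Dq = Q.shape[0]
    return V @ softmax(K.T @ Q / np.sqrt(Dq), axis=0)

def multihead_self_attention(X, head_params, Omega_c):
    """Vertically concatenate the H head outputs and recombine with Omega_c."""
    outputs = [head(X, *p) for p in head_params]
    return Omega_c @ np.concatenate(outputs, axis=0)

D, N, H = 8, 6, 2
Dh = D // H                          # per-head dimension D/H
X = np.random.randn(D, N)
shapes = [(Dh, 1), (Dh, D), (Dh, 1), (Dh, D), (Dh, 1), (Dh, D)]
head_params = [tuple(np.random.randn(*s) for s in shapes) for _ in range(H)]
Omega_c = np.random.randn(D, D)      # combines the concatenated heads

print(multihead_self_attention(X, head_params, Omega_c).shape)  # (8, 6)
```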

Transformer Layers

A transformer layer consists of a multi-head self-attention unit followed by a fully connected network \(mlp[x_\cdot]\) that is applied to each token separately. Both units are residual networks. In addition, it is typical to add a LayerNorm operation after both the self-attention and the fully connected network. This is similar to BatchNorm but uses statistics across the tokens within a single input sequence to perform the normalization.

The transformer layer can be described as

\[ \begin{align} X &\leftarrow X + MhSa[X] \\ X &\leftarrow LayerNorm[X] \\ x_n &\leftarrow x_n + mlp[x_n] \\ X &\leftarrow LayerNorm[X] \end{align} \]

where the column vectors \(x_n\) are taken separately from the full data matrix X.
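Here is a minimal NumPy sketch of one transformer layer assembled from these pieces. The mhsa argument stands in for any multi-head self-attention function mapping a D × N matrix to a D × N matrix (such as the sketch above); the simplified LayerNorm omits the learned scale and offset, and all parameters are placeholders.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Simplified LayerNorm: normalize each column of X (learned scale and offset omitted)."""
    mu = X.mean(axis=0, keepdims=True)
    sigma = X.std(axis=0, keepdims=True)
    return (X - mu) / (sigma + eps)

def mlp(x, W1, b1, W2, b2):
    """Two-layer fully connected network applied to a single token x."""
    return W2 @ np.maximum(0.0, W1 @ x + b1) + b2

def transformer_layer(X, mhsa, mlp_params):
    X = layer_norm(X + mhsa(X))                        # residual self-attention, then LayerNorm
    X = X + np.hstack([mlp(X[:, [n]], *mlp_params)     # residual per-token MLP
                       for n in range(X.shape[1])])
    return layer_norm(X)                               # final LayerNorm

D, N, D_hidden = 8, 6, 16
X = np.random.randn(D, N)
mlp_params = (np.random.randn(D_hidden, D), np.random.randn(D_hidden, 1),
              np.random.randn(D, D_hidden), np.random.randn(D, 1))
mhsa = lambda X: X                                     # placeholder; substitute a real MhSa here
print(transformer_layer(X, mhsa, mlp_params).shape)    # (8, 6)
```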

Transformers for NLP

A typical NLP pipeline starts with a tokenizer that splits the text into words or word fragments. Each token is then mapped to a learned embedding. These embeddings are passed through a series of transformer layers.

  • Tokenization
    • This splits the text into smaller constituent units (tokens) from a vocabulary of possible tokens.
    • In practice, a compromise between letters and full words is used, and the final vocabulary includes both common words and word fragments from which larger and less frequent words can be composed.
    • The vocabulary is computed using a sub-word tokenizer such as byte pair encoding that greedily merges commonly occurring sub-strings based on their frequency.
  • Embeddings
    • Each token in the vocabulary V is mapped to a unique word embedding, and the embeddings for the whole vocabulary are stored in a matrix \(\Omega_e \in R^{D \times |V|}\)
    • The input embeddings are computed as \(X=\Omega_e T\), and \(\Omega_e\) is learned like any other network parameter (a sketch of this lookup follows the list below).
    • A typical embedding size D is 1024, and a typical total vocabulary size |V| is 30,000, so even before the main network, there are many parameters in \(\Omega_e\) to learn.
  • Transformer
    • Finally, the embedding matrix X representing the text is passed through a series of K transformer layers, called a transformer model.
    • There are three types of transformer models.
      • An encoder transforms the text embeddings into a representation that can support a variety of tasks.
      • A decoder predicts the next token to continue the input text.
      • Encoder-decoders are used in sequence-to-sequence tasks, where one text string is converted into another.
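A minimal NumPy sketch of the embedding lookup \(X=\Omega_e T\); the vocabulary size, embedding dimension, and token ids are placeholders, and T is taken to be a matrix of one-hot columns selecting each token of the input.

```python
import numpy as np

vocab_size, D = 30000, 1024          # |V| and embedding dimension D
Omega_e = np.random.randn(D, vocab_size) * 0.02   # learned embedding matrix

token_ids = np.array([17, 523, 8, 42])            # tokenizer output (placeholder ids)

# T: one-hot indicator matrix with one column per token of the input
T = np.zeros((vocab_size, len(token_ids)))
T[token_ids, np.arange(len(token_ids))] = 1.0

X = Omega_e @ T                      # D x N input embeddings
# Equivalent in practice: index the embedding matrix columns directly
assert np.allclose(X, Omega_e[:, token_ids])
print(X.shape)                       # (1024, 4)
```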

References

  1. Understanding Deep Learning
  2. Positional Encoding
  3. The Annotated Transformer
  4. Transformer