1. Transformer
Encoder-Decoder Architecture
Many NLP problems, such as machine translation, question answering, and text summarization, use pairs of variable-length sequences as inputs to train the model. The encoder-decoder architecture is used to solve these tasks: the encoder takes in the input sequence and converts it into a fixed-length state, and the decoder takes that fixed-length state and converts it back into a variable-length output.
Encoder
- Input sentence is tokenized into words and words are mapped into feature vectors.
- The state \(h_t\) known as the context variable or the context vector encodes the information of the entire input sequence.
- The RNN can be bidirectional, in which case the hidden state depends not only on the previous hidden state \(h_{t−1}\) and input \(x_t\), but also on the next hidden state \(h_{t+1}\).
Decoder
- The decoder uses the output of the encoder, i.e. the context variable \(c\), together with the given output sequence to generate the decoded outputs.
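To make the encoder-decoder idea concrete, here is a minimal sketch of an RNN-based encoder and decoder, assuming PyTorch and GRU layers; the class names and dimensions (Encoder, Decoder, hidden_dim, etc.) are illustrative, not from the source.

```python
# Minimal RNN encoder-decoder sketch (assumes PyTorch; names are illustrative).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len) token ids
        _, h = self.rnn(self.embed(src))     # h: (1, batch, hidden_dim)
        return h                             # fixed-length context state

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, h):               # tgt: (batch, tgt_len) token ids
        y, h = self.rnn(self.embed(tgt), h)  # condition on the encoder context
        return self.out(y), h                # logits over output tokens
```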
Training
- The decoder predicts a probability distribution over the output tokens at each time step, with a softmax giving the distribution over the vocabulary.
- Encoder and decoder are jointly trained, and the cross-entropy loss is used for optimization.
- Teacher forcing is a strategy for training RNNs that feeds the ground truth as input at each step instead of the prior decoded output (see the sketch after this list).
- Teacher forcing helps in addressing the slow convergence and instability problems when training RNNs.
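Below is a minimal sketch of one training step with teacher forcing, assuming the hypothetical Encoder/Decoder classes from the earlier sketch; shapes, names, and the optimizer are illustrative.

```python
# One teacher-forced training step: the ground-truth target tokens (shifted by
# one position) are fed as decoder inputs, and cross-entropy is computed
# against the next tokens.
import torch
import torch.nn as nn

def train_step(encoder, decoder, src, tgt, optimizer):
    # tgt: (batch, tgt_len), starting with a start-of-sequence token
    optimizer.zero_grad()
    h = encoder(src)                          # context from the source sequence
    decoder_input = tgt[:, :-1]               # ground truth fed as inputs
    logits, _ = decoder(decoder_input, h)     # predict the next token at each step
    target = tgt[:, 1:]                       # shifted ground truth as labels
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), target.reshape(-1))
    loss.backward()                           # encoder and decoder trained jointly
    optimizer.step()
    return loss.item()
```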
Issues
- Information bottleneck - the entire input sequence is compressed into a single fixed-length context vector.
- Sequence lengths can vary at inference time.
- Difficult to parallelize.
- Vanishing/Exploding gradients.
Attention
- Attention mechanism involves selectively focusing on specific elements while filtering out the less relevant ones.
- The attention mechanism can be viewed as a memory of keys and values together with a layer that, given a query, generates an output from the values whose keys match the query.
- The attention layer measures the similarity between the query and the key using a score function \(\alpha\), which returns scores \(a_1, \dots, a_n\) for keys \(k_1, \dots, k_n\), given by $$ a_i = \alpha(q, k_i) $$
Dot Product
- The dot product-based scoring function is the simplest one and has no parameters to tune: $$ \alpha(q, k) = q \cdot k $$
Scaled Dot Product
- The scaled dot product-based scoring function divides the dot product by \(\sqrt{d_k}\) to remove the influence of the dimension \(d_k\): $$ \alpha(q, k) = \frac{q \cdot k}{\sqrt{d_k}} $$
- Attention weights are computed as a softmax over the scores: \(b = \mathrm{softmax}(a)\)
- The final output is the weighted sum of the attention weights and the values (sketched below): $$ o = \sum\limits_{i=1}^{n} b_i v_i $$
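As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention for a single query against \(n\) keys and values; the dimensions are arbitrary placeholders.

```python
# Score -> softmax -> weighted sum, for one query q against n keys/values.
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attention(q, keys, values):
    d_k = q.shape[-1]
    scores = keys @ q / np.sqrt(d_k)   # a_i = q . k_i / sqrt(d_k)
    b = softmax(scores)                # attention weights b = softmax(a)
    return b @ values                  # o = sum_i b_i v_i

q = np.random.randn(8)                 # query of dimension d_k = 8
keys = np.random.randn(5, 8)           # n = 5 keys
values = np.random.randn(5, 16)        # matching values of dimension 16
o = attention(q, keys, values)         # output of dimension 16
```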
Transformer
- Transformer combines the advantages of convolutional neural networks (CNN) to parallelize the computations and recurrent neural networks (RNN) to capture long-range, variable-length sequential information.
- In the transformer architecture, to gain speed and parallelism, recurrent neural networks are replaced by multi-head attention layers.
- Word Embeddings - a lookup over the tokens converts a sentence of length \(l\) into a matrix \(W\) of dimension \(l \times d\).
- Positional Encoding - injects word-order information, which the attention layers alone do not capture (see the Positional Encoding section below).
Attention
Self-Attention
- The inputs \(x_i\) are converted to the output vectors \(z_i\) through the self-attention layer.
- Each input vector \(x_i\) generates three different vectors: the query, key, and value \((q_i, k_i, v_i)\).
- The query, key, and value vectors are obtained by projecting the input vector \(x_i\) at time \(i\) onto the learnable weight matrices \(W_q, W_k, W_v\) to get \(q_i\), \(k_i\), and \(v_i\), respectively (see the sketch after this list).
- Key Roles
- The query vector of token \(i\), \(q_i\), combines with every key vector through the scores \(q_i k_j^T\) for \(j = 1, \dots, l\) to determine the weights for its own output \(z_i\).
- The key vector of token \(i\), \(k_i\), is matched against every query vector and influences the outputs through the query-key product scoring.
- The value vector of token \(i\), \(v_i\), carries the information that is combined, weighted by the query-key scores, into the output vectors \(z_i\).
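A minimal NumPy sketch of single-head self-attention with learned projection matrices \(W_q, W_k, W_v\); the dimensions are illustrative, and the steps follow the score-softmax-weighted-sum recipe described above.

```python
# Single-head self-attention: project inputs to Q, K, V, score, softmax, sum.
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # X: (l, d) input vectors x_i for a sentence of length l
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # (l, d_k), (l, d_k), (l, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (l, l) query-key scores
    B = np.exp(scores - scores.max(axis=-1, keepdims=True))
    B = B / B.sum(axis=-1, keepdims=True)        # row-wise softmax -> weights
    return B @ V                                 # (l, d_v) output vectors z_i

l, d, d_k, d_v = 6, 32, 16, 16
X = np.random.randn(l, d)
Z = self_attention(X, np.random.randn(d, d_k),
                   np.random.randn(d, d_k), np.random.randn(d, d_v))
```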
Multi-Head Attention
- Instead of a single self-attention head, there can be \(h\) parallel self-attention heads; this is known as multi-head attention.
- Multi-head attention provides different subspace representations instead of just a single representation for the inputs, which helps capture different aspects of the same inputs (see the sketch below).
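A minimal NumPy sketch of multi-head attention, assuming \(h\) independent heads whose outputs are concatenated and projected back to dimension \(d\) by an output matrix \(W_o\); all names and shapes are illustrative.

```python
# h parallel attention heads, each with its own projections, concatenated.
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    # X: (l, d); Wq/Wk/Wv: (h, d, d_head); Wo: (h * d_head, d)
    heads = []
    for W_q, W_k, W_v in zip(Wq, Wk, Wv):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        B = np.exp(scores - scores.max(axis=-1, keepdims=True))
        B = B / B.sum(axis=-1, keepdims=True)
        heads.append(B @ V)                      # each head: (l, d_head)
    return np.concatenate(heads, axis=-1) @ Wo   # back to (l, d)

l, d, h, d_head = 6, 32, 4, 8
X = np.random.randn(l, d)
Z = multi_head_attention(X, np.random.randn(h, d, d_head),
                         np.random.randn(h, d, d_head),
                         np.random.randn(h, d, d_head),
                         np.random.randn(h * d_head, d))
```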
Masked Multi-Head Attention
- We want the decoder to predict the next word from the encoder sequence and the part of the decoder sequence that the model has already seen.
- For the first layer of the decoder, similar to the sequence-to-sequence architecture, only the previous target tokens should be visible and all future tokens must be masked.
- This is implemented by a masking weight matrix \(M\) that has \(-\infty\) for future tokens and 0 for previous tokens: $$ MA(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T + M}{\sqrt{d_k}}\right) V $$
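A minimal NumPy sketch of the masking step: \(M\) has \(-\infty\) above the diagonal (future tokens) and 0 on and below it, so future positions receive zero attention weight after the softmax. Shapes are illustrative.

```python
# Causal mask added to the scores before the softmax.
import numpy as np

def masked_attention(Q, K, V):
    l, d_k = Q.shape
    M = np.triu(np.full((l, l), -np.inf), k=1)   # -inf for j > i, 0 otherwise
    scores = Q @ K.T / np.sqrt(d_k) + M
    B = np.exp(scores - scores.max(axis=-1, keepdims=True))
    B = B / B.sum(axis=-1, keepdims=True)        # future positions get weight 0
    return B @ V
```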
Positional Encoding
- Word order and positions play a crucial role in most of the NLP tasks. By taking one word at a time, recurrent neural networks essentially incorporate word order.
- To gain speed and parallelism, recurrent neural networks are replaced by multi-head attention layers in transformers.
Requirements
- Unique encoding value for each time-step
- Consistent distance between two time-steps across sentences of various lengths.
- The encoding generalizes independent of the length of the sentence.
- The encoding is deterministic.
- Word embeddings \(W\) and the positional encoding \(P\) are added to generate the input representation \(X = W + P \in \mathbb{R}^{l \times d}\).
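One encoding that satisfies these requirements is the sinusoidal positional encoding of the original Transformer paper. A minimal NumPy sketch (with illustrative \(l\) and \(d\)) follows, ending with the addition \(X = W + P\).

```python
# Sinusoidal positional encoding: unique, deterministic, length-independent.
import numpy as np

def positional_encoding(l, d):
    P = np.zeros((l, d))
    pos = np.arange(l)[:, None]                  # time steps 0 .. l-1
    i = np.arange(0, d, 2)[None, :]              # even dimension indices
    angles = pos / np.power(10000, i / d)
    P[:, 0::2] = np.sin(angles)                  # even dims: sine
    P[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return P                                     # (l, d), deterministic

l, d = 10, 32
W = np.random.randn(l, d)                        # word embeddings (illustrative)
X = W + positional_encoding(l, d)                # input representation X = W + P
```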
Encoder
- The encoder block in the transformer consists of N identical layers.
- Each encoder layer has two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network (see the sketch below).
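A minimal NumPy sketch of one encoder layer, reusing the multi_head_attention sketch from above; the residual connections and layer normalization of the original Transformer paper are assumed here even though they are not listed above.

```python
# One encoder layer: multi-head self-attention + position-wise feed-forward,
# each wrapped with a residual connection and layer normalization (assumed
# from the original paper).
import numpy as np

def layer_norm(X, eps=1e-6):
    mu = X.mean(axis=-1, keepdims=True)
    sigma = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sigma + eps)

def feed_forward(X, W1, b1, W2, b2):
    return np.maximum(0, X @ W1 + b1) @ W2 + b2  # position-wise two-layer MLP

def encoder_layer(X, attn_params, ffn_params):
    # multi_head_attention refers to the earlier sketch
    X = layer_norm(X + multi_head_attention(X, *attn_params))
    X = layer_norm(X + feed_forward(X, *ffn_params))
    return X
```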
Decoder
- The decoder block in the transformer also consists of N identical layers.
- In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
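A minimal NumPy sketch of one decoder layer's three sub-layers, reusing layer_norm and feed_forward from the encoder sketch; attn() is an illustrative single-head helper whose queries come from the decoder and whose keys/values come either from the decoder itself (masked self-attention) or from the encoder output (the third sub-layer).

```python
# One decoder layer: masked self-attention, attention over the encoder output,
# then a position-wise feed-forward network (residuals and layer norm assumed).
import numpy as np

def attn(Xq, Xkv, Wq, Wk, Wv, mask=None):
    Q, K, V = Xq @ Wq, Xkv @ Wk, Xkv @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = scores + mask
    B = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (B / B.sum(axis=-1, keepdims=True)) @ V

def decoder_layer(Y, enc_out, p_self, p_cross, ffn_params):
    l = Y.shape[0]
    causal = np.triu(np.full((l, l), -np.inf), k=1)
    Y = layer_norm(Y + attn(Y, Y, *p_self, mask=causal))   # masked self-attention
    Y = layer_norm(Y + attn(Y, enc_out, *p_cross))         # attention over encoder stack output
    Y = layer_norm(Y + feed_forward(Y, *ffn_params))       # position-wise FFN
    return Y
```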