1. Transformer
Encoder-Decoder Architecture
Many NLP problems, such as machine translation, question answering, and text summarization, use pairs of variable-length sequences as inputs to train the model. The encoder-decoder architecture is used to solve these tasks: the encoder takes in the input sequence and converts it into a fixed-length state, and the decoder takes that fixed-length state and converts it back into a variable-length output.
Encoder
- Input sentence is tokenized into words and words are mapped into feature vectors.
- The state \(h_t\) known as the context variable or the context vector encodes the information of the entire input sequence.
- The RNN can be bidirectional, in which case the hidden state depends not only on the previous hidden state \(h_{t−1}\) and input \(x_t\), but also on the next hidden state \(h_{t+1}\).
Decoder
- The decoder uses the output of the encoder, i.e. the context variable \(c\), together with the given output sequence to generate the decoded outputs.
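To make the encoder-decoder idea concrete, here is a minimal sketch of an RNN-based encoder and decoder, assuming PyTorch and GRU layers; the class names and dimensions (Encoder, Decoder, hidden_dim, etc.) are illustrative, not from the source.

```python
# Minimal RNN encoder-decoder sketch (assumes PyTorch; names are illustrative).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len) token ids
        _, h = self.rnn(self.embed(src))     # h: (1, batch, hidden_dim)
        return h                             # fixed-length context state

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, h):               # tgt: (batch, tgt_len) token ids
        y, h = self.rnn(self.embed(tgt), h)  # condition on the encoder context
        return self.out(y), h                # logits over output tokens
```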
Training
- The decoder predicts a probability distribution over the output tokens at each time step, with a softmax giving the distribution over the vocabulary.
- Encoder and decoder are jointly trained, and the cross-entropy loss is used for optimization.
- Teacher forcing is a strategy for training RNNs that feeds the ground truth as input at each step instead of the prior decoded output (see the sketch after this list).
- Teacher forcing helps in addressing the slow convergence and instability problems when training RNNs.
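Below is a minimal sketch of one training step with teacher forcing, assuming the hypothetical Encoder/Decoder classes from the earlier sketch; shapes, names, and the optimizer are illustrative.

```python
# One teacher-forced training step: the ground-truth target tokens (shifted by
# one position) are fed as decoder inputs, and cross-entropy is computed
# against the next tokens.
import torch
import torch.nn as nn

def train_step(encoder, decoder, src, tgt, optimizer):
    # tgt: (batch, tgt_len), starting with a start-of-sequence token
    optimizer.zero_grad()
    h = encoder(src)                          # context from the source sequence
    decoder_input = tgt[:, :-1]               # ground truth fed as inputs
    logits, _ = decoder(decoder_input, h)     # predict the next token at each step
    target = tgt[:, 1:]                       # shifted ground truth as labels
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), target.reshape(-1))
    loss.backward()                           # encoder and decoder trained jointly
    optimizer.step()
    return loss.item()
```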
Issues
- Information bottleneck - the entire input sequence is compressed into a single fixed-length context vector.
- Sequence lengths can vary at inference time.
- Difficult to parallelize.
- Vanishing/Exploding gradients.
Attention
- Attention mechanism involves selectively focusing on specific elements while filtering out the less relevant ones.
- The attention mechanism can be viewed as a memory of keys and values together with a layer that, given a query, generates an output from the values whose keys match the query.
- The attention layer measures the similarity between the query and the key using a score function \(\alpha\), which returns scores \(a_1, \dots, a_n\) for keys \(k_1, \dots, k_n\), given by $$ a_i = \alpha(q, k_i) $$
Dot Product
- The dot product-based scoring function is the simplest one and has no parameters to tune: $$ \alpha(q, k) = q \cdot k $$
Scaled Dot Product
- The scaled dot product-based scoring function divides the dot product by \(\sqrt{d_k}\) to remove the influence of the dimension \(d_k\): $$ \alpha(q, k) = \frac{q \cdot k}{\sqrt{d_k}} $$
- Attention weights are computed as a softmax over the scores: \(b = \mathrm{softmax}(a)\)
- The final output is the weighted sum of the attention weights and the values (sketched below): $$ o = \sum\limits_{i=1}^{n} b_i v_i $$
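As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention for a single query against \(n\) keys and values; the dimensions are arbitrary placeholders.

```python
# Score -> softmax -> weighted sum, for one query q against n keys/values.
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attention(q, keys, values):
    d_k = q.shape[-1]
    scores = keys @ q / np.sqrt(d_k)   # a_i = q . k_i / sqrt(d_k)
    b = softmax(scores)                # attention weights b = softmax(a)
    return b @ values                  # o = sum_i b_i v_i

q = np.random.randn(8)                 # query of dimension d_k = 8
keys = np.random.randn(5, 8)           # n = 5 keys
values = np.random.randn(5, 16)        # matching values of dimension 16
o = attention(q, keys, values)         # output of dimension 16
```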
Transformer
- Transformer combines the advantages of convolutional neural networks (CNN) to parallelize the computations and recurrent neural networks (RNN) to capture long-range, variable-length sequential information.
- In the transformer architecture, to gain speed and parallelism, recurrent neural networks are replaced by multi-head attention layers.
- Word Embeddings - a lookup over the tokens converts a sentence of length \(l\) into a matrix \(W\) of dimension \(l \times d\).
- Positional Encoding - injects word-order information, which the attention layers alone do not capture (see the Positional Encoding section below).
Attention
Self-Attention
- The inputs \(x_i\) are converted to the output vectors \(z_i\) through the self-attention layer.
- Each input vector \(x_i\) generates three different vectors: the query, key, and value \((q_i, k_i, v_i)\).
- The query, key, and value vectors are obtained by projecting the input vector \(x_i\) at time \(i\) onto the learnable weight matrices \(W_q, W_k, W_v\) to get \(q_i\), \(k_i\), and \(v_i\), respectively (see the sketch after this list).
- Key Roles
- The query vector of token \(i\), \(q_i\), combines with every key vector through the scores \(q_i k_j^T\) for \(j = 1, \dots, l\) to determine the weights for its own output \(z_i\).
- The key vector of token \(i\), \(k_i\), is matched against every query vector and influences the outputs through the query-key product scoring.
- The value vector of token \(i\), \(v_i\), carries the information that is combined, weighted by the query-key scores, into the output vectors \(z_i\).
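A minimal NumPy sketch of single-head self-attention with learned projection matrices \(W_q, W_k, W_v\); the dimensions are illustrative, and the steps follow the score-softmax-weighted-sum recipe described above.

```python
# Single-head self-attention: project inputs to Q, K, V, score, softmax, sum.
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # X: (l, d) input vectors x_i for a sentence of length l
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # (l, d_k), (l, d_k), (l, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (l, l) query-key scores
    B = np.exp(scores - scores.max(axis=-1, keepdims=True))
    B = B / B.sum(axis=-1, keepdims=True)        # row-wise softmax -> weights
    return B @ V                                 # (l, d_v) output vectors z_i

l, d, d_k, d_v = 6, 32, 16, 16
X = np.random.randn(l, d)
Z = self_attention(X, np.random.randn(d, d_k),
                   np.random.randn(d, d_k), np.random.randn(d, d_v))
```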
Multi-Head Attention
- Instead of a single self-attention head, there can be \(h\) parallel self-attention heads; this is known as multi-head attention.
- Multi-head attention provides different subspace representations instead of just a single representation for the inputs, which helps capture different aspects of the same inputs (see the sketch below).
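A minimal NumPy sketch of multi-head attention, assuming \(h\) independent heads whose outputs are concatenated and projected back to dimension \(d\) by an output matrix \(W_o\); all names and shapes are illustrative.

```python
# h parallel attention heads, each with its own projections, concatenated.
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    # X: (l, d); Wq/Wk/Wv: (h, d, d_head); Wo: (h * d_head, d)
    heads = []
    for W_q, W_k, W_v in zip(Wq, Wk, Wv):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        B = np.exp(scores - scores.max(axis=-1, keepdims=True))
        B = B / B.sum(axis=-1, keepdims=True)
        heads.append(B @ V)                      # each head: (l, d_head)
    return np.concatenate(heads, axis=-1) @ Wo   # back to (l, d)

l, d, h, d_head = 6, 32, 4, 8
X = np.random.randn(l, d)
Z = multi_head_attention(X, np.random.randn(h, d, d_head),
                         np.random.randn(h, d, d_head),
                         np.random.randn(h, d, d_head),
                         np.random.randn(h * d_head, d))
```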
Masked Multi-Head Attention
- We want the decoder to predict the next word from the encoder sequence and the part of the decoder sequence that the model has already seen.
- For the first layer of the decoder, similar to the sequence-to-sequence architecture, only the previous target tokens should be visible and all future tokens must be masked.
- This is implemented by a masking weight matrix \(M\) that has \(-\infty\) for future tokens and 0 for previous tokens: $$ MA(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T + M}{\sqrt{d_k}}\right) V $$
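A minimal NumPy sketch of the masking step: \(M\) has \(-\infty\) above the diagonal (future tokens) and 0 on and below it, so future positions receive zero attention weight after the softmax. Shapes are illustrative.

```python
# Causal mask added to the scores before the softmax.
import numpy as np

def masked_attention(Q, K, V):
    l, d_k = Q.shape
    M = np.triu(np.full((l, l), -np.inf), k=1)   # -inf for j > i, 0 otherwise
    scores = Q @ K.T / np.sqrt(d_k) + M
    B = np.exp(scores - scores.max(axis=-1, keepdims=True))
    B = B / B.sum(axis=-1, keepdims=True)        # future positions get weight 0
    return B @ V
```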
Positional Encoding
- Word order and positions play a crucial role in most of the NLP tasks. By taking one word at a time, recurrent neural networks essentially incorporate word order.
- To gain speed and parallelism, recurrent neural networks are replaced by multi-head attention layers in transformers.
Requirements
- Unique encoding value for each time-step
- Consistent distance between two time-steps across sentences of various lengths.
- The encoding generalizes independent of the length of the sentence.
- The encoding is deterministic.
- Word embeddings \(W\) and the positional encoding \(P\) are added to generate the input representation \(X = W + P \in \mathbb{R}^{l \times d}\).
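One encoding that satisfies these requirements is the sinusoidal positional encoding of the original Transformer paper. A minimal NumPy sketch (with illustrative \(l\) and \(d\)) follows, ending with the addition \(X = W + P\).

```python
# Sinusoidal positional encoding: unique, deterministic, length-independent.
import numpy as np

def positional_encoding(l, d):
    P = np.zeros((l, d))
    pos = np.arange(l)[:, None]                  # time steps 0 .. l-1
    i = np.arange(0, d, 2)[None, :]              # even dimension indices
    angles = pos / np.power(10000, i / d)
    P[:, 0::2] = np.sin(angles)                  # even dims: sine
    P[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return P                                     # (l, d), deterministic

l, d = 10, 32
W = np.random.randn(l, d)                        # word embeddings (illustrative)
X = W + positional_encoding(l, d)                # input representation X = W + P
```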
Encoder
- The encoder block in the transformer consists of N identical layers.
- Each encoder layer has two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network (see the sketch below).
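A minimal NumPy sketch of one encoder layer, reusing the multi_head_attention sketch from above; the residual connections and layer normalization of the original Transformer paper are assumed here even though they are not listed above.

```python
# One encoder layer: multi-head self-attention + position-wise feed-forward,
# each wrapped with a residual connection and layer normalization (assumed
# from the original paper).
import numpy as np

def layer_norm(X, eps=1e-6):
    mu = X.mean(axis=-1, keepdims=True)
    sigma = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sigma + eps)

def feed_forward(X, W1, b1, W2, b2):
    return np.maximum(0, X @ W1 + b1) @ W2 + b2  # position-wise two-layer MLP

def encoder_layer(X, attn_params, ffn_params):
    # multi_head_attention refers to the earlier sketch
    X = layer_norm(X + multi_head_attention(X, *attn_params))
    X = layer_norm(X + feed_forward(X, *ffn_params))
    return X
```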
Decoder
- The decoder block in the transformer also consists of N identical layers.
- In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
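A minimal NumPy sketch of one decoder layer's three sub-layers, reusing layer_norm and feed_forward from the encoder sketch; attn() is an illustrative single-head helper whose queries come from the decoder and whose keys/values come either from the decoder itself (masked self-attention) or from the encoder output (the third sub-layer).

```python
# One decoder layer: masked self-attention, attention over the encoder output,
# then a position-wise feed-forward network (residuals and layer norm assumed).
import numpy as np

def attn(Xq, Xkv, Wq, Wk, Wv, mask=None):
    Q, K, V = Xq @ Wq, Xkv @ Wk, Xkv @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = scores + mask
    B = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (B / B.sum(axis=-1, keepdims=True)) @ V

def decoder_layer(Y, enc_out, p_self, p_cross, ffn_params):
    l = Y.shape[0]
    causal = np.triu(np.full((l, l), -np.inf), k=1)
    Y = layer_norm(Y + attn(Y, Y, *p_self, mask=causal))   # masked self-attention
    Y = layer_norm(Y + attn(Y, enc_out, *p_cross))         # attention over encoder stack output
    Y = layer_norm(Y + feed_forward(Y, *ffn_params))       # position-wise FFN
    return Y
```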