
3. Modifications

Lightweight Transformers

  • Funnel-Transformer compresses the output of a transformer encoder layer via pooling before it is passed to the next layer.
  • Encoder - The standard transformer uses the same sequence length for all layers. Funnel-Transformer changes this by placing a pooling layer between the transformer layers to reduce the sequence length. Funnel-Transformer uses mean pooling with stride and window size both set to two (see the sketch after this list).
  • Decoder - To support token-level prediction tasks like machine translation, Funnel-Transformer has an optional decoder that upsamples the compressed encoder output back to the full sequence length.
  • When Funnel-Transformer decreases the sequence length and adds more layers, it performs better than the standard transformer on text classification. When the sequence length is decreased but the depth is not increased, performance drops on the GLUE text classification datasets.
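A minimal PyTorch sketch of the encoder-side idea, assuming `nn.TransformerEncoderLayer` as a stand-in for each encoder block; the class name `FunnelStyleEncoder` and the hyperparameters are illustrative, and the actual Funnel-Transformer pools only the query at the compression step (keys and values still see the unpooled sequence), a detail this simplification skips:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FunnelStyleEncoder(nn.Module):
    """Sketch: standard encoder layers with stride-2 mean pooling between them."""

    def __init__(self, d_model=256, n_heads=4, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_blocks)]
        )

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i < len(self.blocks) - 1:       # pool between blocks, not after the last one
                # mean pooling with window = stride = 2 halves the sequence length
                x = F.avg_pool1d(x.transpose(1, 2), kernel_size=2, stride=2).transpose(1, 2)
        return x                               # (batch, seq_len // 2**(n_blocks - 1), d_model)

compressed = FunnelStyleEncoder()(torch.randn(2, 128, 256))
print(compressed.shape)  # torch.Size([2, 32, 256])
```

The optional decoder could be sketched in the same spirit: repeat each compressed vector back up to the full sequence length (e.g. with `torch.repeat_interleave`) before a few more transformer layers produce token-level outputs.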

RealFormer

  • Residual Attention Layer Transformer (RealFormer) adds residual scores to the raw attention logits of all attention heads from the previous layer of a transformer: $$\text{ResAttn}(Q, K, V, \text{Prev}) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d}} + \text{Prev}\right)V$$ where Prev is the attention logits from the previous transformer layer (see the sketch after this list).
  • This method can be applied to other transformer architectures, including to decoder layers.
  • RealFormer generally performs better than the standard transformer architecture, including when used in BERT, all without increasing the number of model parameters.
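A minimal single-head PyTorch sketch of the idea, assuming each layer simply threads its raw logits through to the next one; the function name and shapes are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def residual_attention(q, k, v, prev_logits=None):
    """Sketch of RealFormer-style residual attention for a single head.

    q, k, v: (batch, seq_len, d_head). prev_logits are the raw attention
    logits from the same head in the previous layer (None in the first layer).
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / math.sqrt(d)  # (batch, seq_len, seq_len)
    if prev_logits is not None:
        logits = logits + prev_logits                # add the previous layer's logits
    weights = F.softmax(logits, dim=-1)
    return weights @ v, logits                       # hand the logits to the next layer

# A stack of layers threads the logits through as it goes:
# out, prev = residual_attention(q1, k1, v1)
# out, prev = residual_attention(q2, k2, v2, prev)
```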

Transformer-XL

  • Transformer-XL was introduced because the standard transformer architecture’s fixed-width context window prevents it from modeling dependencies that lie outside of that window.
  • Transformer-XL can learn dependencies about 450% longer than those of the standard transformer, and its inference is ∼1800 times faster than the standard transformer during evaluation.
  • Segment-Level Recurrence -
  • This works by using the previous segment of text as additional context when processing the current segment of text.
  • In the standard transformer, the nth transformer layer takes the output of the previous layer (n − 1) as input: $$h_t^n = \text{Transformer}(h_t^{n-1})$$
  • When computing the output of the \(n^{th}\) transformer layer for the current segment \(X_{t+1}\), we have a contribution from \(h_t^{n-1}\), the previous segment's hidden states: $$h_{t+1}^n = \text{TransformerXL}(h_{t+1}^{n-1}, h_t^{n-1})$$
  • This modified attention incorporates information from the previous input sequence to compute a representation of the current input sequence.
  • The transformer output for the current segment therefore depends on the output for the previous segment, which in turn depends on the segment before that and on the previous transformer layer, so context accumulates across segments (a minimal sketch follows this list).
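A minimal single-head PyTorch sketch of segment-level recurrence, assuming cached hidden states from the previous segment are concatenated with the current segment to form the keys and values; the relative positional encoding scheme and the causal mask from the paper are omitted, and the names below are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def segment_recurrent_attention(h_curr, h_prev, w_q, w_k, w_v):
    """Sketch of one Transformer-XL-style attention head with segment recurrence.

    h_curr: (batch, L, d)  hidden states of the current segment (from layer n-1)
    h_prev: (batch, L, d)  cached hidden states of the previous segment (layer n-1)
    w_q, w_k, w_v: (d, d_head) projection matrices.
    """
    # The cached segment serves as extra context only: no gradients flow into it.
    context = torch.cat([h_prev.detach(), h_curr], dim=1)       # (batch, 2L, d)

    q = h_curr @ w_q        # queries come only from the current segment: (batch, L, d_head)
    k = context @ w_k       # keys/values cover previous + current:       (batch, 2L, d_head)
    v = context @ w_v

    logits = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # (batch, L, 2L)
    # a causal mask would be applied to `logits` here for language modelling
    return F.softmax(logits, dim=-1) @ v                        # (batch, L, d_head)
```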

Longformer

  • When calculating self-attention there are usually no restrictions on which positions in the sequence can attend to each other.
  • Longformer changes this by restricting which positions can attend to each other according to specific patterns.
  • This results in sparse attention weights across all heads and corresponds to deleting edges from the attention graph.
  • Sliding Window Attention-
  • Each token in a sequence is given a fixed-size context window, w, so it can only attend to its neighbors.
  • This simple change reduces the complexity from one that is quadratic in the sequence length to one that is linear in the sequence length, O(Lw).

  • Dilated Sliding Window Attention -

  • This attention pattern extends the reach of sliding window attention by adding gaps of size d in the context window (similar to dilation in convolutional layers).

  • Global Attention -

  • The global attention pattern lets some chosen tokens attend to every other token in the sequence. In turn, all tokens in the sequence attend to those tokens.

  • Longformer decides which tokens are allowed to have global attention based on the training task. Longformer combines this global attention with the sliding window attention (a mask-building sketch follows this list).

  • Longformer uses small window sizes for lower layers and larger window sizes for higher layers. This gives the higher layers a hierarchical nature.
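A minimal PyTorch sketch of which positions may attend under the combined patterns; it builds a dense boolean mask purely for illustration (the real Longformer never materializes the full L × L matrix, which is the whole point), and the function name and arguments are illustrative:

```python
import torch

def longformer_style_mask(seq_len, window, dilation=1, global_positions=()):
    """Boolean mask (True = may attend) combining Longformer's attention patterns.

    window: one-sided window size w; dilation: gap size d for the dilated variant;
    global_positions: indices of tokens given global attention.
    """
    idx = torch.arange(seq_len)
    offset = idx[None, :] - idx[:, None]                 # (L, L) relative offsets

    # (dilated) sliding window: attend within +/- w * d, stepping by d
    mask = (offset.abs() <= window * dilation) & (offset % dilation == 0)

    # global attention: chosen tokens attend everywhere, and everyone attends to them
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    return mask

mask = longformer_style_mask(seq_len=16, window=2, dilation=2, global_positions=[0])
print(mask.int())
```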

Reformer

  • Reformer reduces the standard transformer's memory usage by using reversible residual networks, and it modifies the attention mechanism to reduce its cost.
  • Reformer can include context windows that are several orders of magnitude larger than a Transformer (up to 1M words).
  • Scaled dot-product attention in the standard transformer has time complexity of \(O(L^2)\), which becomes prohibitive as the number of tokens in the sequence, L, increases.
  • Reformer uses locality-sensitive hashing to reduce the time complexity from \(O(L^2)\) to \(O(L \log L)\).
  • Reformer addresses the memory usage of the standard transformer using reversible residual layers (a minimal sketch follows this list).
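A minimal PyTorch sketch of a reversible residual block of the kind Reformer builds on, assuming `f` and `g` stand in for the attention and feed-forward sublayers; a real implementation pairs this with a custom backward pass that recomputes activations from the outputs instead of storing them, and the LSH attention half of Reformer is not shown here:

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """y1 = x1 + f(x2), y2 = x2 + g(y1); the inputs are exactly recoverable."""

    def __init__(self, f, g):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # recompute the inputs from the outputs, so activations need not be stored
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

d = 8
block = ReversibleBlock(nn.Linear(d, d), nn.Linear(d, d))
x1, x2 = torch.randn(2, d), torch.randn(2, d)
y1, y2 = block(x1, x2)
r1, r2 = block.inverse(y1, y2)
print(torch.allclose(x1, r1, atol=1e-5), torch.allclose(x2, r2, atol=1e-5))
```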