4. Applications
Since the initial application of transformers to machine translation, the transformer architecture has been applied to computer vision, audio processing, and video processing, as well as other problems in NLP.
Text Processing
BioBERT
- Domain-specific language model constructed by further pre-training BERT on a large collection of biomedical text.
- BioBERT outperformed previous models on three NLP tasks useful for biomedical text mining: named entity recognition (NER), relationship extraction, and question answering (QA).
- Training
- The starting point is a pre-trained BERT model.
- The next step is to further pre-train the model on PubMed abstracts and PubMed Central full-text articles.
- Finally, after pre-training, BioBERT is fine-tuned on the NER, relationship extraction, and question answering tasks using task-specific datasets.
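The fine-tuning stage can be sketched with the Hugging Face transformers library. This is a minimal illustration rather than BioBERT's actual training code: the checkpoint name `dmis-lab/biobert-base-cased-v1.1`, the toy disease tag set, and the single example sentence are assumptions made for the example.

```python
# Minimal sketch: fine-tuning a pre-trained BioBERT checkpoint for NER.
# The checkpoint name and the three-tag label set are illustrative assumptions.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-Disease", "I-Disease"]  # toy biomedical NER tag set
checkpoint = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(labels)
)

# One training step on a single sentence; a real run would loop over a
# task-specific dataset and mask special tokens with label -100.
enc = tokenizer("BRCA1 mutations increase breast cancer risk.",
                return_tensors="pt")
dummy_tags = torch.zeros_like(enc["input_ids"])  # every token tagged "O"
loss = model(**enc, labels=dummy_tags).loss
loss.backward()  # gradients for one optimizer update
```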
SciBERT
- It uses the BERT architecture, but was trained on the full text and abstracts of 1.4 million papers from the Semantic Scholar website.
- SciBERT uses WordPiece tokenization with a 30,000-token vocabulary, but it does not use BERT's vocabulary.
- Instead, SciBERT's vocabulary is built from the Semantic Scholar corpus.
- SciBERT was evaluated on five NLP tasks: NER, PICO extraction, text classification, relationship extraction, and dependency parsing.
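The effect of the in-domain vocabulary can be seen by tokenizing a scientific term with both vocabularies. A small sketch, assuming the publicly released Hugging Face checkpoints bert-base-uncased and allenai/scibert_scivocab_uncased:

```python
# Sketch: compare how BERT's general vocabulary and SciBERT's in-domain
# vocabulary split a scientific term. Checkpoint names are assumptions.
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")
scibert = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

term = "immunohistochemistry"
print(bert.tokenize(term))     # several WordPiece fragments from the general vocab
print(scibert.tokenize(term))  # usually fewer pieces, since the vocabulary is
                               # built from the Semantic Scholar corpus
```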
Text Generation
GPT
- In generative pre-training, the model is first trained on unlabeled data in a task-agnostic fashion and later fine-tuned for a specific task.
- GPT is an autoregressive model, which means it uses inputs from previous steps of a sequence to predict values later in the sequence.
- GPT was evaluated on four kinds of natural language understanding tasks: natural language inference, question answering, semantic similarity, and text classification.
- Training
- Unsupervised pre-training
- Supervised fine-tuning
- The model was a stack of 12 transformer decoder layers with 12 masked self-attention heads. The model dimension is 768, and the feedforward layers use \(d_{ff} = 3072\). Positional embeddings were learned, instead of the fixed sinusoidal embeddings used in the standard Transformer.
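A rough PyTorch sketch of this configuration is shown below. It uses `nn.TransformerEncoderLayer` with a causal mask to stand in for the masked-self-attention decoder blocks; the vocabulary size and context length are assumptions, and details such as dropout, weight tying, and GPT's exact training setup are omitted.

```python
# Rough sketch of the GPT-1 configuration described above: 12 decoder-only
# layers of masked self-attention with 12 heads, d_model = 768, d_ff = 3072,
# and learned positional embeddings. Vocabulary size and context length are
# illustrative assumptions.
import torch
import torch.nn as nn

class GPTSketch(nn.Module):
    def __init__(self, vocab=40000, ctx=512, d_model=768, heads=12,
                 d_ff=3072, layers=12):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(ctx, d_model)      # learned, not sinusoidal
        block = nn.TransformerEncoderLayer(d_model, heads, d_ff,
                                           activation="gelu", batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.lm_head = nn.Linear(d_model, vocab, bias=False)

    def forward(self, ids):
        t = ids.shape[1]
        causal = nn.Transformer.generate_square_subsequent_mask(t)
        x = self.tok(ids) + self.pos(torch.arange(t, device=ids.device))
        x = self.blocks(x, mask=causal)            # masked self-attention
        return self.lm_head(x)                     # next-token logits

logits = GPTSketch()(torch.randint(0, 40000, (1, 16)))  # shape (1, 16, 40000)
```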
GPT-2
- The first GPT model demonstrated the power of generative pre-training; GPT-2 then showed that a language model can learn specific NLP tasks without being explicitly trained on those tasks.
- In the standard transformer, the layer norm module comes after the multi-head attention and after the position-wise feedforward network, as part of the residual connection.
- In GPT-2, the layer norm module instead comes before the multi-head attention and before the position-wise feedforward network (see the sketch after this list).
- GPT-2 models of several sizes were evaluated on a suite of language modeling datasets without any additional training; GPT-2 achieved state-of-the-art results on seven of the eight datasets.
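The difference between the two placements is where layer norm sits relative to the residual connection. A minimal sketch, with `sublayer` standing for either the multi-head attention or the position-wise feedforward network:

```python
# Post-norm (standard transformer) vs. pre-norm (GPT-2) residual sublayers.
# `sublayer` stands for either multi-head attention or the feedforward network.
import torch.nn as nn

def post_norm_step(x, sublayer, norm: nn.LayerNorm):
    # Standard transformer: normalize after the residual addition.
    return norm(x + sublayer(x))

def pre_norm_step(x, sublayer, norm: nn.LayerNorm):
    # GPT-2: normalize the sublayer input, then add the residual.
    return x + sublayer(norm(x))
```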
GPT-3
- GPT-3 is part of the trend in transformer language models where an increase in the number of parameters leads to an increase in the language model’s ability to perform downstream tasks with little to no task-specific training.
- Attention mechanisms in the transformer layers alternated between dense and locally banded sparse patterns.
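In a locally banded sparse pattern, each position attends only to a fixed window of preceding positions, while the dense layers use the full causal mask. A small sketch of such a mask; the window size is an illustrative assumption:

```python
# Sketch: a causal attention mask that is locally banded, so each position may
# attend only to itself and the previous `window - 1` positions. GPT-3
# alternates layers using the full causal mask with layers using masks like
# this one.
import torch

def banded_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)      # causal and within the local band

print(banded_causal_mask(6, window=3).int())
```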
Computer Vision
ViT
- Given an image with resolution \(H \times W\), ViT breaks the two-dimensional image into a sequence of \(N\) patches, each with resolution \(P \times P\), so that \(N = HW/P^2\).
- This sequence of patches plays the same role as the sequence of tokens in the standard transformer.
- A learnable embedding \(x_{cls}\), analogous to the [CLS] token in BERT, is prepended to the sequence of patch embeddings before it is passed to the transformer encoder (see the sketch after this list).
- ViT demonstrates that the inductive biases built into CNNs are helpful on smaller datasets but become less important when very large training sets are available.
- Experiments demonstrated that hard-coding the two-dimensional structure of the image patches into the positional encodings does not improve quality.
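The patch-and-prepend pipeline can be sketched directly in PyTorch. The image size (224 × 224), patch size (16), and model dimension (768) are typical ViT-Base values chosen for illustration, not taken from the notes above:

```python
# Sketch of the ViT input pipeline: split an H x W image into N = HW / P^2
# patches of size P x P, linearly embed each patch, and prepend the learnable
# class embedding x_cls. Sizes are illustrative (ViT-Base-like) assumptions.
import torch
import torch.nn as nn

H, W, C, P, d_model = 224, 224, 3, 16, 768
N = (H // P) * (W // P)                              # 196 patches here

img = torch.randn(1, C, H, W)
patches = img.unfold(2, P, P).unfold(3, P, P)        # (1, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, N, C * P * P)

embed = nn.Linear(C * P * P, d_model)                # patch embedding layer
x_cls = nn.Parameter(torch.zeros(1, 1, d_model))     # learnable [CLS]-like token
pos = nn.Parameter(torch.zeros(1, N + 1, d_model))   # learned positional embeddings

tokens = torch.cat([x_cls, embed(patches)], dim=1) + pos   # (1, N + 1, d_model)
```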
Multimodal Learning
ViLBERT
- Vision-and-Language BERT (ViLBERT) is a joint model for learning task-agnostic representations of images and text.
- The image is first converted into a sequence of regions, each mapped to a feature vector by an object detection network.
- The text stream follows the usual flow: tokens are mapped to word and positional embeddings.
- ViLBERT uses the standard transformer block (TRM) and a co-attention transformer block (Co-TRM) that provides sparse interactions between the two modalities.
- The Co-TRM module computes query, key, and value matrices from the visual and linguistic representations at any intermediate layer, just as the standard transformer block does.
- The keys and values from each modality are then passed as input to the other modality's multi-headed attention block (see the sketch at the end of this section).
- The multi-headed attention block of Co-TRM therefore produces attention-pooled features for each modality conditioned on the other, enabling joint learning.
- The output is task-dependent and can be as simple as a multi-layer perceptron (MLP) followed by a softmax layer that computes a similarity score between the image and the text.
- Training
- First, the two streams are pre-trained independently.
- The BERT model is trained end-to-end on a large text corpus on two tasks: masked language modeling (MLM) and next sentence prediction (NSP).
- A pre-trained Faster R-CNN object detection network extracts bounding boxes and their visual features from the images.
- The Conceptual Captions dataset of images and their captions is then used for joint pre-training tasks, such as aligning image regions with the text and predicting masked words from the image.
- The model is fine-tuned for specific tasks such as visual question answering (VQA), visual commonsense reasoning (VCR), and image retrieval.
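The co-attention exchange inside a Co-TRM block, referenced above, is that each modality's queries attend over the other modality's keys and values. A minimal sketch using PyTorch's multi-head attention; the sequence lengths are illustrative, and the residual, layer-norm, and feedforward sublayers of the full block are omitted:

```python
# Sketch of the Co-TRM exchange: each stream's queries attend over the other
# stream's keys and values. Dimensions and sequence lengths are illustrative.
import torch
import torch.nn as nn

d_model, heads = 768, 8
attn_vision = nn.MultiheadAttention(d_model, heads, batch_first=True)
attn_text = nn.MultiheadAttention(d_model, heads, batch_first=True)

img_feats = torch.randn(1, 36, d_model)   # e.g. 36 detected image regions
txt_feats = torch.randn(1, 20, d_model)   # e.g. 20 word-piece tokens

# Visual stream: queries from image regions, keys/values from the text.
vis_out, _ = attn_vision(img_feats, txt_feats, txt_feats)
# Linguistic stream: queries from text, keys/values from the image regions.
txt_out, _ = attn_text(txt_feats, img_feats, img_feats)
```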