
2. BERT

Bidirectional Encoder Representations from Transformers (BERT) is considered the onset of a revolution in the field of NLP. BERT uses unlabeled text to pre-train deep bidirectional contextual representations. The result is a rich pre-trained language model that can be fine-tuned with a simple additional output layer to produce state-of-the-art performance on a wide range of NLP tasks.
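A minimal sketch of this fine-tuning idea, assuming the Hugging Face `transformers` library, PyTorch, and the `bert-base-uncased` checkpoint (illustrative choices): a classification output layer is added on top of the pre-trained encoder and the whole model is trained end-to-end.

```python
# Minimal sketch (assumptions: Hugging Face `transformers`, PyTorch, and the
# "bert-base-uncased" checkpoint) of fine-tuning pre-trained BERT by adding a
# simple classification output layer.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)      # pre-trained encoder + new output layer

inputs = tokenizer("A delightfully sharp and funny film.", return_tensors="pt")
labels = torch.tensor([1])                  # e.g., 1 = positive sentiment

outputs = model(**inputs, labels=labels)
outputs.loss.backward()                     # fine-tune all weights end-to-end
print(outputs.logits)
```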

Architecture

  • Core Layers

    • Number of transformer layers
    • Size of hidden representations
    • Number of bidirectional self-attention heads
  • Input/Output Representation

    • BERT's input is designed to represent NLP downstream tasks involving either a single sentence or a pair of sentences using the same input representation.
    • BERT prefixes the input with a special [CLS] token. The hidden vector of this token in the last BERT layer is used as an aggregate representation of the entire input sequence.
    • For NLP tasks with paired sentences, BERT concatenates the sentences into one sequence delimited by the special separator token [SEP]. This is one of the ways BERT distinguishes the two sentences (see the input-representation sketch after this list).
    • Embeddings
      • WordPiece Tokenization
      • Token Embedding + Segment Embedding + Positional Embedding
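The following is a minimal sketch, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (illustrative choices), of BERT-Base's core hyper-parameters and of how the single-sentence and sentence-pair input representations look in practice.

```python
# Minimal sketch (assumes Hugging Face `transformers` and "bert-base-uncased"):
# core hyper-parameters plus single- vs. paired-sentence input representation.
from transformers import BertConfig, BertTokenizer

config = BertConfig()                    # defaults correspond to BERT-Base
print(config.num_hidden_layers)          # 12 transformer layers
print(config.hidden_size)                # 768-dimensional hidden representations
print(config.num_attention_heads)        # 12 bidirectional self-attention heads

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Single sentence: [CLS] WordPiece tokens ... [SEP]
single = tokenizer("The cat sat on the mat.")
print(tokenizer.convert_ids_to_tokens(single["input_ids"]))

# Sentence pair: [CLS] sentence A [SEP] sentence B [SEP]
pair = tokenizer("The cat sat on the mat.", "It quickly fell asleep.")
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
# Segment (token_type) embeddings distinguish the two sentences: 0s for A, 1s for B
print(pair["token_type_ids"])
```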

Pretraining

  • BERT pre-training combines the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks, optimizing the model parameters over their combined loss function.

  • Masked Language Modeling

    • The idea is to randomly mask out a percentage of the input sequence tokens, replacing them with the special [MASK] token. During pre-training, the modified input sequence is run through BERT, and the output representations of the masked tokens are fed into a softmax layer over the vocabulary.
    • Bidirectional attention of the transformer encoder forces the [MASK] prediction task to use the context provided by the other non-masked tokens in the sequence.
    • BERT is pre-trained with a 15% mask-out rate. Each of the selected tokens is subjected to the following heuristic (sketched in code after this list):
      • With a probability of 80%, the token is replaced with the special [MASK] token
      • With a probability of 10%, the token is replaced with a random token.
      • With a probability of 10%, the token is left unchanged.
    • The MLM task uses cross-entropy loss only over the masked tokens and ignores the prediction of all non-masked ones.
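A minimal sketch of the masking heuristic and the masked-only loss, assuming PyTorch, a Hugging Face tokenizer, and the common convention of label -100 for positions excluded from the loss (all illustrative assumptions); special tokens are not excluded here for brevity.

```python
# Sketch of the 15% mask-out rate with the 80/10/10 replacement heuristic.
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def mask_tokens(input_ids, mask_rate=0.15):
    labels = input_ids.clone()
    # Select 15% of the positions as prediction targets
    # (special tokens are not excluded in this simplified sketch).
    masked = torch.rand(input_ids.shape) < mask_rate
    labels[~masked] = -100                        # loss will ignore non-masked tokens

    # 80% of the selected tokens: replace with [MASK].
    use_mask = (torch.rand(input_ids.shape) < 0.8) & masked
    input_ids[use_mask] = tokenizer.mask_token_id

    # 10%: replace with a random vocabulary token (half of the remaining 20%).
    use_random = (torch.rand(input_ids.shape) < 0.5) & masked & ~use_mask
    random_ids = torch.randint(len(tokenizer), input_ids.shape)
    input_ids[use_random] = random_ids[use_random]

    # Remaining 10%: the token is left unchanged.
    return input_ids, labels

batch = tokenizer(["The cat sat on the mat."], return_tensors="pt")
masked_ids, labels = mask_tokens(batch["input_ids"].clone())
print(tokenizer.convert_ids_to_tokens(masked_ids[0].tolist()))

# The MLM loss is cross-entropy over the masked positions only, e.g.:
#   torch.nn.functional.cross_entropy(logits.view(-1, logits.size(-1)),
#                                     labels.view(-1), ignore_index=-100)
```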
  • Next Sentence Prediction (NSP)

    • BERT is fed pairs of sentences and pre-trained to predict whether the second sentence actually follows the first one in a continuous context.
    • The first sentence is prefixed with the [CLS] token, then the two sentences are delimited by the special token [SEP].
    • The model is given sentence pairs where, 50% of the time, the second sentence actually follows the first, and, the other 50% of the time, it is a random sentence drawn from the full training corpus.
    • BERT's representation of the [CLS] token encodes both input sentences. NSP pre-training is therefore performed by adding a single-layer MLP with softmax atop the [CLS] token representation to predict the binary NSP label, as sketched below.
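A minimal sketch of such an NSP head, assuming PyTorch and BERT-Base's hidden size of 768 (illustrative choices): a single linear layer whose softmax output predicts IsNext/NotNext from the last-layer [CLS] vector.

```python
# Sketch of an NSP head: a single-layer classifier over the [CLS] representation.
import torch
import torch.nn as nn

class NSPHead(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)   # two classes: IsNext / NotNext

    def forward(self, cls_hidden_state):
        # cls_hidden_state: (batch, hidden_size), last-layer vector of the [CLS] token
        return self.classifier(cls_hidden_state)

cls_vectors = torch.randn(4, 768)            # stand-in for BERT's [CLS] outputs
head = NSPHead()
logits = head(cls_vectors)
probs = torch.softmax(logits, dim=-1)        # softmax over {IsNext, NotNext}
labels = torch.tensor([1, 0, 1, 0])          # 1 = second sentence follows the first
loss = nn.functional.cross_entropy(logits, labels)
```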

BERT Variants

  • RoBERTa
    • Robustly Optimized BERT Pre-training Approach. For the MLM task, BERT randomly masks tokens during the data pre-processing stage, so the masks stay static throughout the entire model training process.
    • RoBERTa, on the other hand, follows a dynamic masking strategy where masked tokens are randomly chosen for each training epoch (see the sketch after this list).
    • RoBERTa also drops the NSP pre-training task and uses only the dynamic MLM task.
    • Training on longer sequences - full sentences are sampled contiguously from one or more documents, up to a total of at most 512 tokens.
    • Large batch size - RoBERTa showed that using large mini-batches with increased learning rates during pre-training improves the perplexity of the dynamic MLM task as well as the downstream task performance.
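A minimal, self-contained sketch of the difference between static and dynamic masking, using toy data and a simplified mask function rather than RoBERTa's actual implementation.

```python
# Static vs. dynamic masking (toy sketch, assuming PyTorch).
import torch

def mask_tokens(input_ids, mask_rate=0.15, mask_token_id=103):
    """Toy 15% masking; see the MLM sketch above for the full 80/10/10 heuristic."""
    ids = input_ids.clone()
    chosen = torch.rand(ids.shape) < mask_rate
    ids[chosen] = mask_token_id
    return ids

raw_batch = torch.randint(1000, 30000, (8, 128))    # stand-in batch of token ids

# Static masking (BERT): masks are drawn once during data pre-processing and
# the same masked sequences are reused in every epoch.
static_batch = mask_tokens(raw_batch)
for epoch in range(3):
    masked = static_batch                            # identical masks each epoch

# Dynamic masking (RoBERTa): masks are re-drawn each time the batch is used,
# so the model sees different masked positions for the same sentences.
for epoch in range(3):
    masked = mask_tokens(raw_batch)                  # fresh masks each epoch
```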

BERTopic

  • Topic modeling is one of the challenging tasks in NLP. Advances in BERT and its variants have motivated the NLP community to leverage BERT for topic modeling.
  • BERTopic starts by creating embeddings of the documents of interest using BERT models.
  • Preprocessing divides each document into smaller paragraphs or sentences that fit within the transformer model's maximum token length.
  • Clustering is then performed on the document embeddings to group documents with similar topics together (see the sketch below).
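A simplified sketch of this embed-then-cluster pipeline, assuming the `sentence-transformers` and `scikit-learn` packages and the `all-MiniLM-L6-v2` encoder (illustrative choices); the full BERTopic library additionally applies dimensionality reduction and c-TF-IDF topic representations.

```python
# Simplified embed-then-cluster pipeline (not the full BERTopic implementation).
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

documents = [
    "The stock market rallied after the earnings report.",
    "Interest rates were left unchanged by the central bank.",
    "The team scored twice in the final minutes of the match.",
    "The striker signed a new three-year contract.",
]

# 1. Create document embeddings with a BERT-based sentence encoder.
#    (Long documents would first be split into paragraphs or sentences that
#    fit the model's maximum token length.)
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(documents)

# 2. Cluster the embeddings so documents with similar topics group together.
clusters = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
print(list(zip(clusters, documents)))
```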