
2. BERT

Bidirectional Encoder Representations from Transformers (BERT) is considered the onset of a revolution in the field of NLP. BERT uses unlabeled text to pre-train deep bidirectional contextual representations. The result is a rich pre-trained language model that can be fine-tuned with a simple additional output layer to produce state-of-the-art performance on a wide range of NLP tasks.
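A minimal sketch of this fine-tuning idea, assuming the Hugging Face `transformers` library, PyTorch, and the `bert-base-uncased` checkpoint (illustrative choices): a classification output layer is added on top of the pre-trained encoder and the whole model is trained end-to-end.

```python
# Minimal sketch (assumptions: Hugging Face `transformers`, PyTorch, and the
# "bert-base-uncased" checkpoint) of fine-tuning pre-trained BERT by adding a
# simple classification output layer.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)      # pre-trained encoder + new output layer

inputs = tokenizer("A delightfully sharp and funny film.", return_tensors="pt")
labels = torch.tensor([1])                  # e.g., 1 = positive sentiment

outputs = model(**inputs, labels=labels)
outputs.loss.backward()                     # fine-tune all weights end-to-end
print(outputs.logits)
```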

Architecture

  • Core Layers

    • Number of transformer layers
    • Size of hidden representations
    • Number of bidirectional self-attention heads
  • Input/Output Representation

    • BERT's input is designed to represent NLP downstream tasks involving either a single sentence or a pair of sentences using the same input representation.
    • BERT prefixes the input with a special [CLS] token. The hidden vector of this token in the last BERT layer is used as an aggregate representation of the entire input sequence.
    • For NLP tasks with paired sentences, BERT concatenates the sentences into one sequence delimited by the special separator token [SEP]. This is one of the ways BERT distinguishes the two sentences (see the input-representation sketch after this list).
    • Embeddings
      • WordPiece Tokenization
      • Token Embedding + Segment Embedding + Positional Embedding
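The following is a minimal sketch, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (illustrative choices), of BERT-Base's core hyper-parameters and of how the single-sentence and sentence-pair input representations look in practice.

```python
# Minimal sketch (assumes Hugging Face `transformers` and "bert-base-uncased"):
# core hyper-parameters plus single- vs. paired-sentence input representation.
from transformers import BertConfig, BertTokenizer

config = BertConfig()                    # defaults correspond to BERT-Base
print(config.num_hidden_layers)          # 12 transformer layers
print(config.hidden_size)                # 768-dimensional hidden representations
print(config.num_attention_heads)        # 12 bidirectional self-attention heads

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Single sentence: [CLS] WordPiece tokens ... [SEP]
single = tokenizer("The cat sat on the mat.")
print(tokenizer.convert_ids_to_tokens(single["input_ids"]))

# Sentence pair: [CLS] sentence A [SEP] sentence B [SEP]
pair = tokenizer("The cat sat on the mat.", "It quickly fell asleep.")
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
# Segment (token_type) embeddings distinguish the two sentences: 0s for A, 1s for B
print(pair["token_type_ids"])
```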

Pretraining

  • BERT pre-training combines the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks, optimizing the model parameters over their combined loss function.

  • Masked Language Modeling

    • The idea is to randomly mask out a percentage of the input sequence tokens, replacing them with the special [MASK] token. During pre-training, the modified input sequence is run through BERT, and the output representations of the masked tokens are fed into a softmax layer over the vocabulary.
    • Bidirectional attention of the transformer encoder forces the [MASK] prediction task to use the context provided by the other non-masked tokens in the sequence.
    • BERT is pre-trained with a 15% mask-out rate. Each of the selected tokens is subjected to the following heuristic (sketched in code after this list):
      • With a probability of 80%, the token is replaced with the special [MASK] token
      • With a probability of 10%, the token is replaced with a random token.
      • With a probability of 10%, the token is left unchanged.
    • The MLM task uses cross-entropy loss only over the masked tokens and ignores the prediction of all non-masked ones.
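A minimal sketch of the masking heuristic and the masked-only loss, assuming PyTorch, a Hugging Face tokenizer, and the common convention of label -100 for positions excluded from the loss (all illustrative assumptions); special tokens are not excluded here for brevity.

```python
# Sketch of the 15% mask-out rate with the 80/10/10 replacement heuristic.
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def mask_tokens(input_ids, mask_rate=0.15):
    labels = input_ids.clone()
    # Select 15% of the positions as prediction targets
    # (special tokens are not excluded in this simplified sketch).
    masked = torch.rand(input_ids.shape) < mask_rate
    labels[~masked] = -100                        # loss will ignore non-masked tokens

    # 80% of the selected tokens: replace with [MASK].
    use_mask = (torch.rand(input_ids.shape) < 0.8) & masked
    input_ids[use_mask] = tokenizer.mask_token_id

    # 10%: replace with a random vocabulary token (half of the remaining 20%).
    use_random = (torch.rand(input_ids.shape) < 0.5) & masked & ~use_mask
    random_ids = torch.randint(len(tokenizer), input_ids.shape)
    input_ids[use_random] = random_ids[use_random]

    # Remaining 10%: the token is left unchanged.
    return input_ids, labels

batch = tokenizer(["The cat sat on the mat."], return_tensors="pt")
masked_ids, labels = mask_tokens(batch["input_ids"].clone())
print(tokenizer.convert_ids_to_tokens(masked_ids[0].tolist()))

# The MLM loss is cross-entropy over the masked positions only, e.g.:
#   torch.nn.functional.cross_entropy(logits.view(-1, logits.size(-1)),
#                                     labels.view(-1), ignore_index=-100)
```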
  • Next Sentence Prediction (NSP)

    • BERT is fed pairs of sentences and pre-trained to predict whether the second sentence actually follows the first one in a continuous context.
    • The first sentence is prefixed with the [CLS] token, then the two sentences are delimited by the special token [SEP].
    • The model is given sentence pairs where, 50% of the time, the second sentence actually follows the first, and, the other 50% of the time, it is a random sentence drawn from the full training corpus.
    • BERT's representation of the [CLS] token encodes both input sentences. NSP pre-training is therefore performed by adding a single-layer MLP with softmax atop the [CLS] token representation to predict the binary NSP label, as sketched below.
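A minimal sketch of such an NSP head, assuming PyTorch and BERT-Base's hidden size of 768 (illustrative choices): a single linear layer whose softmax output predicts IsNext/NotNext from the last-layer [CLS] vector.

```python
# Sketch of an NSP head: a single-layer classifier over the [CLS] representation.
import torch
import torch.nn as nn

class NSPHead(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)   # two classes: IsNext / NotNext

    def forward(self, cls_hidden_state):
        # cls_hidden_state: (batch, hidden_size), last-layer vector of the [CLS] token
        return self.classifier(cls_hidden_state)

cls_vectors = torch.randn(4, 768)            # stand-in for BERT's [CLS] outputs
head = NSPHead()
logits = head(cls_vectors)
probs = torch.softmax(logits, dim=-1)        # softmax over {IsNext, NotNext}
labels = torch.tensor([1, 0, 1, 0])          # 1 = second sentence follows the first
loss = nn.functional.cross_entropy(logits, labels)
```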

BERT Variants

  • RoBERTa
    • Robustly Optimized BERT Pre-training Approach. For the MLM task, BERT randomly masks tokens during the data pre-processing stage, so the masks stay static throughout the entire model training process.
    • RoBERTa, on the other hand, follows a dynamic masking strategy where masked tokens are randomly chosen for each training epoch (see the sketch after this list).
    • RoBERTa also drops the NSP pre-training task and uses only the dynamic MLM task.
    • Training on longer sequences - full sentences are sampled contiguously from one or more documents, up to a total of at most 512 tokens.
    • Large batch size - RoBERTa showed that using large mini-batches with increased learning rates during pre-training improves the perplexity of the dynamic MLM task as well as the downstream task performance.
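A minimal, self-contained sketch of the difference between static and dynamic masking, using toy data and a simplified mask function rather than RoBERTa's actual implementation.

```python
# Static vs. dynamic masking (toy sketch, assuming PyTorch).
import torch

def mask_tokens(input_ids, mask_rate=0.15, mask_token_id=103):
    """Toy 15% masking; see the MLM sketch above for the full 80/10/10 heuristic."""
    ids = input_ids.clone()
    chosen = torch.rand(ids.shape) < mask_rate
    ids[chosen] = mask_token_id
    return ids

raw_batch = torch.randint(1000, 30000, (8, 128))    # stand-in batch of token ids

# Static masking (BERT): masks are drawn once during data pre-processing and
# the same masked sequences are reused in every epoch.
static_batch = mask_tokens(raw_batch)
for epoch in range(3):
    masked = static_batch                            # identical masks each epoch

# Dynamic masking (RoBERTa): masks are re-drawn each time the batch is used,
# so the model sees different masked positions for the same sentences.
for epoch in range(3):
    masked = mask_tokens(raw_batch)                  # fresh masks each epoch
```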

BERTopic

  • Topic modeling is one of the challenging tasks in NLP. Advances in BERT and its variants have motivated the NLP community to leverage BERT for topic modeling.
  • BERTopic starts by creating embeddings of the documents of interest using BERT models.
  • Preprocessing divides each document into smaller paragraphs or sentences that fit within the transformer model's maximum token length.
  • Clustering is then performed on the document embeddings to group documents with similar topics together (see the sketch below).
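A simplified sketch of this embed-then-cluster pipeline, assuming the `sentence-transformers` and `scikit-learn` packages and the `all-MiniLM-L6-v2` encoder (illustrative choices); the full BERTopic library additionally applies dimensionality reduction and c-TF-IDF topic representations.

```python
# Simplified embed-then-cluster pipeline (not the full BERTopic implementation).
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

documents = [
    "The stock market rallied after the earnings report.",
    "Interest rates were left unchanged by the central bank.",
    "The team scored twice in the final minutes of the match.",
    "The striker signed a new three-year contract.",
]

# 1. Create document embeddings with a BERT-based sentence encoder.
#    (Long documents would first be split into paragraphs or sentences that
#    fit the model's maximum token length.)
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(documents)

# 2. Cluster the embeddings so documents with similar topics group together.
clusters = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
print(list(zip(clusters, documents)))
```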