3. Pre-Training
Section 3 of the paper provides an in-depth look at the pre-training process for LLaMA 3, covering data collection and curation, model architecture, scaling laws, and the infrastructure supporting the training process. Here’s a detailed breakdown of each part:
3.1 Pre-Training Data
Data Collection and Curation
- Data Sources: The pre-training dataset for LLaMA 3 is collected from a variety of sources with a focus on high-quality, diverse text, including web data, books, scientific papers, code repositories, and other multilingual resources.
- Data Cleaning: Extensive measures are taken to clean the data. The process includes:
- De-duplication: Removing duplicate content at several levels (URL, document, and line-level) to avoid overfitting and bias; a minimal line-level sketch follows this list.
- Filtering: Heuristic and model-based filtering techniques are employed to remove low-quality documents, such as those containing adult content, excessive repetition, or personally identifiable information (PII).
- Code and Math Data: Specialized pipelines are developed to extract code and math-related data, which are essential for tasks requiring reasoning and coding.
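To make the line-level de-duplication step concrete, here is a minimal Python sketch. The occurrence threshold of 6 mirrors the heuristic described in the paper, but the normalization and bookkeeping details are illustrative assumptions, not Meta's actual pipeline:

```python
import hashlib
from collections import Counter

def hash_line(line: str) -> str:
    """Normalize a line and hash it for duplicate counting."""
    return hashlib.sha1(line.strip().lower().encode("utf-8")).hexdigest()

def line_level_dedup(documents: list[str], max_occurrences: int = 6) -> list[str]:
    """Drop lines that recur across many documents (navigation text,
    cookie banners, and similar boilerplate tend to repeat verbatim)."""
    # Count each distinct line once per document.
    counts = Counter(
        hash_line(line) for doc in documents for line in set(doc.splitlines())
    )
    return [
        "\n".join(
            line for line in doc.splitlines()
            if counts[hash_line(line)] <= max_occurrences
        )
        for doc in documents
    ]
```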
Determining the Data Mix
- Knowledge Classification: To ensure a balanced training dataset, a classifier categorizes the collected web data into knowledge domains (e.g., general knowledge, reasoning, code), allowing better control over the proportion of each type of data.
- Scaling Laws for Data Mix: Small-scale models are trained on different data mixes to predict how the full-scale model would perform on each mix, and the best-performing mix is chosen for training the final LLaMA 3 models (see the sketch after this list).
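The selection loop can be sketched as follows. The candidate mixes and the proxy loss function below are hypothetical placeholders; the paper's actual experiments train real small models and extrapolate with scaling laws:

```python
# Hypothetical candidate mixes over knowledge domains (proportions sum to 1).
CANDIDATE_MIXES = [
    {"general": 0.60, "code": 0.20, "math": 0.10, "multilingual": 0.10},
    {"general": 0.50, "code": 0.25, "math": 0.15, "multilingual": 0.10},
    {"general": 0.45, "code": 0.25, "math": 0.20, "multilingual": 0.10},
]

def proxy_validation_loss(mix: dict[str, float]) -> float:
    """Stand-in for training a small model on `mix` and measuring its
    validation loss. This toy formula is fake; substitute a real run."""
    return 2.0 - 0.3 * mix["code"] - 0.2 * mix["math"] + 0.5 * mix["math"] ** 2

# Pick the mix whose small-scale proxy performs best, assuming (as the
# scaling-law experiments do) that the ranking transfers to full scale.
best_mix = min(CANDIDATE_MIXES, key=proxy_validation_loss)
print(best_mix)
```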
Annealing Data
Towards the end of pre-training, the learning rate is annealed (gradually reduced) while focusing on high-quality data sources such as math and coding content. This stage improves the model’s performance on specific tasks like mathematical reasoning, though gains diminish for the largest model.
3.2 Model Architecture
LLaMA 3 adopts a standard dense Transformer architecture, similar to LLaMA 2, with incremental improvements to handle larger data and longer context windows:
- Grouped Query Attention (GQA): LLaMA 3 uses GQA with 8 key-value heads, in which groups of query heads share a single key-value head. This shrinks the key-value cache and improves inference speed and memory efficiency, particularly during decoding (a minimal sketch appears after this list).
- Longer Context: The model supports a context window of up to 128K tokens, significantly larger than most language models. This is achieved by progressively increasing the context length during pre-training, ensuring the model can handle long documents efficiently.
- Vocabulary: A larger vocabulary size of 128,000 tokens is used, which includes 100K tokens from the OpenAI tiktoken tokenizer and 28K additional tokens for better multilingual support. This improves both compression rates and downstream task performance, without negatively impacting English text processing.
- Rotary Positional Embeddings (RoPE): The model uses RoPE for positional embeddings, which allows better scaling to long sequences. The base frequency hyperparameter for RoPE is set to 500,000, enabling efficient handling of very long contexts.
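As a rough illustration of the GQA mechanism referenced above, here is a minimal PyTorch sketch. It omits RoPE, masking, and KV caching, and the head counts and projection shapes are illustrative rather than the released model's:

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_heads=32, n_kv_heads=8):
    """Minimal GQA forward pass. `wq` is (dim, dim); `wk`/`wv` are
    (dim, n_kv_heads * head_dim), so K/V are projected for only 8 heads."""
    bsz, seqlen, dim = x.shape
    head_dim = dim // n_heads
    group = n_heads // n_kv_heads  # query heads sharing each KV head

    q = (x @ wq).view(bsz, seqlen, n_heads, head_dim)
    k = (x @ wk).view(bsz, seqlen, n_kv_heads, head_dim)
    v = (x @ wv).view(bsz, seqlen, n_kv_heads, head_dim)

    # Share each KV head with its group of query heads.
    k = k.repeat_interleave(group, dim=2)
    v = v.repeat_interleave(group, dim=2)

    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (bsz, heads, seq, hd)
    out = F.scaled_dot_product_attention(q, k, v)
    return out.transpose(1, 2).reshape(bsz, seqlen, dim)

# Example: the KV projections are 4x smaller than the query projection here;
# a real cached decoder would store only the 8-head K/V, shrinking the cache 4x.
x = torch.randn(1, 16, 512)
wq = torch.randn(512, 512)
wk = torch.randn(512, 128)   # 8 KV heads * head_dim of 16
wv = torch.randn(512, 128)
print(grouped_query_attention(x, wq, wk, wv).shape)  # torch.Size([1, 16, 512])
```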
Scaling Laws
Scaling laws are empirical relationships used to predict the optimal model size and training data volume for a given compute budget. For LLaMA 3:
- Compute-Optimal Model: The team conducts extensive experiments to develop scaling laws that guide the design of a compute-optimal model. These experiments indicate that, for the available budget, a model of roughly 400 billion parameters is compute-optimal, motivating the 405-billion-parameter flagship as a balance between model size, training compute, and dataset size.
- FLOPs and Training Tokens: LLaMA 3 is trained using 3.8 × 10²⁵ floating point operations (FLOPs) on 15.6 trillion tokens, almost 50× the compute used to train the largest LLaMA 2 model (a quick arithmetic check appears below). The scaling law experiments also show that as the compute budget grows, performance becomes robust to small changes in the trade-off between model size and training tokens.
The scaling law analysis also helps forecast performance on downstream tasks like ARC Challenge (a reasoning benchmark), providing valuable insights for model design.
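These numbers are mutually consistent under the standard C ≈ 6ND rule of thumb for dense Transformer training compute (a common approximation, not stated explicitly in the summary above):

```python
# Approximate training compute with the common rule of thumb C ≈ 6 * N * D.
N = 405e9     # model parameters
D = 15.6e12   # training tokens
print(f"{6 * N * D:.2e} FLOPs")  # 3.79e+25, consistent with the reported 3.8e25
```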
3.3 Infrastructure, Scaling, and Efficiency
Training Infrastructure
- Hardware: LLaMA 3 405B is trained on up to 16,000 H100 GPUs, each equipped with 80GB of high-bandwidth memory (HBM3), hosted on Meta's high-performance AI server platform, Grand Teton. Its training cluster uses RDMA over Converged Ethernet (RoCE), while the smaller models are trained on a separate Nvidia InfiniBand fabric.
- Storage: A distributed file system, Tectonic, is used for data storage, providing 240 petabytes of capacity with a sustainable throughput of 2 TB/s and a peak throughput of 7 TB/s.
- Network: The cluster is designed with 24K GPUs, connected through a three-layer Clos network that optimizes communication latency and throughput. Enhanced-ECMP (E-ECMP) routing ensures efficient load balancing across the network.
Parallelism for Model Scaling
4D Parallelism: LLaMA 3 employs 4D parallelism, which includes tensor, pipeline, context, and data parallelism to efficiently distribute training across thousands of GPUs:
- Tensor Parallelism (TP): Splits the model weights across multiple devices, so that GPUs collaboratively perform the computation for a single operation by working on different parts of the tensor simultaneously.
- Pipeline Parallelism (PP): Divides the model vertically into stages for concurrent execution. The model's layers are partitioned sequentially, so different GPUs can process different layers of the model simultaneously.
- Context Parallelism (CP): Splits long input sequences across devices, improving memory efficiency for long-context training. The input sequence is divided into chunks, and each chunk is processed in parallel by different GPUs.
- Data Parallelism (DP): Processes data in parallel across multiple GPUs while synchronizing after each step. Each GPU processes a different mini-batch of data and computes gradients independently.
In LLaMA 3, these four types of parallelism are integrated in the following way:
- Tensor Parallelism (TP) is applied within each layer to distribute large tensor computations across GPUs.
- Pipeline Parallelism (PP) distributes different layers (or groups of layers) across GPUs, splitting the model across multiple devices.
- Context Parallelism (CP) manages long sequences by splitting them across GPUs, allowing inputs longer than a single GPU can process.
- Data Parallelism (DP) is applied at the outermost level, distributing the training data across GPUs and synchronizing parameter updates.
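To make the nesting order concrete, the sketch below maps a flat GPU rank to (DP, PP, CP, TP) coordinates, with TP innermost (it needs the highest-bandwidth links) and DP outermost. The group sizes are illustrative assumptions, not the exact configuration reported in the paper:

```python
def rank_to_4d(rank: int, tp: int = 8, cp: int = 2, pp: int = 16, dp: int = 64):
    """Map a global GPU rank to (dp, pp, cp, tp) coordinates.
    Sizes are illustrative; here tp * cp * pp * dp = 16,384 GPUs."""
    tp_idx = rank % tp
    cp_idx = (rank // tp) % cp
    pp_idx = (rank // (tp * cp)) % pp
    dp_idx = rank // (tp * cp * pp)
    return dp_idx, pp_idx, cp_idx, tp_idx

# Ranks 0..7 differ only in their TP coordinate: they hold shards of the
# same layers and communicate on every tensor-parallel operation.
print([rank_to_4d(r) for r in range(8)])
```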
Collective Communication
NCCLX: LLaMA 3 uses a custom fork of the Nvidia NCCL library, NCCLX, which is optimized for high-latency networks. This ensures efficient communication between GPUs across a massive cluster, handling issues like congestion and slowdowns through enhanced algorithms and priority mechanisms for critical data transfers.
3.4 Training Recipe
The pre-training process for LLaMA 3 consists of three main stages:
Initial Pre-Training
- Optimizer and Schedule: LLaMA 3 is initially pre-trained using the AdamW optimizer with a peak learning rate of 8×10⁻⁵ and a cosine learning rate schedule that decays over 1.2 million steps (sketched below). The batch size is gradually increased from 4M to 16M tokens over the course of training to improve stability and efficiency.
- Adjusting Data Mix: Throughout training, the team adjusts the data mix to improve model performance, particularly in areas like multilingual understanding and mathematical reasoning.
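A minimal sketch of that schedule is shown below; the peak rate and decay horizon come from the text above, while the 8,000-step linear warm-up and the 8×10⁻⁷ floor follow the paper's reported recipe:

```python
import math

def lr_at(step: int, peak: float = 8e-5, floor: float = 8e-7,
          warmup: int = 8_000, total: int = 1_200_000) -> float:
    """Cosine learning-rate schedule with linear warm-up."""
    if step < warmup:
        return peak * step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

# Warm-up peak, mid-training value, and final floor.
print(lr_at(8_000), lr_at(600_000), lr_at(1_200_000))
```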
Long-Context Pre-Training
Context Length Scaling: After the initial training stage, LLaMA 3 undergoes long-context pre-training to adapt to sequences of up to 128K tokens. This is done progressively, in six stages that start from an 8K-token window and end at 128K, ensuring the model learns to handle long contexts without sacrificing performance on shorter ones.
Annealing
Annealing and Checkpoint Averaging: During the final stages of training, the learning rate is linearly annealed to zero while training on the last 40M tokens. Polyak averaging (averaging model checkpoints) is used during this phase to produce the final pre-trained model, ensuring stability and robustness.
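Checkpoint averaging itself is straightforward; the sketch below assumes checkpoints saved as plain PyTorch state dicts, which is an assumption about file layout rather than a detail from the paper:

```python
import torch

def average_checkpoints(paths: list[str]) -> dict:
    """Polyak-style checkpoint averaging: element-wise mean of parameters
    across several checkpoints saved during annealing."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}
```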
Conclusion
In summary, Section 3 of the paper provides a comprehensive overview of the LLaMA 3 pre-training process, covering everything from data collection and architecture design to scaling strategies and infrastructure optimizations. It emphasizes the importance of a well-balanced data mix, large-scale compute, and efficient parallelism to train a model capable of handling a wide range of tasks, including reasoning, coding, and multilingual processing.