
4. Post-Training

Introduction

Section 4 of the paper discusses post-training: additional fine-tuning applied to the pre-trained LLaMA 3 model to align it with human instructions and preferences and to improve downstream task performance. Post-training is key to enhancing the model's instruction-following abilities, factual accuracy, tool use, and reasoning capabilities.

Post-Training Process

The post-training pipeline includes:

  • Supervised Fine-Tuning (SFT): The model is fine-tuned on examples gathered through human annotation or generated synthetically, teaching it to follow human instructions by learning from labeled data. This stage applies a standard cross-entropy loss to the model output, with the loss computed only on response tokens (prompt tokens are masked).
  • Direct Preference Optimization (DPO): After SFT, the model is further trained on preference data (human judgments comparing responses) to optimize its outputs. DPO aligns the model with user preferences by learning directly from comparisons between different responses.
  • Rejection Sampling: This technique curates SFT data: for each prompt, several candidate responses are sampled from the model, scored with the reward model, and only the highest-scoring ones are kept for training (a minimal sketch of this selection step follows the list).
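
The sketch below shows how rejection sampling can curate SFT data under stated assumptions: `generate` and `reward_model.score` are hypothetical helpers standing in for model sampling and reward scoring, and the best-of-N selection is a simplified illustration rather than the paper's exact procedure.

```python
# Minimal sketch of rejection sampling for SFT data curation.
# `generate(prompt, n)` and `reward_model.score(prompt, response)` are
# hypothetical helpers, not APIs from the paper.

def rejection_sample(prompts, generate, reward_model, n_samples=10):
    """Keep the highest-reward response per prompt as SFT training data."""
    sft_examples = []
    for prompt in prompts:
        candidates = generate(prompt, n=n_samples)                 # sample N responses
        scores = [reward_model.score(prompt, c) for c in candidates]
        best = candidates[scores.index(max(scores))]               # best-of-N selection
        sft_examples.append({"prompt": prompt, "response": best})
    return sft_examples
```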

Chat Dialog Format

To optimize the model for human-AI interaction in a conversational setting, LLaMA 3 is trained with a chat dialog protocol that structures the exchange of human instructions and AI responses using special header and termination tokens. Capabilities such as tool use may require the model to generate and send multiple messages within a single conversational turn (for example, a tool call followed by a final answer).
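
As a concrete illustration, the publicly documented Llama 3 instruct template wraps each message in header and end-of-turn tokens roughly as shown below; exact special tokens and roles vary across releases, so treat this as a sketch rather than the paper's definitive format.

```python
# Illustrative single-turn prompt in the publicly documented Llama 3 instruct
# template (exact special tokens may differ by release).
prompt = (
    "<|begin_of_text|>"
    "<|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "What is the capital of France?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
# The model generates the assistant message and terminates it with <|eot_id|>.
# For tool use, later releases add further roles and termination tokens so the
# model can emit several messages (e.g., a tool call and a final answer) in one turn.
```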

Reward Modeling

A reward model is trained on human preference data and used to score candidate responses during rejection sampling, which curates the data used for SFT. This helps improve model performance on dimensions such as factual accuracy and user alignment. The preference data consists of pairs of responses labeled "chosen" and "rejected" by human annotators, sometimes accompanied by a third option: an edited version of the chosen response that refines it further.
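
Below is a minimal PyTorch-style sketch of the standard pairwise (Bradley-Terry) ranking loss commonly used to train such reward models; `reward_model` is a hypothetical model that maps a tokenized sequence to a single scalar score, and the interface is an assumption for illustration, not the paper's code.

```python
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry style ranking loss: push r(chosen) above r(rejected).

    `reward_model` is assumed to map token ids to one scalar score per
    sequence (a hypothetical interface for illustration).
    """
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected), with no margin term.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```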

Supervised Fine-Tuning (SFT)

SFT fine-tunes the model on human-labeled prompts and responses using a standard cross-entropy loss. The data includes both human-generated and synthetic examples targeting skills such as code generation, reasoning, and tool use. For the largest LLaMA 3 model (405B parameters), SFT runs for roughly 8,500 to 9,000 steps with a learning rate of 10⁻⁵.
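
A minimal sketch of the SFT objective follows, assuming a generic causal language model that returns per-token logits: next-token cross-entropy, with prompt positions masked out so only response tokens contribute to the loss. The interface is illustrative, not the paper's training code.

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Next-token cross-entropy for SFT.

    `labels` equals `input_ids` with prompt positions set to -100 so the
    loss is computed only on response tokens. `model` is any causal LM
    returning logits of shape (batch, seq, vocab).
    """
    logits = model(input_ids)                      # (batch, seq, vocab)
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,                         # skip masked prompt tokens
    )
```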

Direct Preference Optimization (DPO)

DPO aligns the model with human preferences more directly than SFT: instead of imitating labeled responses, the model is optimized to favor the responses humans chose over those they rejected. DPO, together with the rest of the pipeline, is applied over multiple rounds so that the model's outputs increasingly meet user expectations and instructions.
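
The core DPO objective can be sketched as below, assuming per-sequence log-probabilities have already been computed under the policy being trained and under a frozen reference model (typically the SFT checkpoint); the β value shown is a commonly used default, not necessarily the paper's exact setting.

```python
import torch.nn.functional as F

def dpo_loss(policy_logps_chosen, policy_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """Direct Preference Optimization loss on a batch of preference pairs.

    Inputs are summed log-probabilities of each response under the policy
    and under a frozen reference model. beta controls how far the policy
    may drift from the reference.
    """
    chosen_logratio = policy_logps_chosen - ref_logps_chosen
    rejected_logratio = policy_logps_rejected - ref_logps_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```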

Post-Training Data

The data used for post-training comes from a variety of sources:

  • Human-Annotated Data: Human annotators generate and review prompts and responses, providing examples that help teach the model better instruction-following and reasoning abilities.
  • Synthetic Data: LLaMA 3 and earlier models are used to generate synthetic data, which is a cost-effective way to augment the dataset. Synthetic data includes both single-turn examples and multi-turn dialogue examples for more complex tasks.
  • Data Processing and Quality Control: Various filtering and quality-control mechanisms are applied to ensure the dataset used for post-training is diverse and high-quality, improving model performance across different areas like coding, factuality, and multilingual understanding.
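
As a simplified illustration of score-based quality control, the sketch below keeps examples that pass either a reward-model score or a model-as-judge rating threshold; the helper functions and threshold values are assumptions for illustration, not the paper's full pipeline.

```python
def filter_sft_examples(examples, reward_model, judge_score,
                        rm_threshold=0.0, judge_threshold=3):
    """Keep examples that pass a reward-model score or a model-as-judge rating.

    `reward_model.score` and `judge_score` are hypothetical helpers, and the
    thresholds are illustrative rather than values from the paper.
    """
    kept = []
    for ex in examples:
        rm = reward_model.score(ex["prompt"], ex["response"])
        judge = judge_score(ex["prompt"], ex["response"])   # e.g., a 1-5 rating
        if rm >= rm_threshold or judge >= judge_threshold:
            kept.append(ex)
    return kept
```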

Capabilities

Post-training helps to enhance several key capabilities of LLaMA 3, including:

  • Reasoning and Coding: The model is specifically fine-tuned to perform well on reasoning benchmarks and code generation tasks, improving its problem-solving abilities and code-writing accuracy.
  • Tool Use: The model learns to interact with external tools (e.g., function calls, APIs) through specialized training on function-calling tasks (see the sketch after this list).
  • Factuality: Post-training includes methods to reduce hallucinations and improve factual accuracy, aligning the model with real-world knowledge and teaching it to decline to answer when it is likely to be wrong rather than produce incorrect information.
  • Multilingual Performance: The model is fine-tuned to handle tasks in multiple languages, improving its capabilities across diverse linguistic contexts.
  • Long Context: LLaMA 3 is trained to handle long-context tasks, leveraging its large 128K token context window for better understanding of longer documents.
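
To make the tool-use capability concrete, here is an illustrative multi-message exchange in which the model emits a structured tool call, receives the tool's result, and then answers. The role names and JSON schema are assumptions for illustration, not the paper's exact format.

```python
# Illustrative tool-use exchange as a list of chat messages. The role names
# and JSON structure are assumptions, not the paper's format.
conversation = [
    {"role": "user", "content": "What's the weather in Paris right now?"},
    # The assistant first emits a structured tool call instead of a final answer.
    {"role": "assistant",
     "content": '{"tool": "get_weather", "arguments": {"city": "Paris"}}'},
    # The tool's output is fed back to the model as a separate message.
    {"role": "tool", "content": '{"temperature_c": 18, "condition": "cloudy"}'},
    # The assistant then completes the turn with a natural-language answer.
    {"role": "assistant",
     "content": "It's currently about 18 °C and cloudy in Paris."},
]
```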

Conclusion

In conclusion, Section 4 of the paper details a comprehensive post-training strategy that refines the LLaMA 3 models through human feedback, specialized fine-tuning, and preference optimization with DPO, ultimately enhancing the model's ability to follow instructions, reason, generate accurate information, and handle complex tasks such as tool use.