7. Vision

Introduction

Section 7 of the paper introduces the vision experiments conducted for LLaMA 3, which add visual-recognition capabilities to the model. The approach is compositional: a pre-trained image encoder is combined with the pre-trained language model through cross-attention layers that fuse image and text representations. This lets the model handle both image-text and video-text inputs effectively.

Data Collection and Processing

The vision experiments rely on two main types of data: images and videos.

  • Image Data: The image encoder and adapter are trained using a dataset of 6 billion image-text pairs. The data processing pipeline includes:
    • Quality filtering: Low-quality captions are removed with heuristics, and poorly aligned image-text pairs are dropped using CLIP alignment scores (see the filtering sketch after this list).
    • Perceptual de-duplication: Near-duplicate images are removed by embedding each image, clustering the embeddings, and comparing items within clusters, so the model does not train on redundant data.
    • Resampling: The data is resampled to keep image-text pairs diverse, which improves performance on low-frequency categories.
    • Optical Character Recognition (OCR): OCR-extracted text is included alongside captions, improving the model's ability to read and interpret text within images.
  • Video Data: For video pre-training, a dataset of video-text pairs is curated and cleaned through several stages:
    • Filtering and cleaning: Text is cleaned using rule-based heuristics, and videos with excessive overlaid text are removed.
    • Contrastive filtering: CLIP-like models are used to align video-text pairs, filtering out low-similarity pairs.
    • Motion-based filtering: Videos with little or no motion are excluded, since static clips provide little temporal signal for video-text learning.
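
To make the quality and contrastive filtering steps above concrete, here is a minimal sketch of CLIP-score filtering for image-text pairs; the checkpoint name and the 0.25 threshold are assumptions for illustration, since the paper does not publish its exact filtering setup.

```python
# Illustrative CLIP-score quality filter for image-text pairs.
# The checkpoint and threshold below are assumptions, not the paper's values.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and caption embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.25) -> bool:
    """Drop pairs whose image-caption similarity falls below the threshold."""
    return clip_score(image, caption) >= threshold
```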

Model Architecture

The architecture for the vision recognition model consists of three key components:

  • Image Encoder: A Vision Transformer (ViT-H/14) trained on 2.5 billion image-text pairs serves as the image encoder. This encoder processes images and aligns them with text representations.

  • Image Adapter: Cross-attention layers are inserted between the image token representations (produced by the image encoder) and the language token representations from the pre-trained LLaMA 3 model. These cross-attention layers, which add roughly 100 billion parameters in the 405B model, integrate visual information into the language model; a minimal sketch of one such block follows this list.

  • Video Adapter: Temporal aggregator layers and video cross-attention layers are added for video-text pairs. These layers help the model understand and process temporal data from videos.
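
Below is a minimal PyTorch sketch of one cross-attention block in which language tokens attend to image tokens. The dimensions, head count, and tanh gating are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a gated cross-attention block: language tokens attend to
# image tokens. Sizes and the gating scheme are assumptions for illustration.
import torch
import torch.nn as nn

class ImageCrossAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 4096, n_heads: int = 32):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Zero-initialized gate so the pre-trained language model's behaviour is
        # unchanged at the start of adapter training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens:  (batch, text_len, d_model) from the language model
        # image_tokens: (batch, num_patches, d_model) from the image encoder
        q = self.norm(text_tokens)
        attended, _ = self.cross_attn(query=q, key=image_tokens, value=image_tokens)
        return text_tokens + torch.tanh(self.gate) * attended
```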

Pre-Training and Scaling

  • The image components are pre-trained in two stages:

    • Initial Pre-Training: The image adapter is pre-trained on the 6 billion image-text pairs, with each image resized to fit within tiles of 336x336 pixels (see the tiling and frame-sampling sketch after this list).
    • Annealing: Following initial training, the image adapter is further trained on a higher-resolution dataset of 500 million images.

  • Video Scaling: For video pre-training, similar scaling strategies are used. A fixed number of frames is sampled uniformly from each video, and temporal aggregator and video cross-attention layers are added to capture temporal relationships.
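
As a concrete illustration of the tiling and uniform frame sampling mentioned above, the sketch below resizes an image onto a grid of 336x336 tiles and picks evenly spaced frame indices from a clip; the fixed 2x2 grid and the 64-frame count are assumptions, not the paper's exact choices.

```python
# Illustrative pre-training helpers: tiling an image into 336x336 tiles and
# choosing uniformly spaced video frames. Grid and frame count are assumed.
import numpy as np
from PIL import Image

TILE = 336  # tile side in pixels used during initial image-adapter pre-training

def tile_image(img: Image.Image, grid: tuple = (2, 2)) -> list:
    """Resize the image to fill a cols x rows grid of TILE x TILE tiles, then crop each tile."""
    cols, rows = grid
    resized = img.resize((cols * TILE, rows * TILE))
    return [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]

def sample_frame_indices(total_frames: int, num_samples: int = 64) -> np.ndarray:
    """Uniformly spaced frame indices across a clip."""
    return np.linspace(0, total_frames - 1, num=num_samples).round().astype(int)
```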

Post-Training

Post-training focuses on refining the vision capabilities through supervised fine-tuning (SFT) and optimization techniques like Direct Preference Optimization (DPO):

  • Supervised Fine-Tuning (SFT): Both image and video adapters are fine-tuned on highly curated multi-modal conversational data. This includes human annotations as well as synthetic data, where a text-only LLM generates diverse question-answer pairs about images and videos.

  • Direct Preference Optimization (DPO): This technique refines model outputs by training on pairwise preference data, where annotators label one response as "chosen" and the other as "rejected" (a minimal sketch of the DPO objective follows this list).
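
For concreteness, the sketch below shows the standard DPO objective applied to such pairwise preference data; the beta value and the way per-response log-probabilities are obtained are assumptions, not the paper's reported configuration for the vision adapters.

```python
# Minimal sketch of the standard DPO objective on pairwise preference data.
# beta and the log-probability computation are assumptions for illustration.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model; 'chosen' responses were preferred by annotators."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```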

Image Recognition Results

LLaMA 3’s vision model is evaluated on multiple image-recognition benchmarks, including:

  • VQAv2: Focuses on answering questions about natural images.
  • DocVQA: Tests document analysis and OCR understanding.
  • ChartQA: Evaluates the model’s ability to understand and answer questions about charts and visual data.

Results indicate that the 405B LLaMA 3 vision model performs competitively across these benchmarks, often outperforming GPT-4V while trailing competitors such as Claude 3.5 Sonnet in certain areas.

Video Recognition Results

LLaMA 3’s video adapter is tested on temporal and causal reasoning benchmarks, such as:

  • PerceptionTest: Evaluates the model's understanding of skills like memory and abstraction in video-based reasoning tasks.
  • NExT-QA: Focuses on causal reasoning and answering questions based on video content.
  • TVQA: Assesses the model’s ability to answer questions based on both visual and subtitle data from TV shows.

The results show that LLaMA 3 performs well at video understanding, outperforming other state-of-the-art models on some of these benchmarks.

Conclusion

In conclusion, Section 7 details how the vision capabilities of LLaMA 3 were developed and tested, showcasing its ability to handle complex image-text and video-text tasks efficiently. Through careful data curation, a compositional model architecture, and advanced training techniques, LLaMA 3 delivers competitive performance in multimodal benchmarks, positioning it as a strong competitor in the domain of vision-language models.