8. Speech
The speech section explores Llama 3's ability to understand and generate speech using a compositional approach, similar to the way visual recognition was integrated. Here's a detailed breakdown of the key areas:
Speech Understanding
The paper explains how a speech encoder, combined with an adapter, processes speech input in Llama 3. The model can perform different speech tasks depending on the system prompt:
- Automatic Speech Recognition (ASR): Converts spoken language into text.
- Automatic Speech Translation (AST): Translates spoken language into another language.
- Spoken Dialogue: The model can act as a general spoken dialogue system, answering questions or having conversations with users.
The speech interface supports 34 languages and can handle text and speech inputs together, making it adept at solving complex audio comprehension tasks.
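To make the task selection concrete, here is a minimal sketch of how a system prompt could route the same speech input to ASR, AST, or dialogue. The prompt strings, the `<|speech|>` placeholder, and the message structure are illustrative assumptions; the paper does not publish its exact prompt templates.

```python
# Hypothetical system prompts for the three speech tasks; the actual
# wording used by Llama 3 is not specified in the paper.
SYSTEM_PROMPTS = {
    "asr": "Transcribe the following speech exactly as spoken.",
    "ast": "Translate the following speech into {target_lang}.",
    "dialogue": "You are a helpful voice assistant. Respond to the user.",
}

def build_messages(task, speech_embeddings, target_lang="French"):
    """Pair a task-selecting system prompt with adapter speech embeddings."""
    system = SYSTEM_PROMPTS[task].format(target_lang=target_lang)
    # The adapter's output embeddings stand in for tokens at the
    # <|speech|> position, so the language model sees speech and text jointly.
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": "<|speech|>", "speech": speech_embeddings},
    ]
```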
Data for Speech Understanding
- The pre-training data consists of about 15 million hours of unlabeled speech, used for self-supervised learning to initialize the speech encoder (a sketch of this style of objective follows this list).
- Supervised finetuning uses task-specific speech recognition, translation, and spoken dialogue data (230K hours for ASR and 90K hours for AST), which adapts the encoder and adapter to each task.
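As a rough illustration of the self-supervised stage, the sketch below implements a masked-prediction objective over mel frames using a frozen random-projection quantizer (a BEST-RQ-style recipe). The dimensions, masking rate, and the `encoder`/`head` modules are placeholder assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class RandomProjectionQuantizer(nn.Module):
    """Frozen random projection + codebook turning mel frames into discrete targets."""
    def __init__(self, feat_dim=80, code_dim=16, codebook_size=8192):
        super().__init__()
        # Buffers are never trained: the targets stay fixed throughout.
        self.register_buffer("proj", torch.randn(feat_dim, code_dim))
        self.register_buffer("codebook", torch.randn(codebook_size, code_dim))

    def forward(self, mel):                        # mel: (batch, time, 80)
        z = mel @ self.proj                        # (batch, time, code_dim)
        dists = (z.unsqueeze(-2) - self.codebook).norm(dim=-1)
        return dists.argmin(dim=-1)                # (batch, time) target ids

def ssl_step(encoder, head, quantizer, mel, mask_prob=0.15):
    """One training step: mask random frames, predict their quantized ids."""
    targets = quantizer(mel)                       # computed on clean features
    mask = torch.rand(mel.shape[:2], device=mel.device) < mask_prob
    masked_mel = mel.masked_fill(mask.unsqueeze(-1), 0.0)
    logits = head(encoder(masked_mel))             # (batch, time, codebook_size)
    return nn.functional.cross_entropy(logits[mask], targets[mask])
```

Only the quantizer is frozen; the encoder and prediction head train jointly, and the head is discarded once the encoder is used to initialize the speech pipeline.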
Model Architecture
- The speech encoder is a Conformer model with 1 billion parameters. Its input is 80-dimensional mel-spectrogram features, first downsampled to a 40 ms frame rate. The encoder stacks 24 Conformer blocks, each with a latent dimension of 1536 and rotary attention with 24 heads.
- After encoding, a speech adapter of about 100 million parameters processes the encoder output. A convolution layer (kernel size 3, stride 2) further reduces the frame rate to one embedding per 80 ms, followed by a rotary Transformer layer and a linear layer that maps the speech embeddings into the Llama 3 language model's embedding space (see the sketch below).
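The following PyTorch sketch traces the shapes through the adapter as described. The Conformer encoder is assumed to exist upstream, a vanilla `TransformerEncoderLayer` stands in for the rotary Transformer layer, and `llm_dim` is a placeholder for the embedding width of whichever Llama 3 model is attached.

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Maps 40 ms Conformer outputs to LM-space embeddings at 80 ms."""
    def __init__(self, enc_dim=1536, llm_dim=4096, n_heads=24):
        super().__init__()
        # Kernel 3, stride 2: halves the frame rate (40 ms -> 80 ms).
        self.conv = nn.Conv1d(enc_dim, enc_dim, kernel_size=3, stride=2, padding=1)
        # Stand-in for the rotary Transformer layer described in the text.
        self.transformer = nn.TransformerEncoderLayer(
            d_model=enc_dim, nhead=n_heads, batch_first=True
        )
        # Final projection into the language model's embedding space.
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, enc_out):                    # (batch, T_40ms, 1536)
        x = self.conv(enc_out.transpose(1, 2)).transpose(1, 2)
        x = self.transformer(x)                    # (batch, T_80ms, 1536)
        return self.proj(x)                        # (batch, T_80ms, llm_dim)

adapter = SpeechAdapter()
speech = torch.randn(1, 100, 1536)   # 4 s of audio at 40 ms frames
print(adapter(speech).shape)         # torch.Size([1, 50, 4096])
```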
Speech Generation
Llama 3 also supports speech generation through a streaming text-to-speech (TTS) system that creates real-time speech waveforms as the model decodes text. The paper emphasizes the following:
- Text Normalization (TN): Converts written text into its contextually correct spoken form, for example deciding whether a number like 123 is read as a cardinal ("one hundred twenty-three") or digit by digit ("one two three").
- Prosody Modeling (PM): This improves the naturalness and expressiveness of generated speech by predicting key features like phone duration and pitch using Llama 3 embeddings.
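A minimal sketch of the PM interface is given below: per-phone linguistic features concatenated with aligned Llama 3 embeddings go in, and per-phone duration and pitch predictions come out. The backbone, feature dimensions, and heads here are illustrative assumptions; the paper's actual PM architecture is not reproduced.

```python
import torch
import torch.nn as nn

class ProsodyModel(nn.Module):
    """Predicts phone duration and pitch conditioned on LM embeddings."""
    def __init__(self, phone_feat_dim=128, llm_dim=4096, hidden=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(phone_feat_dim + llm_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.duration_head = nn.Linear(hidden, 1)  # log-duration per phone
        self.pitch_head = nn.Linear(hidden, 1)     # pitch (e.g. log-F0) per phone

    def forward(self, phone_feats, llm_embeds):
        # phone_feats: (batch, phones, phone_feat_dim)
        # llm_embeds:  (batch, phones, llm_dim), LM states aligned to phones
        h = self.backbone(torch.cat([phone_feats, llm_embeds], dim=-1))
        return self.duration_head(h), self.pitch_head(h)
```

The key design point the text highlights is the conditioning: because the LM embeddings already encode semantic and syntactic context, the prosody model can place emphasis and pauses more naturally than one driven by phone features alone.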
Training and Inference for Speech Tasks
Training for speech understanding: the model undergoes a two-stage process:
- Pre-training on unlabeled speech, so the encoder generalizes.
- Supervised fine-tuning on the specific speech tasks, with the language model kept frozen.

Speech generation: uses a delayed pattern to capture long-range prosodic dependencies, reducing latency and improving responsiveness for real-time synthesis (see the streaming sketch below).
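One way to read the delayed pattern is as a fixed lookahead: audio for a text token is emitted only after a few further tokens have been decoded, so the synthesizer sees a short window of right-context without waiting for the full sentence. The sketch below assumes this reading; `DELAY` and the `synthesize` callback are placeholders, not the paper's implementation.

```python
from typing import Callable, Iterator

DELAY = 3  # number of future tokens to wait for; an assumed value

def stream_tts(text_tokens: Iterator[str],
               synthesize: Callable[..., bytes]) -> Iterator[bytes]:
    """Yield audio chunks token-by-token with a fixed lookahead buffer."""
    buffer: list[str] = []
    for token in text_tokens:
        buffer.append(token)
        if len(buffer) > DELAY:
            # Emit audio for the oldest token, conditioned on the DELAY
            # tokens of right-context that are now available.
            yield synthesize(buffer[0], right_context=buffer[1:])
            buffer.pop(0)
    while buffer:  # flush the tail with whatever context remains
        yield synthesize(buffer[0], right_context=buffer[1:])
        buffer.pop(0)
```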
Speech Understanding and Generation Results
The paper reports strong Llama 3 performance on ASR, AST, and spoken dialogue. On benchmarks such as Multilingual LibriSpeech (MLS) and FLEURS, the model outperforms specialized systems like Whisper and SeamlessM4T. For prosody modeling, token-by-token streaming enables fast, natural speech synthesis, and human evaluations show a significant preference for Llama 3's prosody model over baseline models in perceived speech quality.