What is SeamlessM4T? A Complete Technical Breakdown of Meta’s Multimodal Translation Model

Language translation is very important in a world where people can talk to each other across countries and cultures. Accurate, context-aware translations are difficult due to written text's nuance and spoken word's rich intonations.

Meta enters this field with SeamlessM4T, a massively multilingual and multimodal machine translation model.

In this blog, we go into the finer points of SeamlessM4T, looking at its technical capabilities, how it handles the complexities of written and spoken communication, and the potential it has to break down language barriers in a world that is becoming more and more connected.

Problems That Come With Written Communication

Idiomatic Expressions: Written language often uses idiomatic expressions, metaphors, and cultural references that may not have precise translations in other languages. Translating them accurately without losing their meaning is difficult.

Ambiguity: Words and sentences can carry more than one meaning depending on the context and on how the reader interprets them. Translating such text correctly requires understanding that context.

Syntax and Grammar: Different languages have different rules for how to put sentences together, how to order words, and how to use grammar. Text that is translated directly without considering these differences can sound strange and be hard to understand.

Technical and Domain-Specific Language: Specialised terminology is often used in technical papers or texts about specific fields, such as law, medicine, or engineering. To translate these works, you need to know not only the language but also the subject matter.

Problems That Come With Spoken Communication

Prosody and Intonation: In spoken language, pitch, rhythm, and intonation can change meaning. When translating spoken language, these things must be considered to convey the speaker's feelings and points of emphasis properly.

Pauses and Fillers: People often use pauses, fillers (like "um" and "uh"), and other nonverbal cues that may not have direct translations in other languages. These cues can signal hesitation, uncertainty, or other subtle meanings that need to be understood and carried over.

Accents, Background Noise, and Audio Quality: Accents, different ways of pronouncing words, and noisy or low-quality audio can make it hard to understand what someone is saying. An accurate translation must deal with these differences to ensure the message is clear.

Cultural Nuances: How people speak is strongly connected to their culture. Some words, tones, or actions can have cultural meanings that can be challenging to translate, leading to misunderstandings.

The Problems and Restrictions of Using LLMs in Communication

ChatGPT and other large language models (LLMs) can be helpful resources for addressing specific difficulties associated with written communication. For instance, they may help with translation, facilitating the swift transformation of text from one language to another.

ChatGPT and other LLMs have clear advantages, but they also have drawbacks. These models do not yet fully capture the variety and complexity of meaning in colloquial expressions and cultural references. Some translations come out too literal, losing the cultural nuance that was originally intended. Contextual ambiguity also remains a problem. While LLMs have come a long way, they still make mistakes and produce translations that do not always capture the original meaning, especially in more complicated cases.

In oral communication, prosody, intonation, and non-verbal cues are challenging for LLMs to transcribe and interpret. LLMs work on text, so these aspects, which are crucial for conveying emotions, emphasis, and intentions, are lost. Accents, background noise, and variations in pronunciation can also make it difficult to transcribe spoken content accurately. To provide accurate and up-to-date translations, LLMs may need ongoing training to keep up with changing linguistic norms, slang, and cultural norms. This is where SeamlessM4T comes into the picture!

So, what is SeamlessM4T?

SeamlessM4T (Massively Multilingual and Multimodal Machine Translation) is a significant advance in speech-to-speech and speech-to-text translation and transcription, and it is presented as the first single multimodal model to cover all of these tasks. The model is available to the public under the CC BY-NC 4.0 license. It accepts speech and text input in nearly 100 languages, produces text output in nearly 100 languages, and produces speech output in 35 languages plus English.

But is it Meta’s first venture with a speech and text tool?

Roughly 3,500 of the world's 6,500 languages are spoken but have no widely used writing system, yet AI-powered translation has concentrated primarily on written languages. Standard methods depend on training a model on a massive corpus of text, which does not exist for these languages, so current approaches effectively cannot produce practical machine translation tools for them.

To overcome this obstacle, Meta developed the first AI-powered Hokkien speech-to-speech translation system in 2022. Hokkien is a Chinese dialect that is spoken by many people outside of China but has no standardized written form.

Before that, Meta had considerably improved and enlarged the LASER (Language-Agnostic Sentence Representations) toolkit to speed up the porting of NLP applications to many more languages. The toolkit supports over 90 languages written in 28 scripts. Instead of using an individual model for each language, LASER achieves this by embedding all of the languages together in a single shared space. Meta released the multilingual encoder, the PyTorch code, and a multilingual test suite covering over a hundred languages at no cost to the community.

LASER paves the way for the zero-shot transfer of natural language processing models from one language, like English, to dozens of others, even those with very limited training data. Low-resource languages like Kabyle and Uighur and dialects like Wu Chinese are among those LASER supports. At the time of its release, no other library used a single model to handle so many different languages. This research has the potential to one day help Facebook and others launch a specific NLP function, like determining whether a movie review is positive or negative, in one language and then immediately deploy it in more than 100 other languages.

The Model of SeamlessM4T

Speech-to-text translation (S2TT) models have made significant progress recently, especially direct models that translate spoken language into written text in a single step [Berard et al., 2016; Weiss et al., 2017a; Di Gangi et al., 2019; Agarwal et al., 2023]. They have even achieved parity with traditional cascaded models in specific scenarios, such as constrained data and particular language pairs.

However, the landscape has shifted with the emergence of massively multilingual translation models [NLLB Team et al., 2022; Siddhant et al., 2022; Fan et al., 2020] and weakly supervised automatic speech recognition (ASR) models [Radford et al., 2022; Zhang et al., 2023a; Pratap et al., 2023]. These newer models, which leverage extensive labeled data and large foundational models, have made the earlier comparisons outdated: direct models now lag behind strong cascaded systems built from these components.

The objective of SeamlessM4T is to bridge this gap by enhancing direct speech-to-text translation models for large-scale multilingual and multimodal contexts. It does so with a stronger direct model for translating both text and speech into text, one that combines a powerful speech representation learning model with a multilingual text-to-text translation (T2TT) model. The focus also extends beyond text output. SeamlessM4T aims to facilitate speech-to-speech translation using UnitY [Inaguma et al., 2023], a two-pass framework that first generates text and then predicts discrete acoustic units. Unlike traditional cascaded models, UnitY's components can be optimized jointly.

This approach addresses the error propagation and domain mismatch issues that often plague cascaded systems. It also uses an intermediate semantic representation to ease the mapping from multimodal sources to targets. The vocoders used for speech synthesis are trained separately. The SeamlessM4T model comprises four main components: (1) SeamlessM4T-NLLB, a massively multilingual T2TT model; (2) w2v-BERT 2.0, a speech representation learning model that uses unlabeled speech audio data; (3) T2U, a text-to-unit sequence-to-sequence model; and (4) a multilingual HiFi-GAN unit vocoder for speech synthesis.

The multitask UnitY model within SeamlessM4T integrates elements from these building blocks and is fine-tuned in three stages. It starts as an X2T model with English as the only target language and progresses to a full multitask UnitY system. The resulting system handles T2TT, speech-to-text translation (S2TT), speech-to-speech translation (S2ST), and automatic speech recognition (ASR). The process begins with unsupervised speech pre-training (w2v-BERT 2.0), followed by the development of the X2T model, including data preparation and multilingual T2TT capabilities. The speech encoder and T2TT model are then jointly fine-tuned to enable multimodal and multitask X2T functionality.
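The two-pass data flow described above can be pictured with a toy routing table. This is only a sketch: the component names (`speech_encoder`, `text_encoder`, `text_decoder`, `text_to_unit`, `unit_vocoder`) are hypothetical labels chosen for illustration, not identifiers from Meta's code, and no real model is loaded.

```python
from typing import Dict, List, Tuple

# Tasks named above, as (input modality, output modality).
TASKS: Dict[str, Tuple[str, str]] = {
    "T2TT": ("text", "text"),
    "S2TT": ("speech", "text"),
    "S2ST": ("speech", "speech"),
    "ASR": ("speech", "text"),
}


def components_for(task: str) -> List[str]:
    """Return the (hypothetical) component names a task passes through, in order."""
    source, target = TASKS[task]
    # First pass: encode the input, then decode text.
    encoder = "speech_encoder" if source == "speech" else "text_encoder"
    pipeline = [encoder, "text_decoder"]
    # Second pass (speech output only): text -> discrete units -> waveform.
    if target == "speech":
        pipeline += ["text_to_unit", "unit_vocoder"]
    return pipeline


for name in TASKS:
    print(name, "->", " > ".join(components_for(name)))
```

The point of the sketch is that every task shares the text decoder, and only speech output adds the second pass through the text-to-unit model and the vocoder.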

The SeamlessM4T approach also tackles the S2ST task, which involves acoustic unit extraction, vocoder design, and the mapping of units back to speech waveforms. The T2U model is pre-trained as well, and the components come together in the final stage of fine-tuning. The evaluation then considers the model's performance across various translation and synthesis tasks, providing insights into its effectiveness and potential impact in advancing multilingual and multimodal communication.

Training

Baseline

To label the language of raw audio, the authors trained a speech language identification (LID) model. They started with a baseline system called VL107 baseline, a model trained from scratch on VoxLingua107 data. After 30 training epochs it reached a 5.25% classification error rate on the VoxLingua107 development set. In comparison, a publicly available model on HuggingFace, referred to as VL107 HF, had a higher error rate of 7%.

Experimental setup

Once they had confirmed their training process, they trained their own model for 40 epochs, which took about 172 hours on 8 GPUs. Their training data covered 17,000 hours of speech across 100 languages, averaging 171 hours per language, with some languages having as little as 1 hour and others as much as 600 hours. For testing, they used a mix of datasets from FLEURS, VoxLingua107, VAANI, IIITH, and KENCORPUS.

Result

They evaluated the model on all 100 SeamlessM4T languages and on the subset of 79 languages shared with VoxLingua107. Interestingly, including more languages during training slightly lowered performance on the shared languages because of confusion between closely related ones, like Zulu and Nyanja, Igbo and Yoruba, and Modern Standard Arabic with Moroccan Arabic and Egyptian Arabic.
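One simple way to surface this kind of confusion is to count misclassifications per language pair. The snippet below is a minimal sketch with made-up labels; the function name `top_confusions` and the sample data are illustrative and do not come from the paper.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_confusions(results: Iterable[Tuple[str, str]], n: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Count misclassifications per unordered language pair, most frequent first."""
    errors: Counter = Counter()
    for true_lang, predicted_lang in results:
        if true_lang != predicted_lang:
            # Sort so that (zul, nya) and (nya, zul) count as the same pair.
            pair = tuple(sorted((true_lang, predicted_lang)))
            errors[pair] += 1
    return errors.most_common(n)


# Made-up (true, predicted) labels using ISO 639-3 codes.
sample = [("zul", "nya"), ("nya", "zul"), ("ibo", "yor"), ("zul", "zul"), ("yor", "yor")]
print(top_confusions(sample))
```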

Filtering

When extracting data, quantity matters, but it is equally important to keep the quality of the language identification (LID) labels high. The amount of data available for each language has to be taken into account, and in some cases filtering is needed to maintain quality. To set up that filtering, the authors estimated, for each language, the Gaussian distribution of LID scores for correct and for incorrect classifications on the development set.
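As a rough illustration of how two such Gaussians could be turned into a filter, the sketch below fits one distribution to the scores of correct predictions and one to the scores of incorrect predictions, then picks the lowest score at which "correct" is still the more likely explanation. The decision rule, the function names, and the sample scores are assumptions made for this example; the text above only states that the distributions were estimated.

```python
from statistics import NormalDist, mean, stdev
from typing import List, Sequence, Tuple


def lid_threshold(correct: Sequence[float], incorrect: Sequence[float], step: float = 0.001) -> float:
    """Lowest score at which 'correct' is still more likely than 'incorrect'.

    Fits one Gaussian per outcome and walks down from the mean of the
    'correct' scores until the prior-weighted 'incorrect' Gaussian takes over.
    """
    g_ok = NormalDist(mean(correct), stdev(correct))
    g_bad = NormalDist(mean(incorrect), stdev(incorrect))
    prior_ok = len(correct) / (len(correct) + len(incorrect))
    prior_bad = 1.0 - prior_ok

    score = g_ok.mean
    while score > 0.0 and prior_ok * g_ok.pdf(score) > prior_bad * g_bad.pdf(score):
        score -= step
    return score + step


def keep_segments(segments: Sequence[Tuple[str, float]], threshold: float) -> List[Tuple[str, float]]:
    """Keep (segment_id, lid_score) pairs whose score reaches the threshold."""
    return [seg for seg in segments if seg[1] >= threshold]


# Made-up development-set scores for one language.
correct_scores = [0.95, 0.90, 0.92, 0.88, 0.97, 0.85]
incorrect_scores = [0.40, 0.55, 0.50, 0.35, 0.60, 0.45]
threshold = lid_threshold(correct_scores, incorrect_scores)
print(round(threshold, 3))
print(keep_segments([("seg-a", 0.95), ("seg-b", 0.60)], threshold))
```

Because the threshold is computed per language, a language whose correct and incorrect scores overlap heavily ends up with a stricter cut-off and loses more data, which is the quantity versus quality trade-off described above.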

How did Meta gather raw audio and text data?

For audio, the effort starts with a repository of around 4 million hours of raw audio obtained from web crawling. The volume of raw audio for each language is listed in Table 10 of the SeamlessM4T paper, with approximately 1 million hours in English. A series of pre-processing steps is then applied to improve speech quality. The process begins with the deduplication of the audio file URLs in the repository, followed by downloading the audio files and resampling them to 16 kHz. Finally, a specialized audio event detection (AED) model filters out non-speech data.
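The three pre-processing steps can be sketched in a few lines. Everything here is a simplified stand-in: `resample_linear` is a naive linear-interpolation resampler with no anti-aliasing filter (a production pipeline would use a proper audio library), and `is_speech` is a hypothetical callback standing in for the AED model.

```python
from typing import Callable, Iterable, List, Sequence

TARGET_RATE = 16_000  # Hz, the sampling rate named above


def dedupe_urls(urls: Iterable[str]) -> List[str]:
    """Drop repeated URLs while keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        key = url.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def resample_linear(samples: Sequence[float], src_rate: int, dst_rate: int = TARGET_RATE) -> List[float]:
    """Naive linear-interpolation resampler (illustration only)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate  # position in the source signal
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def keep_speech(clips: Iterable[Sequence[float]], is_speech: Callable[[Sequence[float]], bool]) -> List[Sequence[float]]:
    """Keep only clips that the (hypothetical) AED callback accepts."""
    return [clip for clip in clips if is_speech(clip)]


print(dedupe_urls(["https://example.com/a.mp3", "https://example.com/b.mp3", "https://example.com/a.mp3"]))
print(resample_linear([0.0, 1.0, 2.0, 3.0], src_rate=32_000))
```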

Audio segmentation is a crucial step for mining Speech-to-Text Translation (S2TT) or Speech-to-Speech Translation (S2ST) data. The goal is to divide audio files into smaller segments that correspond closely to self-contained sentences, mirroring the sentences in text corpora. Because pauses vary across languages and carry meaning of their own, a fixed rule for choosing segment boundaries does not work well. To address this, the study adopts an over-segmentation approach inspired by prior work. It uses an open Voice Activity Detection (VAD) model [Silero, 2021] to break audio files into shorter segments.

Subsequently, a speech Language Identification (LID) model is applied to each segment. Multiple overlapping segment splits are generated, with the final selection of optimal splits being left to the subsequent mining algorithm.
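A minimal sketch of over-segmentation, assuming the VAD step returns time-ordered `(start, end)` chunks in seconds: every run of consecutive chunks that fits within a maximum duration becomes a candidate, so candidates overlap and the mining step can pick the best one. The function name and the 20-second default are illustrative choices, not values from the paper.

```python
from typing import List, Sequence, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def candidate_segments(vad_chunks: Sequence[Span], max_duration: float = 20.0) -> List[Span]:
    """Merge runs of consecutive VAD chunks into overlapping candidate segments.

    Assumes vad_chunks is sorted by start time. A single chunk longer than
    max_duration produces no candidate in this simplified version.
    """
    candidates: List[Span] = []
    for i, (start, _) in enumerate(vad_chunks):
        for j in range(i, len(vad_chunks)):
            end = vad_chunks[j][1]
            if end - start > max_duration:
                break  # extending this run further only makes it longer
            candidates.append((start, end))
    return candidates


chunks = [(0.0, 2.0), (2.5, 4.0), (4.2, 9.0)]
print(candidate_segments(chunks, max_duration=5.0))
# [(0.0, 2.0), (0.0, 4.0), (2.5, 4.0), (4.2, 9.0)]
```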

SeamlessM4T Speech Mining Process

Speech mining, in this context, is the process of automatically searching large collections of audio for speech segments that match, in meaning, speech or text in another language.

Speech mining was conducted using a margin-based criterion, with the help of the Stopes data processing library. The procedure closely follows the methodology established for T2TT mining in NLLB. A global mining approach was adopted, in which all speech segments in one language are matched against all those in another. By contrast, local mining tries to exploit longer speech sections that are expected to contain multiple parallel segments. However, obtaining such high-level information at scale is notably challenging.

The process begins by computing embeddings for all speech segments and text sentences, subsequently indexed using the FAISS library [Johnson et al., 2019]. This allows efficient large-scale similarity search on GPUs.
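To make the margin idea concrete, here is a small pure-Python sketch. It assumes the ratio-margin formulation that is common in bitext mining: the cosine similarity of a candidate pair is divided by the average similarity of each side to its k nearest neighbours. Brute-force search stands in for FAISS, and the embeddings, `k`, and the threshold are made-up values for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def avg_top_k(query: Vector, pool: Sequence[Vector], k: int) -> float:
    """Average cosine similarity between query and its k nearest neighbours in pool."""
    sims = sorted((cosine(query, p) for p in pool), reverse=True)[:k]
    return sum(sims) / len(sims)


def margin_scores(src: Sequence[Vector], tgt: Sequence[Vector], k: int = 2) -> List[List[float]]:
    """Ratio margin: cos(x, y) divided by the mean neighbourhood similarity of x and y."""
    src_avg = [avg_top_k(x, tgt, k) for x in src]
    tgt_avg = [avg_top_k(y, src, k) for y in tgt]
    return [
        [cosine(x, y) / ((src_avg[i] + tgt_avg[j]) / 2.0) for j, y in enumerate(tgt)]
        for i, x in enumerate(src)
    ]


def mine(src: Sequence[Vector], tgt: Sequence[Vector], threshold: float = 1.05, k: int = 2) -> List[Tuple[int, int, float]]:
    """For each source item keep its best target if the margin clears the threshold."""
    pairs = []
    for i, row in enumerate(margin_scores(src, tgt, k)):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


# Toy 2-D "embeddings": two source segments, three target sentences.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
for i, j, score in mine(src, tgt):
    print(f"source {i} <-> target {j} (margin {score:.2f})")
```

Dividing by the neighbourhood average is what separates a margin criterion from a plain cosine threshold: a target like `[0.7, 0.7]` that is moderately similar to everything gets a score below 1 and is not selected.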

The mining operation aligned speech in foreign languages against English texts and English speech. Due to the substantial volume of raw English speech (1 million hours) and foreign text collections (often exceeding 1 billion sentences), this process was selectively executed for specific languages (column Sen2Txx in Table 10). Other directions are identified for potential future exploration.

Except for Maltese, which had limited raw audio availability, alignment of more than 100 hours of speech with English speech was achieved for all languages. Alignments with English texts surpassed a thousand hours for most languages and reached ten thousand hours for six languages (German, French, Spanish, Japanese, Russian, and Mandarin Chinese).

Overall, the SeamlessAlign corpus covers 37 languages and totals roughly 470,000 hours:

English speech to non-English text (Sen2Txx) — approx. 200,000 hours
Non-English speech to English text (Sxx2Ten) — approx. 240,000 hours
Non-English speech to English speech (Sxx2Sen) — approx. 29,000 hours

Using a dataset of this size to train a massively multilingual S2ST translation system poses significant computational challenges. Not all of the mined data was used for modeling; only the subset with the highest SONAR alignment scores was included.

Is SeamlessM4T bias-free?

Meta used the Multilingual HolisticBias dataset, including its speech extension, to compare the performance of speech-to-text (S2TT) and speech-to-speech (S2ST) translation models. The study focused on two translation directions: "eng–X" (English to another language) and "X–eng" (another language to English). This allowed for evaluating model performance in the presence of different gender references and assessing model robustness when gender inflection was altered.

The Outcome of the Experiment

The outcome of the experiment was twofold. First, using the Multilingual HolisticBias dataset and its speech extension, the study provided insights into how translation models handle gender biases and references across various languages. This included understanding how translations may inadvertently favor one gender over another due to linguistic and cultural factors. Second, the experiment evaluated the models' robustness by altering gender inflections. This made it possible to assess whether the models could maintain accurate translations when gender-related linguistic features changed.
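The two measurements can be pictured with a couple of summary statistics. This is a hedged sketch rather than the study's actual metric: it assumes each translation already has a quality score from some automatic metric, and the function names and numbers are invented for illustration.

```python
from statistics import mean
from typing import Dict, List


def gender_gap(scores_by_gender: Dict[str, List[float]]) -> float:
    """Mean quality score for masculine references minus feminine references."""
    return mean(scores_by_gender["masculine"]) - mean(scores_by_gender["feminine"])


def robustness_drop(original: List[float], altered: List[float]) -> float:
    """Average absolute score change when only the gender inflection is altered."""
    return mean(abs(a - b) for a, b in zip(original, altered))


# Made-up per-sentence quality scores.
scores = {"masculine": [0.5, 0.7], "feminine": [0.4, 0.6]}
print(round(gender_gap(scores), 3))
print(round(robustness_drop([0.5, 0.7], [0.5, 0.6]), 3))
```

A gap near zero would suggest the model treats both gendered forms similarly, while a small robustness value would suggest that changing the inflection alone does not disturb translation quality much.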

So, Is SeamlessM4T Actually Bias-Free?

No, the SeamlessM4T model is not entirely bias-free. While the research incorporates valuable efforts to address gender biases in translation models, complete freedom from prejudices remains a complex and challenging goal. Biases can emerge from language's inherent biases, which are deeply intertwined with cultural and societal contexts. Despite the meticulous measures taken to reduce biases, language reflects historical, cultural, and linguistic associations.

SeamlessM4T, while striving to mitigate biases as much as possible, operates within the framework of language and culture. The experiment provides insights into the model's behavior in the presence of gender biases and changes in gender inflection. However, due to the complex nature of biases and language, achieving absolute bias neutrality is a persistent challenge that extends beyond the scope of any single model.

Challenges of SeamlessM4T

Distribution of Benefits and Challenges

As with most technologies, the benefits of SeamlessM4T are not uniformly distributed across all user demographics and social situations. While it aims to make cross-lingual communication more accessible, certain users might encounter more difficulties than others. ASR performance could vary with gender, race, accent, or language. Additionally, the translation of slang or proper nouns might be inconsistent, particularly across languages with varying resource availability.

Challenges in Speech-to-Speech Translation (S2ST)

S2ST faces unique challenges due to the immediacy of speech communication compared to written language. In live conversations, speakers have less ability to fully assess the quality of translated output or to make real-time corrections. This could result in higher interactional risks, including mistranslations and potential toxicity. Developers and researchers are urged to consider design features that help users navigate these challenges. S2ST-driven applications should be seen as augmentation tools that support translation rather than replace language learning or human interpreters, especially in high-stakes scenarios like legal or medical contexts.

Preserving Natural Expression in S2ST

Speech involves more than spoken text; it encompasses prosodic elements like rhythm, stress, intonation, and emotional components. Ensuring natural and organic output generation in S2ST systems requires further research to preserve the expressivity of human communication. Achieving the equivalent of a "Babel Fish" for seamless language translation demands deeper investments in low-latency speech translation research. This includes developing systems that enable real-time streaming translation, which could find applications in industry and education.

Possible use cases of SeamlessM4T

SeamlessM4T is a versatile platform, so businesses have many options for putting it to work. Some examples of applications are as follows.

Customer service in many languages: SeamlessM4T facilitates multilingual customer service for enterprises. It enables a company to translate customer inquiries from many languages and provide responses in the customers' preferred language.

Advertisements in many languages: SeamlessM4T is an effective tool for businesses to develop multilingual marketing materials. It enables firms to translate the content of their website into many languages so that they can connect with a larger customer base.

Multilingual communication: SeamlessM4T helps companies communicate better with their multilingual customers and suppliers. Emails, documents, and presentations are some of the business materials that could benefit from its translation capabilities.

Multilingual data analysis: SeamlessM4T enables businesses to combine data from various sources written in multiple languages. For instance, a company can use SeamlessM4T to compare client responses from several regions and spot global trends.

So, what’s next…

Experience the transformative power of seamless communication with SeamlessM4T. As pioneers in the field, we are dedicated to leading the charge toward progress. Join us on this incredible journey and unlock a world of limitless possibilities. In our relentless pursuit to conquer bias, revolutionize speech-to-speech translation, and preserve the true essence of expression, the future is brimming with boundless potential. It remains to be seen what Google and Microsoft, two of Meta's main rivals, come up with in response!