From Paris to the world, French artificial intelligence (AI) research laboratory Kyutai publicly unveiled Moshi, which it bills as a pioneering “real-time voice AI assistant,” during a presentation on July 3.
Chief Executive Patrick Pérez said that Moshi, which aims to revolutionize human-machine communication, “thinks while it talks,” and that its key features push the field of conversational AI a leap forward.
Never-Seen-Before Capabilities
Developed from scratch in six months by a team of eight scientists, Moshi is built for real-time, lifelike conversations with users. It can speak in a range of accents and in 70 different emotions and speaking styles, and its standout ability is handling two audio streams at the same time.
This ability allows the AI model to listen and talk simultaneously, which makes it especially useful in customer support and social interactions, where interruptions and overlapping speech are unavoidable.
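As a rough illustration of what listening while talking means in practice, the toy loop below reads an incoming stream at every step, even while it is producing output, and stops speaking as soon as the user cuts in. This is not Kyutai's implementation: the names `Frame` and `full_duplex_turns`, the fixed reply length, and the yield-on-interruption rule are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    step: int
    user_speaking: bool   # what arrives on the incoming stream at this step
    model_speaking: bool  # what is emitted on the outgoing stream at this step


def full_duplex_turns(user_activity, reply_length):
    """Toy full-duplex loop (hypothetical): both streams are handled at every step."""
    frames = []
    remaining = 0       # frames of the current reply still to be spoken
    heard_user = False  # True once the user has said something not yet answered
    for step, user_speaking in enumerate(user_activity):
        if user_speaking:
            # The user is talking, possibly over the reply: drop the reply and listen.
            remaining = 0
            heard_user = True
        elif heard_user:
            # The user has just gone quiet: begin a new reply.
            remaining = reply_length
            heard_user = False
        model_speaking = remaining > 0
        if model_speaking:
            remaining -= 1
        frames.append(Frame(step, user_speaking, model_speaking))
    return frames


# The user speaks, pauses, interrupts the reply at step 4, then goes quiet again.
timeline = full_duplex_turns(
    [True, True, False, False, True, False, False, False], reply_length=3
)
spoken = [frame.model_speaking for frame in timeline]
```

In this run the reply that starts at step 2 is cut short by the interruption at step 4, and a fresh reply begins at step 5. A turn-based system would instead ignore the incoming stream until it had finished speaking.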
Kyutai, established in November 2023 and backed by at least €300 million in investment, built Moshi as a multimodal model that displays text output alongside audio during conversations. The model supports CPU, CUDA, and Metal backends.
Furthermore, the technology is designed to run on consumer-grade devices, such as laptops, without requiring an external connection. Because no sensitive data has to travel over the Internet, conversations between users and the model stay confidential.
Contribution to AI Expansion
In addition, Kyutai announced that Moshi will be released as an open-source project. The model’s code and framework will be shared for free to advance collaboration within the AI community.
The decision followed recent criticism of OpenAI and other big companies over safety issues. To address these concerns, the French startup has doubled down on its commitment to responsible AI use.
During the presentation, Kyutai revealed that it is presently developing AI audio identification, watermarking, and signature tracking systems to be incorporated into Moshi. These systems will help spot AI-generated audio and promote accountability and traceability.
Moshi’s license is expected to be as permissive as possible, in keeping with Kyutai’s stated dedication to innovation.
Behind the Voice
This pioneering open-source AI model is the product of a careful half-year process that included fine-tuning on 100,000 synthetic dialogues created using text-to-speech (TTS) technology. The resulting model has an end-to-end latency of 200 milliseconds.
Because Moshi was trained on annotated speech data rather than text, it learned directly from audio. This lets it pick up the tones and nuances of human communication, paving the way for more natural and lifelike conversations.
Moreover, Kyutai worked with a professional voice artist to develop Moshi’s realistic voice.
Moving forward, the AI industry will be watching as Moshi undergoes a series of iterations in the coming days. Kyutai intends to enhance its technology through user feedback and input from fellow AI developers.