Microsoft’s AI achieves human voice parity, public release on hold

Published 17 Jul 2024

Microsoft claims its new AI text-to-speech (TTS) generator has reached “human parity” as it can now precisely mimic a human voice.

Replacing the first-generation model released in January 2023, VALL-E 2 was developed to generate speech that sounds identical to a human's using only a few seconds of real audio, according to a research paper published by Microsoft researchers.

“VALL-E 2 can generate accurate, natural speech in the exact voice of the original speaker, comparable to human performance,” the researchers wrote in a blog post.

Surpassing rivals

The U.S. Sun reported that VALL-E 2 outperformed audio samples from existing speech libraries and datasets, like LibriSpeech and VCTK, when evaluated on speaker similarity, naturalness, and speech quality.

Also, using zero-shot learning, the new AI tool showed promising performance in generating not only simple sentences but also complex ones. In this context, zero-shot means that VALL-E 2 does not need prior training examples of a particular speaker: a short sample of the voice is enough for it to replicate that voice.

“VALL-E 2 is the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis, achieving human parity for the first time. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases,” the researchers explained.

Removing repetitions

Two advanced features were applied in building the second-gen AI model to enhance its speech synthesis and achieve human parity.

The first one is Repetition Aware Sampling, which resolves performance issues caused by the repetition of tokens. A token is a small unit of speech that the model generates one after another, and repeated tokens can trip up AI models, similar to how humans stumble over tricky alliteration-heavy sentences.
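To make the idea concrete, here is a minimal Python sketch of repetition-aware sampling. It is an illustration under stated assumptions, not Microsoft's implementation: the function names, the `top_p`, `window`, and `threshold` defaults, and the toy probabilities are all hypothetical. The sketch first picks a token with nucleus (top-p) sampling, then checks how often that token appears in the recent history. If it has been repeated too often, the sketch draws again from the full distribution so the model has a chance to break out of the loop.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample a token index from the smallest set of most likely tokens
    whose combined probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


def repetition_aware_sample(probs, history, top_p=0.8, window=10,
                            threshold=0.5, rng=random):
    """Pick the next token, falling back to plain random sampling when the
    nucleus-sampled token already dominates the recent history.

    All default values are illustrative assumptions.
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > threshold:
            # Too repetitive: resample from the full, untruncated distribution.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.9, 0.05, 0.05]   # toy distribution over three tokens
    stuck_history = [0] * 10    # token 0 has been repeated ten times
    print(repetition_aware_sample(probs, stuck_history, rng=rng))
```

With the toy distribution above, nucleus sampling alone would always return token 0. The fallback gives the less likely tokens a chance once token 0 has filled the recent window.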

The second technique, Grouped Code Modeling, also deals with tokens: it bundles them into groups so that each input sequence processed by VALL-E 2 is shorter, resulting in faster generation.
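The effect of grouping on sequence length can be pictured with a short Python sketch. This is only an illustration of the general idea, assuming tokens are bundled into fixed-size groups; the function name `group_codes`, the group size, and the padding value are hypothetical and not taken from Microsoft's code.

```python
def group_codes(codes, group_size, pad=-1):
    """Bundle a flat token sequence into fixed-size groups.

    The last group is padded with `pad` (a made-up placeholder value)
    when the sequence length is not a multiple of group_size.
    """
    codes = list(codes)
    remainder = len(codes) % group_size
    if remainder:
        codes += [pad] * (group_size - remainder)
    return [tuple(codes[i:i + group_size])
            for i in range(0, len(codes), group_size)]


if __name__ == "__main__":
    tokens = [11, 12, 13, 14, 15, 16, 17, 18]
    grouped = group_codes(tokens, group_size=2)
    # Eight tokens become four groups, so a model that handles one group
    # per step needs half as many steps.
    print(len(tokens), len(grouped))
```

A model that predicts one group per step instead of one token per step has fewer steps to run, which is where the speed-up would come from in this simplified picture.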

Not yet ready for the public

Despite the breakthrough, which could make VALL-E 2 useful for people with aphasia or amyotrophic lateral sclerosis (ALS), Microsoft emphasized that its new AI tool cannot be made available to the public yet, classifying the tech as a research project rather than a product.

This comes as concerns about voice impersonation and voice-spoofing fraud, such as vishing, continue to hound the use of AI tools.

“Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public,” the researchers stated in an ethics statement at the end of the blog. “While VALL-E 2 can speak in a voice like the voice talent… It may carry potential risks in the misuse of the model, such as spoofing voice identification or impersonating a specific speaker.”

This is not the first time Microsoft has stepped back from rolling out its technology due to security and privacy issues. The company has already postponed the rollout of Recall, an AI feature that captures snapshots of users' on-screen activity, following disputes and worries among target customers.

OpenAI has faced a similar obstacle, which forced it to restrict some of its models and build a deepfake detector that helps users distinguish AI-generated images from human-made ones.