OpenAI reveals GPT-4o mimics human voices from audio inputs; safeguards now in place

Published 12 Aug 2024

In a system card released on August 8, artificial intelligence (AI) titan OpenAI admitted that its latest model, GPT-4o, unintentionally imitated users’ voices without their consent in ChatGPT Advanced Voice Mode (AVM).

The system card documents the model’s limitations and the external red teaming carried out prior to the release of GPT-4o. The report also outlines the mitigations the company built into the system to address these problems.

Among the risks OpenAI evaluated was unauthorized voice generation, wherein GPT-4o created audio in a human-sounding synthetic voice based on a short input clip. In a recording posted with the system card, a ChatGPT AVM user is heard talking to the model in a “high background noise environment.” This caused GPT-4o to suddenly emulate the user’s voice.

How this happened

OpenAI did not explicitly explain this rare instance in its system card, but Ars Technica reported that the background noise from the user might have confused GPT-4o and acted as an “unintentional prompt injection attack,” replacing the authorized voice sample with audio input from the user.

Text-only large language models (LLMs) have a system prompt that instructs them how to behave before a conversation begins. When a user inputs text, it is converted to tokens and added to the chat history, or context window. The AI attends to the entire context window to understand the user and produce a suitable output.

Similarly, GPT-4o, which also processes audio, can use audio inputs as part of its system prompt. These audio inputs can come from both OpenAI and the user, and they are stored in the same context window as tokens, making them accessible to the model at any time.
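
To make that shared context window concrete, here is a minimal Python sketch. The tokenizers below are toy placeholders invented for illustration; a real multimodal model uses learned text and audio codecs, but the key idea is the same: text and audio both end up as tokens in one flat sequence.

```python
from dataclasses import dataclass, field

@dataclass
class ContextWindow:
    tokens: list[int] = field(default_factory=list)

    def add_text(self, text: str) -> None:
        # Toy text "tokenizer": one token ID per character.
        self.tokens.extend(ord(ch) for ch in text)

    def add_audio(self, samples: list[float]) -> None:
        # Toy audio "tokenizer": quantize each sample in [-1, 1] to an ID.
        # A real multimodal model would use a learned audio codec instead.
        self.tokens.extend(int((s + 1.0) * 500) for s in samples)

ctx = ContextWindow()
ctx.add_text("System: respond only in the preset voice.")  # system prompt
ctx.add_audio([0.1, -0.3, 0.7])                            # user speech
# The model attends over one flat token sequence, text and audio alike.
print(len(ctx.tokens))
```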

Ars Technica wrote that noisy audio could get translated into random tokens, which might have provoked GPT-4o into behaving unexpectedly and generating an unauthorized voice.
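
Reusing the toy quantizer from the sketch above, this illustrates the failure mode Ars Technica describes: once background noise has been tokenized, nothing distinguishes it from deliberate input.

```python
import random

random.seed(0)

# Eight samples of simulated background hiss.
noise = [random.uniform(-1.0, 1.0) for _ in range(8)]

# Quantized with the same toy scheme as above, the hiss becomes a run of
# effectively arbitrary token IDs.
noise_tokens = [int((s + 1.0) * 500) for s in noise]
print(noise_tokens)
# The model must condition on these tokens exactly as it would on
# deliberate input, which is why noise can act like an injected prompt.
```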

Relax, OpenAI has safeguards

Unauthorized voice generation raises the risk of fraud and the spread of false information in adversarial situations. Though OpenAI said the voice generation it observed occurred in a non-adversarial setting, it has safeguards in place to mitigate the problem.

“We addressed voice generation related-risks by allowing only the preset voices we created in collaboration with voice actors to be used. We did this by including the selected voices as ideal completions while post-training the audio model,” OpenAI wrote.
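
A heavily simplified sketch of what “including the selected voices as ideal completions” could look like in post-training data. The names and token values here are placeholders, not OpenAI’s actual pipeline.

```python
# Invented for illustration: an approved preset voice and its audio tokens.
PRESET_VOICE_TOKENS = {"preset_a": [101, 102, 103]}

def make_training_example(prompt_tokens: list[int], voice: str = "preset_a"):
    # Every supervised target ("ideal completion") ends in audio tokens
    # drawn from an approved preset voice, so post-training rewards the
    # model for answering in that voice rather than mimicking whatever
    # voice appears in the prompt audio.
    return prompt_tokens, PRESET_VOICE_TOKENS[voice]
```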

To account for prompt injections that could steer GPT-4o into emulating human voices from stored audio inputs, OpenAI also built an additional classifier on top of the model.

“We built a standalone output classifier to detect if the GPT-4o output is using a voice that’s different from our approved list. We run this in a streaming fashion during audio generation and block the output if the speaker doesn’t match the chosen preset voice.”
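
OpenAI has not published the classifier’s implementation, but the check-and-block flow it describes could look something like this sketch, where identify_speaker stands in for a hypothetical speaker-identification function.

```python
from typing import Callable, Iterable, Iterator

def guard_stream(
    audio_chunks: Iterable[bytes],
    identify_speaker: Callable[[bytes], str],  # hypothetical classifier
    preset_voice: str,
) -> Iterator[bytes]:
    for chunk in audio_chunks:
        # Classify each chunk as it is generated, not after the fact.
        if identify_speaker(chunk) != preset_voice:
            # The speaker deviates from the approved preset: block the
            # output and end the conversation.
            raise RuntimeError("Unauthorized voice detected; output blocked.")
        yield chunk
```

Running the check in a streaming fashion means a deviating voice is cut off mid-generation rather than after the full response has already been played to the user.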

OpenAI claimed its internal evaluation showed minimal residual risk of unauthorized voice generation, with its system catching 100% of meaningful deviations from the system voice.

“While unintentional voice generation still exists as a weakness of the model, we use the secondary classifiers to ensure the conversation is discontinued if this occurs, making the risk of unintentional voice generation minimal,” the company added.