Researchers at Microsoft have developed an artificial intelligence (AI) model that can reproduce anyone’s voice based on a three-second snippet.
The AI can even preserve the emotional tone of the subject’s voice, including anger and amusement, reports Ars Technica.
Dozens of audio samples created by the tool, known as Vall-E, can be heard online alongside the human voices they replicate. The examples also include cases where the model replicates the acoustic environment of a recording: if it is fed a sample of a phone call, its simulation may also sound as if it were spoken over the phone.
However, unlike publicly available AI tools like ChatGPT, Microsoft doesn’t let people play with its new creation. This may be because the company is aware of the dangers of the software falling into the wrong hands. If you think spam texts are bad, imagine getting a fake voice call from a loved one asking for your bank information.
In an ethics statement on the Vall-E website, its creators state: “Since VALL-E is able to synthesize speech that preserves speaker identity, if the model is abused, it may carry potential risks, such as spoofing of voice identification or impersonation of a particular speaker.”
There are already cases of scammers using so-called deepfake voices to try to steal money from businesses. In 2020, a Hong Kong bank manager reportedly transferred $35 million to attackers after being tricked by an AI-generated voice.
The technology has also become common in Hollywood in recent years, a sign of how sophisticated it now is. Lucasfilm used an AI-generated voice for Darth Vader in the Disney series Obi-Wan Kenobi. Meanwhile, the use of an AI version of the late chef Anthony Bourdain’s voice in a documentary called Roadrunner caused outrage among some fans.
The Microsoft research team said the AI model could improve text-to-speech applications, speech editing, and content creation when combined with other generative AI models like GPT-3.
The technology uses tools created by Facebook’s parent company Meta, including an audio compression codec called EnCodec. It was also trained on an audio library originally compiled by Meta, containing over 60,000 hours of English speech from over 7,000 speakers.
Ars Technica says that to create a voice simulation, Vall-E analyzes how a person sounds and breaks that information down into components called “tokens” using EnCodec. It then uses the training data to work out how that person would sound when saying something beyond the three-second voice sample.
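To give a rough feel for what breaking audio down into discrete components (often called tokens) can mean, here is a toy Python sketch. It is not Vall-E or Meta's codec, and it does not use their code: the hand-written codebook, the two-sample frames, and the function names are all assumptions made purely for illustration. It only shows the general idea of replacing short stretches of a signal with the index of the closest entry in a fixed table.

```python
def frame_signal(samples, frame_size):
    """Split samples into non-overlapping frames, dropping any leftover tail."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, frame_size)]


def nearest_code(frame, codebook):
    """Return the index of the codebook entry closest to the frame
    (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(frame, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def tokenize(samples, codebook):
    """Turn a list of samples into a list of integer tokens."""
    frame_size = len(codebook[0])
    return [nearest_code(frame, codebook)
            for frame in frame_signal(samples, frame_size)]


def detokenize(tokens, codebook):
    """Rebuild an approximate signal by looking each token up in the codebook."""
    rebuilt = []
    for token in tokens:
        rebuilt.extend(codebook[token])
    return rebuilt


# Hypothetical codebook, written by hand for this example.
# A real neural codec learns its codebooks from data.
TOY_CODEBOOK = [
    [0.0, 0.0],
    [0.5, 0.5],
    [1.0, 1.0],
    [-0.5, -0.5],
]

samples = [0.1, -0.1, 0.4, 0.6, 0.9, 1.1, -0.6, -0.4]
tokens = tokenize(samples, TOY_CODEBOOK)
print(tokens)                             # [0, 1, 2, 3]
print(detokenize(tokens, TOY_CODEBOOK))   # approximate reconstruction
```

The reconstruction is only an approximation of the input, which is the trade-off any such scheme makes: the signal becomes a short sequence of integers that a language-model-style system can work with, at the cost of some detail.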