In 2019, Google released a synthetic speech database with a very specific goal: stopping audio deepfakes.

“Malicious actors may synthesize speech to try to fool voice authentication systems,” the Google News Initiative blog reported at the time. “Perhaps equally concerning, public awareness of ‘deep fakes’ (audio or video clips generated by deep learning models) can be exploited to manipulate trust in media.”

Ironically, also in 2019, Google introduced Translatotron, an artificial intelligence (AI) system that translates speech directly into another language while preserving the speaker’s voice. By 2021, it was clear that deepfake voice manipulation was a serious issue for anyone relying on AI to mimic speech. Google designed Translatotron 2 to prevent voice spoofing.

Two-Edged Sword

Google and other tech giants face a dilemma. AI voice brought us Alexa and Siri; it lets users interact with their smartphones by voice and helps businesses streamline customer service.

However, many of these same companies also launched, or planned to launch, projects that made AI a little too lifelike. Such tools can be used for harm as easily as for good. Big tech, then, mostly shelved these products; the companies agreed they were too dangerous, no matter how useful.

But smaller companies are just as innovative as big tech. Now that AI and machine learning are somewhat democratized, smaller tech companies are willing to take on the risks and ethical concerns of voice tech. Like it or not, the vocal deepfake is here: easy to use and poised to create serious problems.

Ethics (or Lack Thereof) in Voice AI

Some of the largest tech companies are trying to slam the brakes on AI that can mimic live people.

“There are opportunities and harms, and our job is to maximize opportunities and minimize harms,” Tracy Pizzo Frey, ethics committee member at Google, told Reuters.

It’s a tough call to make. Voice AI can be life-changing: it enabled a Rollins College valedictorian, a non-speaking autistic student, to deliver her commencement speech. It can also make everyday life easier. You might screen phone calls with an AI voice assistant. Businesses rely on AI to handle customer service calls so seamlessly that the customer may never know they are talking to a machine rather than a person.

Add in Attackers

But threat actors can use this technology, too. In 2019, thieves used “voice-mimicking software to imitate a company executive’s speech” and tricked an employee into transferring nearly a quarter-million dollars to a secret bank account in Hungary, the Washington Post reported. Although the request struck the director as “rather strange,” the voice was lifelike enough to be convincing, the article noted.

It’s a familiar lament from those who were duped by scam artists. Spoofed email addresses and phone numbers con thousands of employees. The scammers have just moved on to newer tools.

Then there are the ethics of voice cloning. Is it right to use a voice, especially that of someone who has died, for commercial purposes? Who owns the rights to a voice? The person themselves? The family or the estate? Or is a voice up for grabs because it isn’t intellectual property? As it stands, a voice cannot be copyrighted or trademarked, so no one owns it; you don’t even own your own voice.

Voice cloning lacks the protections of copyrighted and trademarked material, making it easy for businesses to exploit for financial gain. The same lack of protection makes voice cloning easy and profitable for threat actors.

Cybersecurity Threats Around Vocal Deepfakes

A threat actor doesn’t need much to create a voice deepfake. The technology is readily available, and a few minutes of recorded audio of someone’s voice is enough to make a rudimentary fake. The more audio available, the more realistic the deepfake becomes. Executives are a particularly attractive target: recordings from webinars, videos on corporate websites and appearances on television or at conferences are widely available.

Attackers often use deepfakes in multilayered business email compromise (BEC) scams. They send a phishing email or text message to an employee and leave a deepfake voice message in the recipient’s voicemail. Most often, the goal is to get the victim to send money: the email specifies how much to send and where, while the deepfake voice message provides the authorization to complete the transaction.

Threat actors are also increasingly using voice deepfakes to bypass voice-based multifactor authentication. The use of voice biometrics is expected to grow 23% by 2026, driven largely by the financial industry. As voice biometrics grow, threat actors are taking voice fakes to new levels: attackers are faking messages from banks asking for account numbers, BankInfoSecurity reported.

How to Avoid a Vocal Deepfake

As with any cybersecurity defense, avoiding voice deepfakes requires a multilayered approach. The first step is to limit the amount of recorded audio from your organization that is readily available online. Restrict webinars and other recordings to authenticated visitors only, and discourage high-level executives from posting video and voice recordings on social media. The less audio available, the harder it is to create a near-flawless deepfake.

Encourage employees to apply a zero trust mindset to anything that doesn’t follow normal procedures. Question everything: if a boss doesn’t normally leave a voice message to follow up on an email, verify it before taking action. If the voice seems even a little off, again, verify before acting.

Reconsider using voice as a stand-alone biometric authentication. Instead, use it with other authentication measures that are more difficult to spoof.
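As a sketch of what pairing voice with a harder-to-spoof factor could look like, the snippet below gates access on both a voice-match score and a standard time-based one-time password (TOTP, RFC 6238). The `voice_score` input and the 0.85 threshold are illustrative assumptions; only the TOTP arithmetic follows the actual standard.

```python
import base64
import hmac
import struct
import time

def totp(secret_b32, for_time=None, digits=6, step=30):
    """RFC 6238 TOTP: HMAC-SHA1 over the moving time-step counter."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int((time.time() if for_time is None else for_time) // step)
    digest = hmac.new(key, struct.pack(">Q", counter), "sha1").digest()
    offset = digest[-1] & 0x0F  # dynamic truncation
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def authenticate(voice_score, submitted_code, secret_b32, voice_threshold=0.85):
    """Accept only when BOTH the voice match and the one-time code check out.

    voice_score is assumed to come from a separate speaker-verification
    system (hypothetical here); a spoofed voice alone is not enough.
    """
    voice_ok = voice_score >= voice_threshold
    code_ok = hmac.compare_digest(submitted_code, totp(secret_b32))
    return voice_ok and code_ok
```

Even a convincing deepfake fails this check without the victim’s second factor, which is the point of layering.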

Finally, use technology to fight technology. If threat actors are using AI to create voice deepfakes, businesses should use AI and machine learning to better detect fake vocal messages.
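As a toy illustration of that idea (not a real detector), the sketch below computes one classic audio feature, spectral flatness, which separates pure, tone-like signals from broadband, noise-like ones. Production deepfake detectors train models over many such features; everything here, including the naive DFT, is a simplified stand-in.

```python
import cmath
import math
import random

def power_spectrum(frame):
    """Naive DFT power spectrum (fine for a short illustrative frame)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(1, n // 2)]  # skip the DC bin

def spectral_flatness(frame):
    """Geometric mean / arithmetic mean of the power spectrum, in [0, 1].

    Pure tones concentrate energy in one bin and score near 0;
    broadband, noise-like audio spreads energy and scores near 1.
    """
    spec = [p + 1e-12 for p in power_spectrum(frame)]  # avoid log(0)
    geo = math.exp(sum(math.log(p) for p in spec) / len(spec))
    return geo / (sum(spec) / len(spec))

# Toy usage on synthetic signals:
n = 256
tone = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
random.seed(0)
noise = [random.uniform(-1.0, 1.0) for _ in range(n)]
```

A real pipeline would extract dozens of such features per frame and feed them to a trained classifier; the feature itself only shows the kind of signal statistics detection systems build on.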
