February 9, 2018 By Paul Gillin 3 min read

In the four years since Amazon introduced the Echo, the popularity of speech recognition systems has exploded. One reason is that the accuracy of speech recognition technology has reached near-parity with human transcription. An estimated 27 million Echo and Google Home devices have been sold, according to Consumer Intelligence Research Partners (CIRP), and the Consumer Technology Association estimated that another 4.4 million were sold during this past holiday season.

This surge has made speech recognition a tempting new target for cybercriminals. Thanks to encryption and tunneling, voice-activated devices are believed to be reasonably secure against compromise at the software level, but what about the commands they accept? Recent research has shown that voice recognition itself can be compromised with unsettling ease.

Subverting the Human Ear

Last summer, a group of researchers at Zhejiang University published a paper describing how popular speech recognition systems, such as Apple’s Siri and Google Now, can be activated using high frequencies that are inaudible to humans but can be picked up by electronic microphones. This technique, which the researchers dubbed DolphinAttack, works even if the device’s audio pipeline filters out high-frequency sound, because nonlinearities in the microphone hardware demodulate the ultrasonic signal into lower, audible frequencies that the recognizer can process.
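The demodulation step can be sketched numerically. The following toy simulation (my own illustration, not the researchers’ code, with arbitrary signal parameters) amplitude-modulates a low-frequency “command” tone onto an inaudible 30 kHz carrier, then models a microphone’s slight quadratic nonlinearity. After an ordinary low-pass filter, the command reappears at baseband even though nothing audible was ever transmitted:

```python
import numpy as np

fs = 192_000                        # sample rate high enough to represent ultrasound
t = np.arange(0, 0.05, 1 / fs)      # 50 ms of signal

command = 0.5 * np.sin(2 * np.pi * 400 * t)    # stand-in for a voice command (400 Hz tone)
carrier = np.cos(2 * np.pi * 30_000 * t)       # inaudible 30 kHz carrier
transmitted = (1 + command) * carrier          # amplitude modulation: all energy sits near 30 kHz

# A real microphone is slightly nonlinear; model it as y = x + 0.1 * x^2.
# The quadratic term demodulates the AM signal back down to audible frequencies.
mic_output = transmitted + 0.1 * transmitted ** 2

# Crude low-pass filter, as a device's audio front end would apply: keep only < 2 kHz.
spectrum = np.fft.fft(mic_output)
freqs = np.fft.fftfreq(len(t), 1 / fs)
spectrum[np.abs(freqs) > 2000] = 0
recovered = np.fft.ifft(spectrum).real
recovered -= recovered.mean()                  # drop the DC offset the nonlinearity adds

# The recovered baseband closely tracks the original command.
similarity = np.corrcoef(recovered, command)[0, 1]
print(f"correlation with original command: {similarity:.3f}")
```

Low-pass filtering the transmitted signal alone yields essentially silence, which is why a human standing next to the attacker hears nothing while the device hears a command.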

By boosting the power of those harmonics, researchers were able to command voice-activated assistants to do things such as visit a malicious website, initiate phone calls, send fake text messages and disable wireless communications. Their brief but unsettling demonstration video shows how this is possible.

Hijacking Speech Recognition With Hidden Commands

More recently, two researchers at the University of California, Berkeley published a report detailing how they were able to embed commands into virtually any audio recognized by Mozilla’s DeepSpeech speech-to-text engine. The authors claimed that they could take any audio waveform, produce a version that is 99.9 percent similar and have it transcribed as any phrase they chose, at a rate of up to 50 characters per second, with a 100 percent success rate.

The Berkeley researchers posted samples of these “audio adversarial” clips to demonstrate how they embedded the hidden phrase “OK Google, browse to evil.com” in the spoken passage “Without the dataset the article is useless.” It’s nearly impossible to tell the difference.

They did it with music too. The samples include a four-second clip from Verdi’s “Requiem” that masks the same command. The only difference between the two clips is a series of subtle chirps that the passive listener probably wouldn’t even notice.

The technique works because of the complex way machine learning algorithms translate speech to text, a task considerably harder than interpreting handwriting or images. Because people pronounce the same sounds in many different ways, speech recognition models use connectionist temporal classification (CTC) to align variable-length audio with text, in effect making an educated guess about which character each slice of sound represents. The researchers exploited this by computing slight changes to the input waveform, nearly undetectable to the human ear, that steer the model’s output toward an attacker-chosen transcription. In essence, the perturbation cancels out the sound the machine was supposed to hear in favor of the audio the attacker wants it to hear.
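The core trick, gradient descent on the input rather than the model, can be shown in miniature. This hedged sketch uses a tiny random linear classifier as a stand-in for the recognizer (the real attack optimizes DeepSpeech’s CTC loss over a full waveform, which is far heavier); the principle of nudging the input until the model outputs an attacker-chosen label is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a recognizer: a fixed linear model over a 50-sample "waveform".
W = rng.normal(size=(3, 50))          # 3 possible output "transcriptions"
x = rng.normal(size=50)               # the benign input signal

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

original = int(np.argmax(W @ x))      # what the model hears now
target = (original + 1) % 3           # what the attacker wants it to hear

# Gradient descent on the INPUT: minimize cross-entropy toward the target class
# while the model's weights stay fixed.
adv = x.copy()
onehot = np.eye(3)[target]
for _ in range(500):
    p = softmax(W @ adv)
    grad = W.T @ (p - onehot)         # d(loss)/d(input) for a linear model
    adv -= 0.01 * grad

perturbation = adv - x
print("model now outputs class:", int(np.argmax(W @ adv)))
print("perturbation norm is %.0f%% of the signal norm"
      % (100 * np.linalg.norm(perturbation) / np.linalg.norm(x)))
```

The perturbation that flips the output is noticeably smaller than the signal itself; against a deep audio model, the Berkeley work showed it can be made quieter still, down to chirps a listener would dismiss as noise.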

Don’t Panic, But Use Caution

This doesn’t mean you should go home and unplug your Alexa. Both proofs of concept have significant limitations. In the case of DolphinAttack, the audio source had to be within six feet of the target device. It’s also reasonably easy for device owners to defend against hijacks by changing their wake phrases or restricting access to critical apps.

The Berkeley researchers only tested their technique on DeepSpeech, which isn’t used by any of the major voice recognition products. They had detailed knowledge of how DeepSpeech works and the benefit of a highly controlled laboratory environment. There was also quite a bit of computational power involved in refining the audio to embed the hidden commands.

Nevertheless, these academic experiments highlight how malicious actors could eventually make these techniques work in the wild. The Berkeley researchers admitted as much, noting in their report that “further work will be able to produce audio adversarial examples that are effective over the air.”

These discoveries are unsettling because voice recognition is on its way to becoming ubiquitous, not just on smartphones, but also in appliances, control devices, sensors and other Internet of Things (IoT) devices. You can imagine the chaos that an attacker could cause by broadcasting hidden commands over a public address system or hijacked TV signal, or even from a boombox in a crowded subway car.

“South Park” and Burger King have already provided real-world examples of how this technique could disrupt both consumers and businesses. Their stunts were in good fun, but you can bet that cybercriminals are already thinking of ways to apply them to their own malicious schemes.

