February 9, 2018 By Paul Gillin 3 min read

In the four years since Amazon introduced the Echo, the popularity of speech recognition systems has exploded. One reason is that the quality of voice recognition technology has now reached parity with humans. An estimated 27 million Echo and Google Home devices have been sold, according to Consumer Intelligence Research Partners (CIRP), and the Consumer Technology Association estimated that another 4.4 million were sold during this past holiday season.

This surge has made speech recognition a tempting new target for cybercriminals. Thanks to encryption and tunneling, voice-activated devices are believed to be reasonably secure against compromise at the software level, but what about the commands they accept? Recent research has shown that voice recognition itself can be compromised with unsettling ease.

Subverting the Human Ear

Last summer, a group of researchers at Zhejiang University published a paper describing how popular speech recognition systems, such as Apple’s Siri and Google Now, can be activated using high frequencies that are inaudible to humans but can be picked up by electronic microphones. This technique, which the researchers dubbed DolphinAttack, works even if the microphones are wired to ignore high-frequency audio because the harmonic effect produces the same sound at other frequencies.

By boosting the power of those harmonics, researchers were able to command voice-activated assistants to do things such as visit a malicious website, initiate phone calls, send fake text messages and disable wireless communications. Their brief but unsettling demonstration video shows how this is possible.
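
The inaudible-carrier idea can be illustrated with a toy model. The sketch below is not the researchers' code: it assumes the command is a single 1 kHz tone, the carrier sits at 30 kHz, and the microphone adds a small squared term to whatever it receives. Every constant, including the `a2` coefficient, is a hypothetical value chosen for illustration. It shows that the transmitted signal carries no energy at the audible tone, while the signal after the microphone's distortion does.

```python
import cmath
import math

# All values below are illustrative assumptions, not figures from the paper.
SAMPLE_RATE = 192_000  # Hz, high enough to represent the ultrasonic carrier
CARRIER_HZ = 30_000    # carrier above the range of human hearing
COMMAND_HZ = 1_000     # a single tone standing in for the spoken command
DEPTH = 0.8            # modulation depth
N = 1_920              # 10 ms of samples (whole cycles of both tones)


def am_signal():
    """Amplitude-modulate the 'command' tone onto the ultrasonic carrier."""
    out = []
    for n in range(N):
        t = n / SAMPLE_RATE
        envelope = 1.0 + DEPTH * math.cos(2 * math.pi * COMMAND_HZ * t)
        out.append(envelope * math.cos(2 * math.pi * CARRIER_HZ * t))
    return out


def microphone(samples, a2=0.5):
    """Toy microphone: passes the signal plus a small squared (nonlinear) term."""
    return [s + a2 * s * s for s in samples]


def tone_amplitude(samples, freq_hz):
    """Amplitude of one frequency component, computed as a single DFT bin."""
    acc = sum(
        s * cmath.exp(-2j * math.pi * freq_hz * n / SAMPLE_RATE)
        for n, s in enumerate(samples)
    )
    return 2 * abs(acc) / len(samples)


transmitted = am_signal()
received = microphone(transmitted)

print("command tone in the air:      ", tone_amplitude(transmitted, COMMAND_HZ))
print("command tone after microphone:", tone_amplitude(received, COMMAND_HZ))
```

In this model the first value is effectively zero and the second is about `a2 * DEPTH`, which is why a filter that discards high frequencies after the microphone would not remove the command: by then it already sits in the audible band.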

Hijacking Speech Recognition With Hidden Commands

More recently, two researchers at the University of California, Berkeley published a report that detailed how they were able to embed commands into any kind of audio so that Mozilla’s DeepSpeech speech-to-text software recognizes them. The authors claimed that, given any audio waveform, they could produce another that is more than 99.9 percent similar but is transcribed as any phrase they chose, at a rate of up to 50 characters per second of audio, with a 100 percent success rate.

The Berkeley researchers posted samples of these “audio adversarial” clips to demonstrate how they embedded the hidden phrase, “OK Google, browse to evil.com” in the spoken passage “Without the dataset the article is useless.” It’s nearly impossible to tell the difference.

They did it with music too. The samples include a four-second clip from Verdi’s “Requiem” that masks the same command. The only difference between the two clips is a series of subtle chirps that the passive listener probably wouldn’t even notice.

The technique works because of the complex way machine learning algorithms translate speech to text, which is considerably more difficult than interpreting handwriting or images. Because of the many different ways people pronounce the same sounds, speech recognition algorithms use connectionist temporal classification (CTC) to make an educated guess about how each sound translates to a letter. The researchers were able to craft an audio waveform that the machine transcribed as the phrase of their choosing by making slight changes to the input that are nearly undetectable to the human ear. In essence, they were able to cancel out the sound the machine was supposed to hear in favor of the audio they wanted it to hear.
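
To make the CTC step concrete, here is a minimal sketch of how a CTC-style decoder turns per-frame guesses into text: it takes the most likely symbol for each slice of audio, merges consecutive repeats and drops a special "blank" symbol. This is a simplified greedy decoder, not DeepSpeech's actual decoding code, and the three-symbol alphabet and probability values are made up for illustration. The second example hints at why small input changes matter: nudging a few per-frame scores is enough to flip the transcription.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol used here for the CTC "blank" label


def ctc_collapse(frame_labels):
    """Merge consecutive repeated labels, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


def greedy_decode(frame_probs, alphabet):
    """Pick the most likely symbol in each frame, then collapse the sequence."""
    best = [
        alphabet[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    return ctc_collapse(best)


# Twelve frames of per-frame guesses collapse to a five-letter word.
print(ctc_collapse("hh_e_ll_lo__"))

# Toy per-frame probabilities over a made-up alphabet: blank, "a", "b".
ALPHABET = [BLANK, "a", "b"]
clean = [[0.1, 0.5, 0.4], [0.6, 0.2, 0.2], [0.1, 0.5, 0.4]]
# The same frames after a small, illustrative shift in the scores.
perturbed = [[0.1, 0.44, 0.46], [0.6, 0.2, 0.2], [0.1, 0.44, 0.46]]

print(greedy_decode(clean, ALPHABET))
print(greedy_decode(perturbed, ALPHABET))
```

The blank between the two "l" frames is what lets the decoder keep a genuine double letter, and the last two calls decode to different strings even though the scores differ only slightly.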

Don’t Panic, But Use Caution

This doesn’t mean you should go home and unplug your Alexa. Both proofs of concept have significant limitations. In the case of DolphinAttack, the audio source had to be within six feet of the target device. It’s also reasonably easy for device owners to defend against hijacks by changing their wake phrases or restricting access to critical apps.

The Berkeley researchers only tested their technique on DeepSpeech, which isn’t used by any of the major voice recognition products. They had detailed knowledge of how DeepSpeech works and the benefit of a highly controlled laboratory environment. There was also quite a bit of computational power involved in refining the audio to embed the hidden commands.

Nevertheless, these academic experiments highlight how malicious actors could eventually make such techniques work in the wild. The Berkeley researchers admitted as much, noting in their report that “further work will be able to produce audio adversarial examples that are effective over the air.”

These discoveries are unsettling because voice recognition is on its way to becoming ubiquitous, not just on smartphones, but also in appliances, control devices, sensors and other Internet of Things (IoT) devices. You can imagine the chaos that an attacker could cause by broadcasting hidden commands over a public address system or hijacked TV signal, or even from a boombox in a crowded subway car.

“South Park” and Burger King have already provided real-world examples of how this technique could disrupt both consumers and businesses. Their stunts were in good fun, but you can bet that cybercriminals are already thinking of ways to apply them to their own malicious schemes.

