February 9, 2018 | By Paul Gillin

In the four years since Amazon introduced the Echo, the popularity of speech recognition systems has exploded. One reason is that voice recognition technology has now reached near parity with human transcription accuracy. An estimated 27 million Echo and Google Home devices have been sold, according to Consumer Intelligence Research Partners (CIRP), and the Consumer Technology Association estimated that another 4.4 million were sold during this past holiday season.

This surge has made speech recognition a tempting new target for cybercriminals. Thanks to encryption and tunneling, voice-activated devices are believed to be reasonably secure against compromise at the software level, but what about the commands they accept? Recent research has shown that voice recognition itself can be compromised with unsettling ease.

Subverting the Human Ear

Last summer, a group of researchers at Zhejiang University published a paper describing how popular speech recognition systems, such as Apple’s Siri and Google Now, can be activated using high frequencies that are inaudible to humans but can be picked up by electronic microphones. This technique, which the researchers dubbed DolphinAttack, works even on microphones designed to filter out high-frequency audio, because nonlinear effects in the microphone hardware demodulate the ultrasonic signal, recreating the command as audible harmonics.

By boosting the power of those harmonics, researchers were able to command voice-activated assistants to do things such as visit a malicious website, initiate phone calls, send fake text messages and disable wireless communications. Their brief but unsettling demonstration video shows how this is possible.
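The demodulation effect behind DolphinAttack can be sketched in a few lines of signal processing. The following Python sketch is illustrative only (the carrier frequency, modulation depth and nonlinearity are assumptions, not figures from the paper): it amplitude-modulates a stand-in "command" tone onto a 30 kHz ultrasonic carrier, models a microphone's slightly nonlinear amplifier with a squared term, and low-pass filters the result to recover the original tone.

```python
import numpy as np

fs = 192_000                      # sample rate high enough to represent ultrasound
t = np.arange(int(fs * 0.05)) / fs

# Hypothetical "voice command" baseband: a 1 kHz tone standing in for speech
baseband = np.sin(2 * np.pi * 1_000 * t)

# Amplitude-modulate it onto a 30 kHz carrier -- inaudible to humans
carrier = np.sin(2 * np.pi * 30_000 * t)
transmitted = (1 + 0.8 * baseband) * carrier

# A microphone amplifier is slightly nonlinear; model that with a squared term.
# Squaring an AM signal produces a copy of the baseband at its original frequency.
received = transmitted + 0.1 * transmitted ** 2

# Crude low-pass filter (moving average) to keep only the audible band
kernel = np.ones(96) / 96         # ~500 microsecond window at 192 kHz
demodulated = np.convolve(received, kernel, mode="same")
demodulated -= demodulated.mean()

# The recovered signal correlates strongly with the original "command"
corr = np.corrcoef(demodulated[500:-500], baseband[500:-500])[0, 1]
print(f"correlation with original command: {abs(corr):.2f}")
```

The squared term is the whole story: squaring an amplitude-modulated signal drops a copy of the command back into the audible band, which is why filtering out high frequencies before the amplifier does not stop the attack.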

Hijacking Speech Recognition With Hidden Commands

More recently, two researchers at the University of California, Berkeley published a report that detailed how they were able to embed commands into any kind of audio recognized by Mozilla’s DeepSpeech speech-to-text engine. Given any audio waveform, the authors claimed they could produce another that is more than 99.9 percent similar to the original, yet transcribes as any phrase they chose, at a rate of up to 50 characters per second of audio, with a 100 percent success rate.

The Berkeley researchers posted samples of these “audio adversarial” clips to demonstrate how they embedded the hidden phrase “OK Google, browse to evil.com” into the spoken passage “Without the dataset the article is useless.” It’s nearly impossible to tell the difference.

They did it with music too. The samples include a four-second clip from Verdi’s “Requiem” that masks the same command. The only difference between the two clips is a series of subtle chirps that the passive listener probably wouldn’t even notice.
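Those chirps are the audible residue of an optimization process: nudge the waveform a little at a time until the model's output flips to the attacker's target, while keeping the total change small. The idea can be sketched with a toy model; the synthetic linear scorer below stands in for DeepSpeech, and every name and number in it is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a speech model: a fixed linear scorer over a 100-sample
# "waveform". Both w and x are synthetic; the real attack optimizes against
# DeepSpeech's loss, but the loop has the same shape.
w = rng.normal(size=100)
x = rng.normal(size=100)
x -= (w @ x + 1.0) * w / (w @ w)        # shift x so it scores -1: "benign"

def score(a):
    return w @ a                         # > 0 would transcribe the hidden command

# Iteratively nudge the input along the gradient (for a linear scorer, simply w)
# until the model flips to the attacker's target, keeping each step tiny.
delta = np.zeros_like(x)
while score(x + delta) <= 0.1:
    delta += 0.01 * w / np.linalg.norm(w)

# The perturbation that flips the transcription is small relative to the signal
ratio = np.linalg.norm(delta) / np.linalg.norm(x)
print(score(x + delta) > 0.1, ratio)
```

In the real attack the gradient comes from backpropagating the model's loss on the target phrase through the network, with an added penalty on the perturbation's size, but the loop is the same: follow the gradient toward the target transcription while keeping the change inaudible.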

The technique works because of the complex way machine learning algorithms translate speech to text, which is considerably more difficult than interpreting handwriting or images. Because of the many different ways people pronounce the same sounds, speech recognition algorithms use connectionist temporal classification (CTC) to make an educated guess about how each sound translates to a letter. By making slight changes to the input that are nearly undetectable to the human ear, the researchers crafted waveforms that cancel out the sounds the machine was supposed to hear and substitute the audio they wanted it to hear.
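The decoding step that follows CTC can be illustrated with a minimal greedy decoder: take the most likely label in each audio frame, collapse consecutive repeats, then drop the special "blank" symbol. The adversarial perturbation works by steering these per-frame probabilities. The frame probabilities below are invented for illustration.

```python
import numpy as np

BLANK = "-"
labels = [BLANK, "c", "a", "t"]

# 8 frames x 4 labels of per-frame probabilities (each row sums to 1)
frames = np.array([
    [0.1, 0.7, 0.1, 0.1],   # c
    [0.1, 0.6, 0.2, 0.1],   # c   (repeat, collapsed)
    [0.7, 0.1, 0.1, 0.1],   # -
    [0.1, 0.1, 0.7, 0.1],   # a
    [0.6, 0.1, 0.2, 0.1],   # -
    [0.1, 0.1, 0.1, 0.7],   # t
    [0.1, 0.1, 0.1, 0.7],   # t   (repeat, collapsed)
    [0.7, 0.1, 0.1, 0.1],   # -
])

# Greedy CTC decoding: best label per frame, collapse repeats, drop blanks
best = [labels[i] for i in frames.argmax(axis=1)]
collapsed = [ch for ch, prev in zip(best, [None] + best[:-1]) if ch != prev]
decoded = "".join(ch for ch in collapsed if ch != BLANK)
print(decoded)   # the frame sequence c,c,-,a,-,t,t,- decodes to "cat"
```

Because the final transcription is assembled from many small per-frame guesses, an attacker only has to tip each guess slightly, which is why the total perturbation can stay below the threshold of human hearing.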

Don’t Panic, But Use Caution

This doesn’t mean you should go home and unplug your Alexa. Both proofs of concept have significant limitations. In the case of DolphinAttack, the audio source had to be within six feet of the target device. It’s also reasonably easy for device owners to defend against hijacks by changing their wake phrases or restricting access to critical apps.

The Berkeley researchers only tested their technique on DeepSpeech, which isn’t used by any of the major voice recognition products. They had detailed knowledge of how DeepSpeech works and the benefit of a highly controlled laboratory environment. There was also quite a bit of computational power involved in refining the audio to embed the hidden commands.

Nevertheless, these academic experiments highlight how malicious actors could adapt these techniques to work in the wild. The Berkeley researchers admitted as much, noting in their report that “further work will be able to produce audio adversarial examples that are effective over the air.”

These discoveries are unsettling because voice recognition is on its way to becoming ubiquitous, not just on smartphones, but also in appliances, control devices, sensors and other Internet of Things (IoT) devices. You can imagine the chaos that an attacker could cause by broadcasting hidden commands over a public address system or hijacked TV signal, or even from a boombox in a crowded subway car.

“South Park” and Burger King have already provided real-world examples of how this technique could disrupt both consumers and businesses. Their stunts were in good fun, but you can bet that cybercriminals are already thinking of ways to apply them to their own malicious schemes.

Listen to the podcast: The 5 Indisputable Facts About IoT Security
