Read My Lips: Researchers Find Ways to Fool Speech Recognition Systems
In the four years since Amazon introduced the Echo, the popularity of speech recognition systems has exploded. One reason is that the quality of voice recognition technology has now reached parity with humans. An estimated 27 million Echo and Google Home devices have been sold, according to Computer Intelligence Research Partners (CIRP), and the Consumer Technology Association expected another 4.4 million were sold during this past holiday season.
This surge has made speech recognition a tempting new target for cybercriminals. Thanks to encryption and tunneling, voice-activated devices are believed to be reasonably secure against compromise at the software level, but what about the commands they accept? Recent research has shown that voice recognition itself can be compromised with unsettling ease.
Subverting the Human Ear
Last summer, a group of researchers at Zhejiang University published a paper describing how popular speech recognition systems, such as Apple’s Siri and Google Now, can be activated using high frequencies that are inaudible to humans but can be picked up by electronic microphones. This technique, which the researchers dubbed DolphinAttack, works even if the microphones are wired to ignore high-frequency audio because the harmonic effect produces the same sound at other frequencies.
By boosting the power of those harmonics, researchers were able to command voice-activated assistants to do things such as visit a malicious website, initiate phone calls, send fake text messages and disable wireless communications. Their brief but unsettling demonstration video shows how this is possible.
Hijacking Speech Recognition With Hidden Commands
More recently, two researchers at the University of California, Berkeley published a report that detailed how they were able to embed commands into any kind of audio that’s recognized by Mozilla’s DeepSpeech voice-to-text translation software. The authors claimed that they were able to duplicate any type of audio waveform with 99.9 percent accuracy and transcribe it as any phrase they chose at a rate of 50 characters per second with a 100 percent success rate.
The Berkeley researchers posted samples of these “audio adversarial” clips to demonstrate how they embedded the hidden phrase, “OK Google, browse to evil.com” in the spoken passage “Without the dataset the article is useless.” It’s nearly impossible to tell the difference.
They did it with music too. The samples include a four-second clip from Verdi’s “Requiem” that masks the same command. The only difference between the two clips is a series of subtle chirps that the passive listener probably wouldn’t even notice.
The technique works because of the complex way machine learning algorithms translate speech to text, which is considerably more difficult than interpreting handwriting or images. Because of the many different ways people pronounce the same sounds, speech recognition algorithms use connectionist temporal classiﬁcation (CTC) to make an educated guess about how each sound translates to a letter. Researchers were able to create an audio waveform that the machine recognized by making slight changes to the input that are nearly undetectable to the human ear. In essence, they were able to cancel out the sound the machine was supposed to hear in favor of the audio they wanted it to hear.
Don’t Panic, But Use Caution
This doesn’t mean you should go home and unplug your Alexa. Both proofs of concept have significant limitations. In the case of DolphinAttack, the audio source had to be within six feet of the target device. It’s also reasonably easy for device owners to defend against hijacks by changing their wake phrases or restricting access to critical apps.
The Berkeley researchers only tested their technique on DeepSpeech, which isn’t used by any of the major voice recognition products. They had detailed knowledge of how DeepSpeech works and the benefit of a highly controlled laboratory environment. There was also quite a bit of computational power involved in refining the audio to embed the hidden commands.
Nevertheless, these academic experiments highlighted the way malicious actors can make these techniques work in the wild. The Berkeley researchers admitted as much, noting in their report that “further work will be able to produce audio adversarial examples that are effective over the air.”
These discoveries are unsettling because voice recognition is on its way to becoming ubiquitous, not just on smartphones, but also in appliances, control devices, sensors and other Internet of Things (IoT) devices. You can imagine the chaos that an attacker could cause by broadcasting hidden commands over a public address system or hijacked TV signal, or even from a boombox in a crowded subway car.
“South Park” and Burger King have already provided real-world examples of how this technique could disrupt both consumers and businesses. Their stunts were in good fun, but you can bet that cybercriminals are already thinking of ways to apply them to their own malicious schemes.