Don’t overestimate AI’s understanding of human language

This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.

It’s very easy to misread and overestimate achievements in artificial intelligence. And nowhere is this more evident than in the domain of human language, where appearances can falsely hint at in-depth capabilities. In the past year, we’ve seen any number of companies giving the impression that their chatbots, robots and other applications can engage in meaningful conversations as a human would.

You just need to look at Google’s Duplex, Hanson Robotics’ Sophia and numerous other stories to become convinced that we’ve reached a stage where artificial intelligence can manifest human behavior.

But mastering human language requires much more than replicating human-like voices or producing well-formed sentences. It requires common sense, an understanding of context and creativity, none of which current AI technologies possess.

To be sure, deep learning and other AI techniques have come a long way toward bringing humans and computers closer together. But there’s still a huge gap dividing the world of circuits and binary data from the mysteries of the human brain. And unless we understand and acknowledge the differences between AI and human intelligence, we will be disappointed by unmet expectations and miss the real opportunities that advances in artificial intelligence provide.

To understand the true depth of AI’s relationship with human language, we’ve broken the field down into different subdomains, moving from the surface toward the depths.

Speech to text

Voice transcription is one of the areas where AI algorithms have made the most progress. In all fairness, this shouldn’t even be considered artificial intelligence, but the very definition of AI is a bit vague, and since many people might wrongly interpret automated transcription as a manifestation of intelligence, we decided to examine it here.

The older iterations of the technology required programmers to go through the tedious process of discovering and codifying the rules of classifying and converting voice samples into text. Thanks to advances in deep learning and deep neural networks, speech-to-text has taken huge leaps and has become both easier and more precise.

With neural networks, instead of coding the rules, you provide plenty of voice samples and their corresponding text. The neural network finds common patterns in the pronunciation of words and then “learns” to map new voice recordings to their corresponding texts.
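To make the idea concrete, here is a minimal sketch, in PyTorch, of the kind of training step this describes: audio features and a reference transcript go in, and a CTC loss nudges the network toward the correct mapping. Everything here (the feature size, the tiny model, the fake “hello world” sample) is an illustrative assumption, not a description of any production recognizer.

```python
# A toy speech-to-text training step: random "audio features" stand in for a
# recording of the phrase "hello world", and CTC loss aligns the network's
# per-frame character predictions with that transcript. Assumes PyTorch;
# every size and name below is an illustrative assumption.
import torch
import torch.nn as nn

VOCAB = "_abcdefghijklmnopqrstuvwxyz "        # index 0 ("_") is the CTC blank
N_FEATS, N_CLASSES = 40, len(VOCAB)           # e.g. 40 filterbank features per frame

class TinySpeechToText(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(N_FEATS, 128, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 128, N_CLASSES)

    def forward(self, feats):                 # feats: (batch, time, N_FEATS)
        hidden, _ = self.rnn(feats)
        return self.out(hidden).log_softmax(dim=-1)

model = TinySpeechToText()
ctc = nn.CTCLoss(blank=0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

feats = torch.randn(1, 200, N_FEATS)                              # one fake 200-frame utterance
target = torch.tensor([[VOCAB.index(c) for c in "hello world"]])  # its transcript

log_probs = model(feats).transpose(0, 1)      # CTC expects (time, batch, classes)
loss = ctc(log_probs, target,
           input_lengths=torch.tensor([200]),
           target_lengths=torch.tensor([target.shape[1]]))

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"CTC loss after one step: {loss.item():.3f}")
```

With enough real voice samples and transcripts, repeating this step is what lets the network generalize to recordings it has never heard before.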

These advances have enabled many services to provide real-time transcription services to their users.

There are plenty of uses for AI-powered speech-to-text. Google recently presented Call Screen, a feature on Pixel phones that handles scam calls and shows you the text of what the caller is saying in real time. YouTube uses deep learning to provide automated closed captioning.
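If you want to experiment with this yourself, off-the-shelf libraries wrap these cloud transcription services. A hedged example, assuming the third-party Python SpeechRecognition package and a hypothetical audio file, might look like this:

```python
# Transcribe a recorded audio file with an off-the-shelf recognizer.
# Assumes "pip install SpeechRecognition"; "meeting.wav" is a made-up filename.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("meeting.wav") as source:
    audio = recognizer.record(source)            # load the entire recording

try:
    # Sends the audio to Google's free Web Speech API and returns plain text.
    print(recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("The recognizer could not make out any speech.")
except sr.RequestError as err:
    print(f"Could not reach the transcription service: {err}")
```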

But the fact that an AI algorithm can turn voice to text doesn’t mean it understands what it is processing.

Speech synthesis

The flip side of speech-to-text is speech synthesis. Again, this really isn’t intelligence because it has nothing to do with understanding the meaning and context of human language. But it is nonetheless an integral part of many applications that interact with humans in their own language.

Like speech-to-text, speech synthesis has existed for quite a long time. I remember seeing computerized speech synthesis for the first time at a laboratory in the 90s.

ALS patients who have lost their voice have been using the technology for decades to communicate, typing sentences and having a computer read them aloud. The blind also use the technology to read text they can’t see.

However, in the old days, the voice generated by computers did not sound human, and creating a voice model required hundreds of hours of coding and tweaking. Now, with the help of neural networks, synthesizing a human voice has become much less cumbersome.
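To give a sense of how accessible basic synthesis has become, here is a hedged example using the third-party gTTS package (my choice for illustration, not a tool mentioned in this article), which sends text to Google’s text-to-speech service and saves the spoken result as an MP3:

```python
# Turn a sentence into spoken audio with a cloud text-to-speech service.
# Assumes "pip install gTTS" and an internet connection; the text and the
# output filename are made up for this example.
from gtts import gTTS

speech = gTTS(text="Thanks for calling. How can I help you today?", lang="en")
speech.save("greeting.mp3")   # a natural-sounding MP3, no hand-tuned voice model required
```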

Synthesizing a specific person’s voice involves generative adversarial networks (GANs), an AI technique that pits neural networks against each other to create new data. First, a neural network ingests numerous samples of the person’s voice until it can tell whether a new voice sample belongs to that person.

Then, a second neural network generates audio data and runs it through the first one to see whether it validates the data as belonging to the subject. If it doesn’t, the generator corrects its sample and re-runs it through the classifier. The two networks repeat the process until the generator is able to produce samples that sound natural.
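That back-and-forth can be sketched as a simple adversarial training loop. The version below, written against PyTorch, stands random feature vectors in for real recordings and uses tiny fully connected networks, so it illustrates the mechanics rather than serving as a working voice cloner:

```python
# A toy generator-vs-discriminator ("classifier") loop in the spirit of the
# process described above. Real voice-cloning GANs operate on audio, not on
# random 256-dimensional vectors; all sizes here are illustrative assumptions.
import torch
import torch.nn as nn

AUDIO_DIM, NOISE_DIM, BATCH = 256, 64, 32

# Discriminator: learns to tell the target speaker's samples from fakes.
disc = nn.Sequential(nn.Linear(AUDIO_DIM, 128), nn.LeakyReLU(0.2),
                     nn.Linear(128, 1), nn.Sigmoid())
# Generator: turns random noise into candidate "voice" samples.
gen = nn.Sequential(nn.Linear(NOISE_DIM, 128), nn.ReLU(),
                    nn.Linear(128, AUDIO_DIM), nn.Tanh())

bce = nn.BCELoss()
opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(gen.parameters(), lr=2e-4)

real_voice = torch.randn(BATCH, AUDIO_DIM)   # stand-in for the speaker's recordings

for step in range(500):
    # 1) Teach the discriminator to accept real samples and reject fakes.
    fake_voice = gen(torch.randn(BATCH, NOISE_DIM)).detach()
    d_loss = (bce(disc(real_voice), torch.ones(BATCH, 1)) +
              bce(disc(fake_voice), torch.zeros(BATCH, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Teach the generator to produce samples the discriminator accepts.
    fake_voice = gen(torch.randn(BATCH, NOISE_DIM))
    g_loss = bce(disc(fake_voice), torch.ones(BATCH, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```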

There are several websites that enable you to synthesize your own voice using neural networks. The process is as simple as providing enough samples of your voice, far fewer than older generations of the technology required.

There are many good uses for this technology. For instance, companies are using AI-powered voice synthesis to enhance their customer experience and give their brand its own unique voice.

In the field of medicine, AI is helping ALS patients to regain their true voice instead of using a computerized voice. And of course, Google is using the technology for its Duplex feature to place calls on behalf of users with their own voice.

AI speech synthesis also has its malicious uses. Namely, it can be used for forgery, to place calls with the voice of a targeted person, or to spread fake news by imitating the voice of a head of state or a high-profile politician.

I guess I don’t need to remind you that just because a computer can sound like a human doesn’t mean it understands what it says.

Processing human language commands
