For me, HAL's human-like voice was one of the eeriest and most unforgettable parts of the movie. Of course, the voice behind HAL sounded real because it wasn't produced by a computer, but instead belonged to an actor--a human being.
But a new development from researchers at Google's DeepMind unit, which is working to develop super-intelligent computers, promises to bring what once belonged to the realm of science fiction (and Hollywood) closer to reality. Last week, they announced a breakthrough in producing text-to-speech (TTS), or speech synthesis, using artificial intelligence.
In blind tests using samples in North American English and Mandarin Chinese, DeepMind's WaveNet algorithm beat the best systems in use today (some of which were developed by Google as well) by as much as 50 percent.
In their blog post, the team at DeepMind shares a few short audio samples of text-to-speech produced using WaveNet versus other methods.
Voice recognition systems have made great strides in recent years, with Apple, Amazon, and Google offering devices and applications that make use of this technology. Mark Bennett, the international director of Google Play, which sells Android apps, told an Android developer conference in London last week that 20 percent of mobile searches using Google are made by voice, not written text.
Though scientists have trained computers to understand the human voice, their ability to train computers to speak like a human has lagged. Computer-generated speech still sounds choppy and robotic, unlike the smooth-talking HAL in Kubrick's film.
While it may not quite match the quality of the human voice yet, DeepMind's WaveNet represents a major leap forward in the quality of artificially generated speech. Here are a few highlights of how it works:
It uses AI to predict speech patterns.
Rather than piecing together pre-recorded audio samples like traditional TTS systems, WaveNet uses neural networks, which imitate brain function. It's the same technology DeepMind used to develop AlphaGo, which beat the top-ranked player of the strategy game Go. WaveNet combines what it has learned from pre-recorded samples with a predictive modeling algorithm to form speech waveforms directly.
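The predictive idea can be sketched as an autoregressive loop: each new audio sample is predicted from the samples that came before it, and the growing waveform is fed back in as input. This is a highly simplified, hypothetical illustration, not WaveNet itself; the real model replaces the toy `predict_next_sample` function below with a deep neural network trained on recorded speech.

```python
import numpy as np

def predict_next_sample(history: np.ndarray) -> float:
    """Stand-in for a trained network: predicts the next audio sample
    from recent ones. Here it's just a toy weighted average that favors
    the most recent samples (WaveNet uses a deep neural network)."""
    weights = np.linspace(0.0, 1.0, len(history))
    weights /= weights.sum()
    return float(np.dot(weights, history))

def generate(seed: np.ndarray, n_samples: int, context: int = 64) -> np.ndarray:
    """Autoregressive generation: append one predicted sample at a time,
    conditioning each prediction on the last `context` samples."""
    waveform = list(seed)
    for _ in range(n_samples):
        history = np.array(waveform[-context:])
        waveform.append(predict_next_sample(history))
    return np.array(waveform)

# Seed with a short sine burst, then let the toy model extrapolate.
seed = np.sin(np.linspace(0, 4 * np.pi, 64))
audio = generate(seed, n_samples=100)
```

The key point the sketch captures is the sequential dependency: sample N+1 cannot be computed until sample N exists, which is also why this approach is computationally demanding.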
It uses raw audio samples--and lots of data.
Most TTS systems piece together pre-recorded audio samples to produce speech from a given text. WaveNet, by contrast, analyzes the raw waveforms of audio signals, sampled 16,000 times per second, and then constructs new waveforms to generate speech. The drawback to this approach is that it requires enormous computational power, which means we're unlikely to see broad-based applications of the technology anytime soon.
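A quick back-of-the-envelope calculation shows why that's expensive: at 16,000 samples per second, each sample is a fresh prediction made one after another, so even a single minute of speech requires nearly a million sequential network evaluations. (The variable names below are illustrative, not part of any real API.)

```python
SAMPLE_RATE = 16_000  # raw audio samples per second, per the article

# Each raw sample is one full prediction by the model, and samples
# must be generated sequentially rather than in parallel.
samples_per_second = SAMPLE_RATE
samples_per_minute = samples_per_second * 60

print(samples_per_second)   # 16000 predictions per second of speech
print(samples_per_minute)   # 960000 predictions for one minute
```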
It can even make music.
Using the same algorithm, DeepMind's researchers also trained WaveNet to create improvised piano pieces that sound like a human is at the keyboard.
Once researchers at DeepMind have resolved the technological barriers to bringing their technology into the mainstream, what kinds of applications are we likely to see? And how will this impact the millions of people whose jobs are dependent on the use of their voice?
Voice artists--the professionals who today voice animated movies and record TV and radio commercials, audiobooks, and podcast intros--may have good reason to feel concerned. It's also easy to imagine the potential impact on the hundreds of thousands of people who handle customer service inquiries.
And the rest of us?
Let's see where this technology takes us...