Can you tell the difference between AI-generated computer speech and a real, live human being? Maybe you've always thought you could. Maybe you're fond of Alexa and Siri but believe you would never confuse either of them with an actual woman.

Things are about to get a lot more interesting. Google engineers have been hard at work creating a text-to-speech system called Tacotron 2. According to a paper they published this month, the system first creates a spectrogram from the text, a visual representation of how the speech should sound. That spectrogram is then fed to Google's existing WaveNet algorithm, which uses it to produce extremely natural-sounding human speech.
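
For readers curious what a spectrogram actually is, here is a minimal sketch in Python. It runs in the opposite direction from Tacotron 2: it starts with audio (a synthetic 1,000 Hz tone) and measures how much energy sits at each frequency in each short slice of time. The function name, sample rate, and frame sizes are illustrative assumptions, and this is a plain magnitude spectrogram, not the representation or the code used in Google's system.

```python
import cmath
import math


def spectrogram(samples, frame_size, hop):
    """Return one list of frequency-bin magnitudes per time frame.

    Illustrative only: a naive discrete Fourier transform over
    overlapping frames, written for clarity rather than speed.
    """
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        magnitudes = []
        # Only bins up to half the frame size carry distinct information
        # for a real-valued signal.
        for k in range(frame_size // 2 + 1):
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


# Hypothetical settings chosen so the tone lands exactly on one bin.
SAMPLE_RATE = 8000   # samples per second
FRAME_SIZE = 64      # samples per time slice
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(256)]

spec = spectrogram(tone, FRAME_SIZE, hop=32)

# Find the loudest frequency bin in the first frame and convert it to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(f"{len(spec)} frames, loudest frequency: {peak_hz:.0f} Hz")
```

Drawn as an image, with time running left to right and frequency running bottom to top, those numbers are the "picture" of the sound.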

Using this method, the researchers report, "Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech." (A mean opinion score is a telecommunications measure of perceived quality: listeners rate what they hear on a scale of 1 to 5, and the ratings are averaged.)
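
The arithmetic behind a mean opinion score is simple, as this short Python sketch shows. The function name and the ratings are made up for illustration; they are not data from the paper, and the sketch assumes the conventional 1-to-5 rating scale.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener ratings, assuming a 1-to-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
    return mean(ratings)


# Hypothetical ratings from eight listeners for a single audio sample.
example_ratings = [4, 5, 4, 5, 5, 4, 5, 4]
example_mos = mean_opinion_score(example_ratings)
print(f"MOS: {example_mos:.2f}")
```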

As Google's audio samples demonstrate, Tacotron 2 can detect from context the difference between the noun "desert" and the verb "desert," as well as between the noun "present" and the verb "present," and alter its pronunciation accordingly. It can place emphasis on capitalized words and apply the proper inflection when asking a question rather than making a statement.

And it can generate speech that sounds so similar to a human voice that it's difficult or impossible to tell the difference. If you want to see just how hard it is, go to Google's audio samples page, and scroll down to the last set of samples, titled "Tacotron 2 or Human?" There you'll find Tacotron 2 and a real person each saying sentences such as, "That girl did a video about Star Wars lipstick."

SPOILER ALERT: To test yourself, listen to the samples and guess which is which before reading the rest of this column.

So which samples are text-to-speech and which are a real human voice? Google's engineers aren't saying, but they've left a very big clue. Each of the .wav file samples has a filename containing either the term "gen" or "gt." Based on the paper, it's highly probable that "gen" indicates speech generated by Tacotron 2, and "gt" is real human speech. ("GT" likely stands for "ground truth," a machine-learning term that basically means "the real deal.")
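
That guess can be written as a few lines of Python. This is only a sketch: the function name and the example filenames are hypothetical stand-ins, not the actual names on Google's page, and the "gen"/"gt" mapping is the assumption described above.

```python
import re


def guess_source(filename):
    """Guess who is speaking from a sample's filename.

    Assumes "gen" marks generated speech and "gt" marks ground truth.
    """
    # Split on anything that is not a letter or digit so that "gen" is
    # matched as a whole term, not as part of a longer word.
    terms = re.split(r"[^a-z0-9]+", filename.lower())
    if "gen" in terms:
        return "Tacotron 2"
    if "gt" in terms:
        return "Real human"
    return "Unknown"


# Hypothetical filenames, for illustration only.
for name in ["sample_gen_1.wav", "sample_gt_1.wav", "general_notes.wav"]:
    print(name, "->", guess_source(name))
```

The mapping itself remains an inference from the paper, not something Google has confirmed.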

Assuming this is correct, here are the answers to the test:

"That girl did a video about Star Wars lipstick."

Sample 1: Real human

Sample 2: Tacotron 2

"She earned a doctorate in sociology from Columbia University."

Sample 1: Tacotron 2

Sample 2: Real human

"George Washington was the first President of the United States."

Sample 1: Tacotron 2

Sample 2: Real human

"I'm too busy for romance."

Sample 1: Real human

Sample 2: Tacotron 2

How many did you get right? And could you really tell the difference, or did you just have to guess?

Published on: Dec 30, 2017