The newly popular concept of Big Data—essentially the ability to analyze millions or even billions of pieces of information and pull out nuggets of insight—has the tech world salivating. For example, look at how researchers at HP showed that Twitter could predict Hollywood box office sales. Mine that data and a movie studio could find a gold mine of business results. So could any other business with the right tools to parse big data from social networks.
Just a couple of things wrong with this observation. One, there is nothing new about analyzing large data sets. Companies have done it for decades. The difference is that now technology not only speeds up the process and further expands how much you can consider, but makes such techniques available to even the smallest business. And the other, bigger, mistake? Twitter doesn't really predict box office results, according to Princeton researchers.
How can two different sets of researchers come to such opposite conclusions? Because caveats in their results underscore some of the most common traps in data analysis.
Big data lives in the world of statistical analysis and inference. One of the first things statistics students are supposed to learn is that results count only to the degree that your sample is representative of the larger group you study. Twitter users may skew more educated and affluent, and possibly older, than the young audiences at which many blockbuster films are aimed. Depending on the films and sets of messages included in the analysis, the correlation might look stronger or weaker.
Tip: Be sure that the data you consider is representative of your customers and prospects. Otherwise, you might see non-existent opportunities—or miss real ones.
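One rough way to act on this tip is to compare the make-up of your sample with the make-up of your customer base, then reweight the sample so each group counts in proportion to its real share. The sketch below is illustrative only: the age groups, the customer shares, and the survey answers are made-up numbers, and the function names are hypothetical, not taken from either study.

```python
from collections import Counter, defaultdict


def group_shares(groups):
    """Return the fraction of records that fall into each group."""
    counts = Counter(groups)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def reweighted_mean(records, target_shares):
    """Average the values within each group, then weight each group's
    average by its share of the population you actually care about."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, value in records:
        sums[group] += value
        counts[group] += 1
    missing = [g for g in target_shares if g not in counts]
    if missing:
        raise ValueError(f"no sample records for groups: {missing}")
    return sum(share * sums[g] / counts[g] for g, share in target_shares.items())


# Made-up sample: (age group, 1 if the person said they were interested, else 0).
# The sample is 80% "25_plus", but the assumed customer base is 60% "under_25".
sample = (
    [("25_plus", 1)] * 2
    + [("25_plus", 0)] * 6
    + [("under_25", 1)] * 2
)
customer_shares = {"under_25": 0.6, "25_plus": 0.4}  # assumed, for illustration

raw_rate = sum(value for _, value in sample) / len(sample)
adjusted_rate = reweighted_mean(sample, customer_shares)

print("sample make-up:", group_shares([group for group, _ in sample]))
print("interest rate, sample as collected:", raw_rate)
print("interest rate, reweighted to customers:", adjusted_rate)
```

If the two rates differ sharply, the sample as collected is probably telling you more about who answered than about your customers. Reweighting only helps when every group has enough records to give a trustworthy average; with two people in a group, as here, it mostly shows how fragile the estimate is.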
A scientific understanding of events requires patience. It's not enough that an experiment shows a desired result once. Subsequent attempts should show the same outcomes and other researchers must be able to reproduce the results. This is especially true when dealing with human psychology, whether in individuals or groups. It could be that, in the two years between the work of the HP and Princeton researchers, something changed in the make-up of Twitter users. Given the rate of growth that the company's service has seen, that's easily understandable. Or there may have been a change in public mood or mistakes in methodology or data collection techniques.
Tip: Don't run an analysis just once. Periodically validate what you think you know.
Know what data you're looking at
The two groups of researchers looked at data beyond just tweets. At HP, they built a model that examined the rate of tweets when a film was released and the number of theaters in which the movie first appeared. But the number of theaters alone might correlate strongly with a film's success. After all, the greater the number of theaters, the more money the studio is probably putting into promotion. The Princeton researchers used machine learning techniques to characterize the tweets as either positive or negative and by whether they were posted before, during, or after watching the movie. In other words, the two groups both said they were testing Twitter's predictive ability; in reality, they both examined the accuracy of predictions using tweets and additional data sources like movie openings in theaters and film ratings on the Internet Movie Database and Rotten Tomatoes. And so, maybe Twitter both is and isn't a good predictor of movie success, depending on how you use it.
Tip: One set of data can provide many types of information. Find different ways you can interpret what you have and analyze more than one of them.
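The theater-count concern can be checked directly: ask how strongly tweet rate tracks box office once the number of theaters is accounted for. One standard tool for that is the partial correlation. The sketch below uses only the Python standard library; the six films and all of their figures are invented for illustration, and this is not the method either research group used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation between xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("zs fully determines xs or ys; nothing is left to compare")
    return (r_xy - r_xz * r_yz) / denom


# Invented figures for six hypothetical films.
theaters = [500, 1200, 2500, 3000, 3500, 4000]   # opening theater count
tweets_per_hour = [40, 90, 260, 280, 300, 420]   # tweet rate around release
opening_gross = [2, 6, 20, 24, 31, 38]           # opening weekend, $ millions

print("tweets vs. gross:", pearson(tweets_per_hour, opening_gross))
print("theaters vs. gross:", pearson(theaters, opening_gross))
print("tweets vs. gross, theaters held fixed:",
      partial_correlation(tweets_per_hour, opening_gross, theaters))
```

If the last figure is much smaller than the first, most of the apparent predictive power of tweets may really be coming from theater count. Partial correlation captures only linear relationships, so treat it as a first check, not a verdict.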
Fallibility and bias
People make mistakes. Big, messy ones as well as the small, subtle type. It could be that one of the research groups made an error. For example, perhaps trying to interpret tweets as either positive or negative was too crude a method. Or maybe researchers actually expected a given result and unconsciously chose data or interpretations to support their thesis. Princeton researchers, for example, assumed that using Twitter alone meant looking at consumer sentiment and not the amount of buzz, measured by the number of tweets over time.
Tip: Never bet the company on one analysis. When you think you've learned a key factor to success, test to see if your knowledge actually works.
Big data can provide a great way for entrepreneurs to hone products and services and better address markets. But information requires interpretation, and interpretation is the work of fallible mortals. That is how the very thing that was supposed to help you succeed can end up tricking you.