How Kaggle Uses the Crowd to Solve Your Big Data Problems
Odds are, you have more data about your customers (and other things) than you know what to do with. If so, meet Anthony Goldbloom, who does know what to do with it. He's the founder of Kaggle, a San Francisco startup that makes contests out of solving Big Data problems for companies.
Since Goldbloom launched Kaggle in 2010, it has raised more than $11 million in VC funding and built a community of 140,000 data scientists who compete for cash prizes to solve complex problems for companies such as Facebook, Expedia, and GE. Recently, Goldbloom spoke with Inc.
I saw a need for smarter statistical models.
I grew up in Melbourne, Australia. My background is in statistical modeling, and my first job out of university was working for the Australian Treasury, calculating figures such as gross domestic product and the unemployment rate.
The idea for starting Kaggle came in 2008, after I entered and won an essay contest for The Economist. Part of my prize was to go work there as an intern in London for three months. I pitched an idea for an article about how companies used data analysis and algorithms. After I got the assignment, I started calling up people at big companies like Caesars casino, and I realized that I could do a better job solving the problems they were dealing with.
I turned data analysis into a game.
I came home to Australia with the idea that I could create a meritocratic way of solving problems through data science. Companies could post their problems on my website, and then any statistician who was interested could submit a solution that would be scored against any other entries.
Say you are an insurance company, and you want to predict which of your policyholders are most likely to crash their cars. We would give the competitors two sets of data, say from 2010 and 2011. Then we would also give them the customer base for 2012 and ask them to predict who would be most likely to submit claims. Whoever submits the most accurate prediction model wins the prize.
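To make the scoring idea concrete, here is a minimal sketch in Python of how entries in a contest like the insurance example might be ranked. It assumes the organizer holds back the real 2012 claim outcomes and scores each entry by simple accuracy. The policyholder IDs, team names, and function names are hypothetical, and the metric is an assumption too: the article does not say how entries are actually scored.

```python
from typing import Dict, List, Tuple


def accuracy(predicted: Dict[str, bool], actual: Dict[str, bool]) -> float:
    """Fraction of policyholders whose claim outcome was predicted correctly.

    A policyholder missing from the submission counts as a wrong prediction.
    """
    if not actual:
        raise ValueError("no ground-truth outcomes to score against")
    correct = sum(
        1 for policy_id, outcome in actual.items()
        if predicted.get(policy_id) == outcome
    )
    return correct / len(actual)


def leaderboard(
    submissions: Dict[str, Dict[str, bool]],
    actual: Dict[str, bool],
) -> List[Tuple[str, float]]:
    """Score every submission and sort from most to least accurate."""
    scored = [(team, accuracy(pred, actual)) for team, pred in submissions.items()]
    return sorted(scored, key=lambda entry: entry[1], reverse=True)


# Hypothetical held-back outcomes: did each policyholder file a claim in 2012?
actual_2012 = {"p1": True, "p2": False, "p3": False, "p4": True}

# Hypothetical entries: each team's predicted outcome per policyholder.
submissions = {
    "team_a": {"p1": True, "p2": False, "p3": True, "p4": True},
    "team_b": {"p1": True, "p2": False, "p3": False, "p4": True},
    "team_c": {"p1": False, "p2": True, "p3": False, "p4": True},
}

ranking = leaderboard(submissions, actual_2012)
winner = ranking[0][0]
print(ranking)
print("winner:", winner)
```

Because every entry is measured against the same held-back outcomes, the ranking depends only on the predictions, not on who submitted them.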
I got data scientists to compete for prizes.
I think crowdsourcing works particularly well when success can be objectively judged. Our approach really is an elegant way to arrive at solutions. We call it arriving at the "ground truth," where the best solution is not swayed by subjective measures or gut feelings.
I found that it was remarkably easy to recruit data scientists to the site. In 2013, we went from 72,000 to more than 140,000 members. People like solving brainteasers and puzzles, and our competitions appeal to those same people. Competitions are like honey pots for the smartest and most creative data scientists.
Word spread after we predicted reality-show winners--and the progression rate of HIV.
Our first competition was held in April 2010, and the challenge was to predict the winner of the Eurovision Song Contest, which is the equivalent of a Europe-wide American Idol. You see some really bizarre acts on the show, and you also see how countries trade votes based on their political allegiances. For instance, Turks who live in Germany were voting for the Turkish contestant. The BBC picked up the story. So did blogs.
Soon after that, we were contacted by someone at Drexel University who wanted to know if we were interested in solving a real problem: predicting the progression rate of HIV. Then we worked with NASA's Jet Propulsion Laboratory, which wanted a study about dark matter. That was enough of a foothold for word of the site to start spreading virally.
Moving to San Francisco helped me take the company to the next level.
About two years ago, I was working out of my bedroom in Melbourne. It was ridiculous. I realized that I was spending a lot of time traveling back and forth to the U.S. to visit clients and attend conferences. It became clear to me that if I stayed in Melbourne, Kaggle would have a 1 percent chance of success. So I moved to San Francisco. And in November 2011, we were able to raise $11 million from investors.
Algorithms don't work for everything.
Kaggle is really a terrible name, because people in the Midwest pronounce it the same as they do the word for doing pelvic exercises. I actually came up with it by writing an algorithm to find one-word names that I might be able to get a URL for. I knew there was no chance of finding a real word, so I was just searching for something that was easy to pronounce and available. Kaggle was the one name on the list that my friends and family voted for.
We rent the winning formulas to businesses.
We now have 15 employees and have worked for 15 Fortune 500 companies. Our biggest prize to date was $500,000 from the Heritage Provider Network, a health care organization that wanted a model for predicting which patients would become hospitalized. A team of seven data scientists submitted the winning algorithm and split the prize. Kaggle earns revenue by charging companies for the algorithms. There's a flat fee, which gives the customer access to an algorithm for six months, and then a monthly license fee after that.
One of the things that we have learned so far is that certain industries are more promising when it comes to the impact of data science. Oil and gas, for example, is an industry that has a lot of data--from the drill bit or from seismic surveys--that is hard to process by hand or to make decisions on just by looking at charts and graphs. Algorithms can take that data and make a better decision than a human can when it comes to deciding where to drill or whether to abandon your lease.
Some experts are worried.
So far, we've gotten the most resistance from subject matter experts who used to make the kinds of decisions we're offering. They are afraid that their skills have become obsolete. But we feel like the best decisions are the result of both data analysis and experts working together. If you don't have an expert in the loop, you won't be asking the right questions.
5 Recent Kaggle Competitions
Design an algorithm to help airplane pilots make better decisions.
Predict which customers will leave an insurance company in the next 12 months.
Devise a better formula for determining the order in which hotel rooms are displayed to customers.
Develop an algorithm for categorizing webpages as new or evergreen.
Create a formula to predict which tags should go on Facebook posts.
Prize: A job interview
Darren Dahl is a contributing editor at Inc. Magazine, which he has written for since 2004. He also works as a collaborative writer and editor and has partnered with several high-profile authors. Dahl lives in Asheville, NC.