What will be the impact of machine learning on economics? originally appeared on Quora - the knowledge sharing network where compelling questions are answered by people with unique insights.
The short answer is that I think it will have an enormous impact. In the early days, machine learning will be used "off the shelf"; in the longer run, econometricians will modify the methods and tailor them to the needs of social scientists, who are primarily interested in conducting inference about causal effects and estimating the impact of counterfactual policies (that is, things that haven't been tried yet, or what would have happened if a different policy had been used). Examples of questions economists often study include the effects of changing prices, introducing price discrimination, changing the minimum wage, or running an advertising campaign. We want to estimate what would happen in the event of a change, or what would have happened if the change hadn't taken place.
As evidence of the impact already, Guido Imbens and I attracted over 250 economics professors to an NBER session on a Saturday afternoon last summer, where we covered machine learning for economists, and every talk I give to economists on this topic draws a large crowd. I think the same is true for the small set of other economists working in this area. There were hundreds of people in a session on big data at the AEA meetings a few weeks ago.
Machine learning is a broad term; I'm going to use it fairly narrowly here. Within machine learning, there are two branches: supervised and unsupervised machine learning. Supervised machine learning typically entails using a set of "features" or "covariates" (x's) to predict an outcome (y). There are a variety of ML methods, such as LASSO (see Victor Chernozhukov (MIT) and coauthors, who have brought this into economics), random forests, regression trees, support vector machines, etc. One common feature of many ML methods is that they use cross-validation to select model complexity; that is, they repeatedly estimate a model on part of the data, test it on another part, and find the "complexity penalty term" that fits the data best in terms of mean-squared prediction error (the squared difference between the model prediction and the actual outcome). In much of cross-sectional econometrics, the tradition has been that the researcher specifies one model and then checks "robustness" by looking at two or three alternatives. I believe that regularization and systematic model selection will become a standard part of empirical practice in economics as we more frequently encounter datasets with many covariates, and as we see the advantages of being systematic about model selection.
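To make the cross-validation idea concrete, here is a minimal sketch in Python using scikit-learn's LassoCV on synthetic data; the setup and variable names are illustrative, not from any particular study. The penalty term is chosen by repeatedly fitting on part of the data and scoring mean-squared error on the held-out part:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic example: 200 observations, 50 covariates, only 3 of which matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=200)

# LassoCV picks the L1 penalty (alpha) by 5-fold cross-validation:
# the model is refit on each training fold, scored by mean-squared
# error on the held-out fold, and the best-fitting alpha is kept.
model = LassoCV(cv=5).fit(X, y)

print("chosen penalty:", model.alpha_)
print("nonzero coefficients:", np.sum(model.coef_ != 0))
```

The point is that the selection procedure is systematic and data-driven, in contrast to hand-picking a few specifications.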
Sendhil Mullainathan (Harvard) and Jon Kleinberg, with a number of coauthors, have argued that there is a set of important policy and decision problems where off-the-shelf ML prediction methods are the key ingredient. They use examples like deciding whether to do a hip replacement operation for an elderly patient: if you can predict based on individual characteristics that the patient will die within a year, then you should not do the operation. Many Americans are incarcerated while awaiting trial; if you can predict who will show up for court, you can release more people on bail. ML algorithms are currently in use for this decision in a number of jurisdictions. Goel, Rao, and Shroff presented a paper at the AEA meetings a few weeks ago using ML methods to examine stop-and-frisk policies. See also the interesting work using ML prediction methods in the "Predictive Cities" session I discussed, where we see ML used in the public sector.
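As a purely illustrative sketch of what such a prediction pipeline involves (synthetic data; this is not the algorithm any jurisdiction actually uses), one could train a classifier to predict failure to appear and rank defendants by predicted risk:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data: each row is a defendant's characteristics (entirely synthetic);
# the label is 1 if the defendant failed to appear in court.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
fail_prob = 1 / (1 + np.exp(-(X[:, 0] - X[:, 1])))  # hypothetical risk process
y = rng.binomial(1, fail_prob)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Rank defendants by predicted risk of failure to appear; a release rule
# could then grant bail below some risk threshold chosen by policymakers.
risk = clf.predict_proba(X_test)[:, 1]
print("share flagged low-risk at a 20% cutoff:", np.mean(risk < 0.20))
```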
Despite these fascinating examples, in general ML prediction models are built on a premise that is fundamentally at odds with a lot of social science work on causal inference. The foundation of supervised ML methods is that model selection (cross-validation) is carried out to optimize goodness of fit on a test sample. A model is good if and only if it predicts well. Yet a cornerstone of introductory econometrics is that prediction is not causal inference, and indeed a classic economic example is that in many economic datasets, price and quantity are positively correlated. Firms set prices higher in high-income cities where consumers buy more; they raise prices in anticipation of times of peak demand. A large body of econometric research seeks to REDUCE the goodness of fit of a model in order to estimate the causal effect of, say, changing prices. If prices and quantities are positively correlated in the data, any model that estimates the true causal effect (quantity goes down if you raise price) will not do as good a job fitting the data. Where the econometric model with a causal estimate does better is at fitting what happens if the firm actually changes prices at a given point in time--at doing counterfactual predictions when the world changes. Techniques like instrumental variables seek to use only some of the information that is in the data--the "clean" or "exogenous" or "experiment-like" variation in price--sacrificing predictive accuracy in the current environment to learn about a more fundamental relationship that will help make decisions about changing price. This type of problem has received almost no attention in ML.
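This point is easy to see in a simulation. The following sketch (synthetic data and a stylized linear demand curve; all coefficients are made up for illustration) shows ordinary least squares finding a positive price "effect" because firms price into demand, while an instrumental-variables estimate based on an exogenous cost shifter recovers the true negative causal effect:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

demand_shock = rng.normal(size=n)   # unobserved demand conditions
cost_shifter = rng.normal(size=n)   # exogenous instrument (e.g., an input cost)

# Firms raise prices when demand is high and when costs are high.
price = 1.0 * demand_shock + 1.0 * cost_shifter + 0.5 * rng.normal(size=n)

# True causal effect of price on quantity is -1, but demand shocks
# push quantity up at exactly the times firms set prices high.
quantity = -1.0 * price + 3.0 * demand_shock + rng.normal(size=n)

# OLS slope: positive, because price is correlated with the demand shock.
ols = np.cov(price, quantity)[0, 1] / np.var(price)

# IV (Wald) estimate: uses only the variation in price driven by the cost
# shifter, which is unrelated to demand, and recovers the causal effect.
iv = np.cov(cost_shifter, quantity)[0, 1] / np.cov(cost_shifter, price)[0, 1]

print(f"OLS estimate: {ols:+.2f} (fits the data better, but not causal)")
print(f"IV estimate:  {iv:+.2f} (close to the true effect of -1)")
```

The OLS model predicts the observed data better; the IV model is the one you would want when deciding whether to change prices.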
In some of my research, I am exploring the idea that you might take the strengths and innovations of ML methods but apply them to causal inference. It requires changing the objective function, since the ground truth of the causal parameter is not observed in any test set. Statistical theory plays a bigger role, since we need a model of the unobserved thing we want to estimate (the causal effect) in order to define the target that the algorithms optimize. I'm also working on developing statistical theory for some of the most widely used and successful estimators, like random forests, and adapting them so that they can be used to predict an individual's treatment effect as a function of their characteristics. For example, I can tell you for a particular individual, given their characteristics, how they would respond to a price change, and this estimate comes with a confidence interval. You can search for my papers online; I also wrote a paper, published in the American Economic Review last year, on using ML methods to systematically assess the robustness of causal estimates. I hope that some of these methods can be applied in practice to evaluate randomized controlled trials, A/B tests at tech firms, etc., in order to systematically discover heterogeneous treatment effects.
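The actual estimators in my papers are more careful than anything that fits here, but a simplified "two-model" sketch conveys the flavor: on a synthetic randomized experiment, fit separate random forests to treated and control units and difference their predictions to estimate each individual's treatment effect. This is an illustrative stand-in, not the method from my papers, and it omits the confidence intervals discussed above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic randomized experiment: the treatment effect varies with x0.
rng = np.random.default_rng(3)
n = 5000
X = rng.normal(size=(n, 5))
treated = rng.binomial(1, 0.5, size=n).astype(bool)
true_effect = 2.0 * (X[:, 0] > 0)               # heterogeneous by construction
y = X[:, 1] + true_effect * treated + rng.normal(size=n)

# "Two-model" sketch: fit separate forests on treated and control units,
# then difference their predictions to estimate each individual's effect.
forest_t = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[treated], y[treated])
forest_c = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[~treated], y[~treated])
estimated_effect = forest_t.predict(X) - forest_c.predict(X)

print("mean estimated effect when x0 > 0:", estimated_effect[X[:, 0] > 0].mean())
print("mean estimated effect when x0 <= 0:", estimated_effect[X[:, 0] <= 0].mean())
```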
Unsupervised machine learning tools differ from supervised ones in that there is no outcome variable (no "y"): these tools can be used to find clusters of similar objects. I have used these tools in my own research to find clusters of news articles on a similar topic. They are commonly used to group images or videos; when you hear that a computer scientist "discovered cats" on YouTube, it can mean that they used an unsupervised ML method to find a set of similar videos, and when a human watches them, all the videos in cluster 1572 turn out to be about cats, while all the videos in cluster 423 are about dogs. I see these tools as being very useful as an intermediate step in empirical work, as a data-driven way to find similar articles, reviews, products, user histories, etc.
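As an illustrative sketch of that intermediate step (the documents and cluster count below are placeholders, not my actual research pipeline), one might cluster a tiny corpus using TF-IDF features and k-means:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus; in practice these would be news articles, reviews, etc.
docs = [
    "fed raises interest rates amid inflation concerns",
    "central bank policy and inflation expectations",
    "cat video goes viral on youtube",
    "funny cats compilation breaks viewing records",
]

# Represent each document as a TF-IDF vector, then cluster with k-means.
# There is no outcome variable: the algorithm only groups similar documents.
tfidf = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)

for doc, label in zip(docs, labels):
    print(label, doc)
```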