Prediction vs. Inference
Introduction
Here we explore the two primary goals often sought from data, to predict and/or to infer. We talk about both individually, how they get used, and how to discern between when one should be more appropriate than the other for analysis.
Prediction
Prediction is fairly straightforward. We have some data, and we want to predict a pattern: what word comes next, what event will happen, when an event will happen, etc. If you recall the discussion on the meaning of f(x) in data science, you will recall that our primary goal is approximate
Inference
In inference, we want to understand, or infer, the nature of the relationship between
The Middle Road
Of course, all analysis utilizes both worlds. Whether through testing, model building, EDA, or just sheer intuition, we often bounce back and forth between these two settings to get a hold of our problem. For instance, doing variable selection in a regression setting can amount to testing what happens when we remove one variable at a time ("I predict if we remove variable
To fantasize of being a purist in either of these camps is to ignore the complexity of data science and to be dishonest. With this in mind…
When to Focus on Inference or Prediction
When you want to answer the following questions, focus on inference:
-
Which input variables are associated with a specific response?
-
What is the relationship between the response and the input variable(s)?
-
Can linear equations represent the relationship between the output and the inputs? Or is a non-linear solution preferred?
And for prediction, typically are we looking for:
-
When an event will occur?
-
What comes next?
-
Whether some data is this or that? (Classification)