
Shots: Confusion in correlations

By Matthias Kullowatz (@MattyAnselmo)

Much of the research I do for this site revolves around predictive analysis. I like to know which individual and team skills can be measured with stable metrics, metrics that hold true month after month. However, it's still worthwhile doing what I call explanatory analysis. Explanatory analysis involves finding the variables that explain an outcome which has already happened, even if these variables may fluctuate randomly in the future.

I have shown before that shot quality and quantity correlate well with future outcomes. With that in mind, it is somewhat confusing that the same shot information doesn't correlate nearly as well with the outcomes of the very games from which the data were gathered. Here are some interesting facts about shots.

Over the past four seasons in Major League Soccer, home teams averaged more shots in games they lost than in games they won (14.5 to 14.2). Conversely, away teams averaged more shots in wins than in losses (11.9 to 11.1). When the data are combined, the correlation between shot differential and goal differential within a match is close to zero (CI: 0.02, 0.13). Superficially, this information seems more confusing than helpful.
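For readers who want to run this kind of check themselves, here is a minimal sketch of a within-match correlation with a confidence interval. It assumes a Pearson correlation and a Fisher z-transform interval at roughly the 95% level; the article does not say which method or confidence level produced the interval quoted above, and the match values below are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation r from n matches.

    Uses the Fisher z-transform (an assumption, not necessarily the
    author's method). z_crit=1.96 corresponds to roughly 95% coverage.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


if __name__ == "__main__":
    # Hypothetical per-match values (home minus away), not real MLS data.
    shot_diffs = [5, -3, 8, -6, 2, -1, 4, -7]
    goal_diffs = [-1, 1, 0, 1, 2, -2, 1, 0]
    r = pearson(shot_diffs, goal_diffs)
    low, high = fisher_ci(r, len(shot_diffs))
    print(f"r = {r:.2f}, CI: ({low:.2f}, {high:.2f})")
```

With only a handful of matches the interval will be very wide; it takes a sample on the order of several seasons to get an interval as narrow as the one quoted above.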

This finding has led some to reason that shots are a less important metric when it comes to team evaluation. The fact that shot information is predictive is enough to convince most people (including me) that it is useful information to have. But how can it be that something predictive is not also explanatory? How can shots help to predict future outcomes, and yet not be able to explain the outcomes of those games in which they occurred?

You've probably already spotted the subtle differences between explaining and predicting, but let me take a shot (I promise that was an accident). Within a match, correlations are confusing due to all kinds of confounding variables. The answer to this question would clear some things up: "Who was winning the game when all these shots were happening?" Let's explore.

Typically, home teams outshoot away teams 14.3 to 11.3 per 96 minutes of play, and 14.1 to 10.7 in even gamestates. But when they're winning, home teams lose the shots battle 12.2 to 13.0, likely because they are more content to sit on their leads. When they're losing and desperate for points, home teams outshoot the visitors by a huge margin, 17.2 to 9.3. So I would argue that the goal differential (gamestate) influences the shot differential as much as the shot differential influences the goal differential.
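The per-96-minute rates above can be reproduced from any data set that records how long each gamestate lasted and how many shots each side took during it. The sketch below assumes a hypothetical input layout, one tuple per stretch of play of the form (gamestate, minutes, home shots, away shots); the layout, the state labels, and the sample numbers are all made up for illustration and are not the author's data.

```python
from collections import defaultdict

MATCH_MINUTES = 96  # the article normalizes shot counts to 96 minutes of play


def shots_per_96_by_state(segments):
    """Return {gamestate: (home_rate, away_rate)} in shots per 96 minutes.

    `segments` is an iterable of (gamestate, minutes, home_shots, away_shots)
    tuples. This input layout is an assumption made for this example.
    """
    totals = defaultdict(lambda: [0.0, 0, 0])  # minutes, home shots, away shots
    for state, minutes, home_shots, away_shots in segments:
        totals[state][0] += minutes
        totals[state][1] += home_shots
        totals[state][2] += away_shots
    return {
        state: (MATCH_MINUTES * home / minutes, MATCH_MINUTES * away / minutes)
        for state, (minutes, home, away) in totals.items()
        if minutes > 0
    }


if __name__ == "__main__":
    # Invented stretches of play from two imaginary matches.
    sample = [
        ("even", 48, 7, 5),
        ("home_leading", 48, 6, 7),
        ("even", 48, 7, 6),
        ("home_trailing", 30, 6, 3),
    ]
    for state, (home_rate, away_rate) in shots_per_96_by_state(sample).items():
        print(f"{state}: home {home_rate:.1f}, away {away_rate:.1f} per 96 min")
```

Pooling minutes and shots before dividing, rather than averaging per-match rates, keeps short stretches of play from dominating the result.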

It's no wonder that in-game correlations between shots and goals are so weak. Early on in games, the team that gets more shots tends to take the lead. But once they have the lead, those teams tend to ease up on shots. Thus whenever a team "holds on" to win a game, it very likely had a shot advantage at some point, and then relinquished that shot advantage in attempting to preserve the lead. Without taking the gamestates into account, a superficial analysis would suggest that shots do not correlate with wins.
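The mechanism described here can be explored with a toy simulation. The sketch below plays out matches minute by minute and lets each team's shot rate shrink when it leads and grow when it trails. Every number in it (shot rates, conversion rate, and the lead/trail multipliers) is an assumption chosen for illustration, not something fitted to MLS data, and the model ignores shot quality entirely.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_match(rng, lead_factor, trail_factor, minutes=96,
                   home_rate=0.15, away_rate=0.12, conversion=0.10):
    """Simulate one match; return (shot differential, goal differential).

    Rates are per-minute shot probabilities. All defaults are made-up values.
    """
    shots = [0, 0]  # index 0 = home, index 1 = away
    goals = [0, 0]
    base_rates = (home_rate, away_rate)
    for _ in range(minutes):
        for side in (0, 1):
            other = 1 - side
            rate = base_rates[side]
            if goals[side] > goals[other]:
                rate *= lead_factor   # team in the lead eases up
            elif goals[side] < goals[other]:
                rate *= trail_factor  # team behind pushes forward
            if rng.random() < rate:
                shots[side] += 1
                if rng.random() < conversion:
                    goals[side] += 1
    return shots[0] - shots[1], goals[0] - goals[1]


def in_game_correlation(n_matches, lead_factor, trail_factor, seed=0):
    """Correlation between final shot and goal differentials over many matches."""
    rng = random.Random(seed)
    results = [simulate_match(rng, lead_factor, trail_factor)
               for _ in range(n_matches)]
    shot_diffs = [shot_diff for shot_diff, _ in results]
    goal_diffs = [goal_diff for _, goal_diff in results]
    return pearson(shot_diffs, goal_diffs)


if __name__ == "__main__":
    # No gamestate effect: shot rates stay the same whatever the score.
    print("flat rates:      ", round(in_game_correlation(3000, 1.0, 1.0), 3))
    # Gamestate effect: leaders shoot less, trailers shoot more.
    print("gamestate effect:", round(in_game_correlation(3000, 0.7, 1.4), 3))
```

If the gamestate story is right, the second number should come out lower than the first, even though a shot is worth exactly the same in both versions of the simulation.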

I have done nothing with shot quality here, but that wasn't really the point. The point was to show that in-game correlations have to be treated with a lot of care if you want to come to any conclusions about causation. But for the curious, the in-game correlation between Expected Goal Differential 3.0 and final goal differential was 0.37 (0.32, 0.42). Though gamestates are still an issue, shot quality is able to account for the fact that the losing team will be taking lower quality shots, and we get something sort of intuitive.