The Predictive Power of Shot Locations Data / by Andrew Olsen

Two articles in particular inspired me this past week: one by Steve Fenn at the Shin Guardian, and the other by Mark Taylor at The Power of Goals. Steve showed us that, during the 2013 season, the expected goal differentials (xGD) derived from the shot locations data were better than any other available statistic at predicting outcomes in the second half of the season. It can be argued that statistics that are predictive are also stable, indicating underlying skill rather than luck or randomness. Mark came along and showed that the individual zones themselves behave differently. For example, Mark's analysis suggested that conversion rates (goal scoring rates) are more skill-driven in zones one, two, and three, but more luck-driven or random in zones four, five, and six. Piecing these fine analyses together, there is reason to believe that a partially regressed version of xGD may be the most predictive. The xGD currently presented on the site regresses all teams fully back to league-average finishing rates. However, one might guess that finishing rates in certain zones are more skill-driven, and thus more predictive. Essentially, we may be losing important information by fully regressing finishing rates to league average within each zone.
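To make "fully" versus "partially" regressed concrete, here is a minimal sketch of the idea. The function names, the `weight` parameter, and every number in it are hypothetical illustrations, not the formula actually used for the xGD on the site.

```python
def regress_rate(team_rate, league_rate, weight):
    """Shrink a team's finishing rate toward the league average.

    weight = 0.0 means full regression (use the league rate only).
    weight = 1.0 means no regression (use the team's observed rate only).
    The weight itself is a hypothetical tuning value.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return league_rate + weight * (team_rate - league_rate)


def expected_goals(attempts, team_rate, league_rate, weight):
    """Expected goals in one zone: attempts times the regressed finishing rate."""
    return attempts * regress_rate(team_rate, league_rate, weight)


# Made-up example: 40 attempts in a zone, team finishes at 16%, league at 10%.
full = expected_goals(40, 0.16, 0.10, weight=0.0)     # league rate only
partial = expected_goals(40, 0.16, 0.10, weight=0.5)  # halfway between the two
print(round(full, 2), round(partial, 2))
```

With a weight of zero every team gets the league rate, which is the fully regressed case. Any weight above zero lets some of a team's own finishing show through.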

I assessed the predictive power of finishing rates within each zone by splitting the season into two halves, and then looking at the correlation between each team's first-half and second-half finishing rates. The results are in the table below:

Zone    Correlation    P-value
1        0.11          65.6%
2        0.26          28.0%
3       -0.08          74.6%
4       -0.41           8.2%
5       -0.33          17.3%
6       -0.14          58.5%
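For anyone who wants to try the split-half check themselves, the calculation can be sketched roughly like this. The team names and the goal and attempt counts are made up for illustration and are not the MLS data behind the table.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def finishing_rate(goals, attempts):
    """Goals per attempt within a single zone."""
    return goals / attempts


# Hypothetical (goals, attempts) for one zone, by team and half of the season.
first_half = {"Team A": (6, 40), "Team B": (3, 35), "Team C": (5, 50), "Team D": (4, 30)}
second_half = {"Team A": (4, 42), "Team B": (5, 33), "Team C": (6, 48), "Team D": (2, 31)}

teams = sorted(first_half)
x = [finishing_rate(*first_half[t]) for t in teams]
y = [finishing_rate(*second_half[t]) for t in teams]

r = pearson(x, y)
print(f"split-half correlation: {r:.2f}")
```

Repeating this once per zone gives one correlation per zone, which is the shape of the table above.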

Wow. This surprised me when I saw it. There are no statistically significant correlations---especially when the issue of multiple testing is considered---and some of the suggested correlations are actually negative. Without more seasons of data (they're coming, I promise), my best guess is that finishing rates within each zone are pretty much randomly driven in MLS over 17 games. Thus full regression might be the best way to go in the first half of the season. But just in case...
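Here is one way to see the multiple testing point with the six p-values from the table. The Bonferroni correction and the 5% significance level are my assumptions for this sketch; other corrections exist and would be just as reasonable.

```python
# P-values by zone, copied from the table above and written as proportions.
zone_p_values = {1: 0.656, 2: 0.280, 3: 0.746, 4: 0.082, 5: 0.173, 6: 0.585}

alpha = 0.05  # assumed overall significance level

# Bonferroni: divide the overall level by the number of tests performed.
threshold = alpha / len(zone_p_values)

significant = [zone for zone, p in zone_p_values.items() if p < threshold]

print(f"per-test threshold: {threshold:.4f}")
print(f"zones significant after correction: {significant}")
```

Even the smallest p-value, zone four at 8.2%, misses the uncorrected 5% level, so it is nowhere near the stricter per-test threshold.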

I grouped zones one, two, and three into the "close-to-the-goal" group, and zones four, five, and six into the "far-from-the-goal" group. The results:

Group    Correlation    P-value
Close     0.23          34.5%
Far      -0.47           4.1%

Okay, well this is interesting. Yes, the multiple testing problem still exists, but let's assume for a second there actually is a moderate negative correlation for finishing rates in the "far zone." Maybe the scouting report gets out by mid-season, and defenses close out faster on good shooters from distance? Or something else? Or this is all a type-I error---I'm still skeptical of that negative correlation.

Without doing that whole song and dance for finishing rates against, I will say that the results were similar. So full regression on finishing rates for now, more research with more data later!

But now, piggybacking onto what Mark found, there do seem to be skill-based differences in how many total goals are scored by zone. In other words, some teams are designed to thrive off of a few chances from higher-scoring zones, while others perhaps are more willing to go for quantity over quality. The last thing I want to check is whether the expected goal differentials separated by zone contain more predictive information than when lumped together.

As some of Mark's work implied, I found that our expected goal differentials inside the box are very predictive of a team's actual second-half goal differentials inside the box. The correlation coefficient was 0.672, better than simple goal differential, which registered a correlation of 0.546. This means that perhaps the expected goal differentials from zones one, two, and three should get more weight in a prediction formula. Additionally, having a better goal differential outside the box, specifically in zones five and six, is probably not a good thing. That would just mean that a team is taking too many shots from poor scoring zones. In the end, I went with a model that used attempt difference from each zone, and here's the best model I found.*

Predictor        Coefficient    P-value
(Intercept)      -0.61          0.98
Zones 1, 3, 4     1.66          0.29
Zone 2            6.35          0.01
Zones 5, 6       -1.11          0.41

*Extremely similar results to using expected goal differential, since xGD within each zone is a linear function of attempts.

The R-squared for this model was 0.708, beating out the model that just used overall expected goal differential (0.650). The zone that stabilized fastest was zone two, which makes sense since about a third of all attempts come from zone two. Bigger sample sizes help with stabilization. For those curious, the inputs here were attempt differences per game over the first seventeen games, and the response is total goal differential in the second half of the season.
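To show how the fitted model turns inputs into a prediction, here is a small sketch that plugs numbers into the coefficients from the table. The function and argument names are mine, the example team is made up, and I am assuming each grouped input is the combined attempt difference per game across the zones in that group.

```python
# Coefficients copied from the model table above.
COEFFICIENTS = {
    "intercept": -0.61,
    "zones_1_3_4": 1.66,
    "zone_2": 6.35,
    "zones_5_6": -1.11,
}


def predict_second_half_gd(zones_1_3_4, zone_2, zones_5_6):
    """Predicted second-half goal differential.

    Each argument is a team's attempt difference per game over the first
    seventeen games for that group of zones (grouping by sum is assumed).
    """
    return (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["zones_1_3_4"] * zones_1_3_4
        + COEFFICIENTS["zone_2"] * zone_2
        + COEFFICIENTS["zones_5_6"] * zones_5_6
    )


# Hypothetical team: +0.5 attempts per game in zones 1, 3, and 4,
# +1.0 in zone 2, and -0.5 in zones 5 and 6.
prediction = predict_second_half_gd(0.5, 1.0, -0.5)
print(f"predicted second-half goal differential: {prediction:+.2f}")
```

Most of this made-up team's prediction comes from the zone two term. That matches the table, where zone two has both the largest coefficient and the only small p-value.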

This research doesn't close the door on anything, but I would suggest that each zone contains unique information, and that separating the zones out could strengthen predictions by a measurable amount. I would also suggest that breaking shots down by angle and distance, and then by kicked versus headed, would be even better. We all have our fantasies.