By Matthias Kullowatz (@MattyAnselmo)
It is my opinion that a statistic capable of predicting itself, and perhaps more importantly of predicting future success, is superior to one that only correlates with "simultaneous success." For example, a team's actual goal differential correlates strongly with its current position in the table, but does not predict the team's future goal differential or future points earned nearly as well. I created the expected goals metrics to be predictive at the team level, so without further ado, let's see how the 2.0 version did in 2013.
In predicting future goals scored and allowed, the baseline is to use past goals scored and allowed. In this case, expected goals beats actual goals in its predictive ability by quite a bit.*
| Predictor | Response | R-squared | p-value |
|---|---|---|---|
| xGD (by gamestate) | GD (last 17) | 0.805 | 0.000 |
| xGD (first 17) | GD (last 17) | 0.800 | 0.000 |
| xGA (first 17) | GA (last 17) | 0.604 | 0.000 |
| GD (first 17) | GD (last 17) | 0.487 | 0.000 |
| xGF (first 17) | GF (last 17) | 0.409 | 0.004 |
| GA (first 17) | GA (last 17) | 0.239 | 0.024 |
| GF (first 17) | GF (last 17) | 0.155 | 0.099 |
Whether you're interested in offense, defense, or differential, Expected Goals 2.0 outperformed actual goals in its ability to predict the future (the future in terms of goal scoring, that is). That 0.800 R-squared figure for xGD 2.0 even beats xGD 1.0, calculated at 0.624 by one Steve Fenn. One interesting note is that splitting expected goals into even and non-even gamestates added very little predictive ability (R-squared = 0.805, compared with 0.800 without the split).
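For anyone who wants to try this kind of split-season check on their own numbers, here is a minimal Python sketch. The team names and per-game values are made up, the helper names `split_totals` and `r_squared` are hypothetical, and the sketch leaves out the home-games control variable, so it is not expected to reproduce the figures in the table.

```python
def split_totals(per_game, n_first):
    """Sum per-game values over the first n_first games and over the rest."""
    return sum(per_game[:n_first]), sum(per_game[n_first:])


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


# Made-up per-game goal differentials for three hypothetical teams.
season = {
    "Team A": [1, 0, 2, 1, 1, 0],
    "Team B": [-1, 0, -2, 0, -1, -1],
    "Team C": [0, 1, -1, 0, 1, 0],
}

# Predictor: total over the first 3 games. Response: total over the rest.
first, rest = zip(*(split_totals(games, 3) for games in season.values()))
print(r_squared(first, rest))
```

With a full season of data, `n_first` would be 17 for the half-season comparison.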
Most of the statistics above showed some predictive ability after 17 games, but what about after fewer games? How early do these goal scoring statistics become stable predictors of future goal scoring? I reduced the games played for my predictor variables down to four games, which is where most teams currently are in the season, and here are those results.
| Predictor | Response | R-squared | p-value |
|---|---|---|---|
| xGD (by gamestate) | GD (last 30) | 0.247 | 0.104** |
| xGA (first 4) | GA (last 30) | 0.236 | 0.033 |
| xGD (first 4) | GD (last 30) | 0.227 | 0.028 |
| xGF (first 4) | GF (last 30) | 0.140 | 0.093 |
| GF (first 4) | GF (last 30) | 0.022 | 0.538 |
| GD (first 4) | GD (last 30) | 0.015 | 0.616 |
| GA (first 4) | GA (last 30) | 0.003 | 0.835 |
Some information early on is just noise, but we see statistically significant correlations from expected goals on defense (xGA) and in differential (xGD) after only four games! Again, we don't see much improvement, if any at all, in separating out xGD for even and non-even gamestates. If we were to look at points in the table as a response variable, or perhaps include information on minutes spent in each gamestate, we might see something different there, but that's for another week!
Check out the updated 2014 Expected Goals 2.0 tables, which now just might be meaningful in predicting team success for the rest of the season.
*A "home-games-played" variable was used as a control variable to account for those teams whose early schedules are weighted toward one extreme. R-squared values and p-values were derived from a sequential sum of squares, thus reducing the effects of home games played on the p-value.
**Though the R-squared value was higher, splitting up xGD into even and non-even gamestates seemed to muddle the p-values. The regression was unsure as to where to apportion credit for the explanation, essentially.
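The control-variable idea in the first footnote can be sketched in a few lines of Python. This is one plausible reading, not necessarily the exact calculation behind the tables: it fits home games played first, then asks what share of the remaining variation in the response is explained by the predictor. The function names and all of the data are made up for illustration.

```python
def residuals(y, x):
    """Residuals from a least-squares fit of y on x, with an intercept."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return [(b - mean_y) - slope * (a - mean_x) for a, b in zip(x, y)]


def r2_after_control(response, control, predictor):
    """Share of the variation left after the control that the predictor explains.

    Assumption: the control enters the model first, and the predictor's
    sequential sum of squares is compared with what the control left over.
    """
    resid_y = residuals(response, control)
    resid_x = residuals(predictor, control)
    ss_left = sum(r * r for r in resid_y)
    cross = sum(a * b for a, b in zip(resid_x, resid_y))
    ss_predictor = cross ** 2 / sum(a * a for a in resid_x)
    return ss_predictor / ss_left


# Made-up values for six hypothetical teams.
home_games = [8, 9, 7, 10, 8, 9]             # home games in the early stretch
xgd_first = [1.0, -2.0, 0.5, 3.0, -1.5, 2.0]  # early-season xGD
gd_last = [3, -4, 0, 5, -2, 1]                # rest-of-season GD

print(r2_after_control(gd_last, home_games, xgd_first))
```

Fitting the control first means the predictor only gets credit for variation that home games played could not already account for.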