By Matthias Kullowatz (@MattyAnselmo)

It is my opinion that a statistic capable of predicting itself, and perhaps more importantly predicting future success, is superior to one that only correlates with "simultaneous success." For example, a team's actual goal differential correlates strongly with its current position in the table, but it does not predict the team's future goal differential or future points earned nearly as well. I created the expected goals metrics to be predictive at the team level, so without further ado, let's see how the 2.0 version did in 2013.

Mid-season Split

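In predicting future goals scored and allowed, the baseline is to use past goals scored and allowed. In this case, expected goals from the first 17 games beats actual goals from the first 17 games by quite a bit: for goal differential over the last 17 games, the R-squared is 0.800 against 0.487.*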

  Predictor            Response       R-squared   p-value
  xGD (by gamestate)   GD (last 17)   0.805       0.000
  xGD (first 17)       GD (last 17)   0.800       0.000
  xGA (first 17)       GA (last 17)   0.604       0.000
  GD (first 17)        GD (last 17)   0.487       0.000
  xGF (first 17)       GF (last 17)   0.409       0.004
  GA (first 17)        GA (last 17)   0.239       0.024
  GF (first 17)        GF (last 17)   0.155       0.099

Whether you're interested in offense, defense, or differential, Expected Goals 2.0 outperformed actual goals in its ability to predict the future (the future in terms of goal scoring, that is). That 0.800 R-squared figure for xGD 2.0 even beats xGD 1.0, calculated at 0.624 by one Steve Fenn. One interesting note is that splitting expected goals into even and non-even gamestates added very little predictive ability (an R-squared of 0.805 versus 0.800).
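For readers who want to try this kind of comparison themselves, here is a minimal sketch of how an R-squared could be credited to a predictor after a home-games control has entered the model first, in the spirit of the sequential sum of squares mentioned in the first footnote. The function names and the six-team data set are made up for illustration, and dividing the predictor's sum of squares by the total sum of squares is my assumption about how such a figure might be computed, not a record of the exact calculation behind the table.

```python
# Sketch only: every number below is invented, not real team data.

def mean(values):
    return sum(values) / len(values)

def residuals(y, x):
    """Residuals from a simple least-squares regression of y on x (with intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return [(yi - y_bar) - slope * (xi - x_bar) for xi, yi in zip(x, y)]

def sequential_r2(response, control, predictor):
    """Share of total variation in `response` credited to `predictor`
    after `control` has already been entered into the model."""
    y_bar = mean(response)
    ss_total = sum((yi - y_bar) ** 2 for yi in response)
    # Strip the control out of both the response and the predictor.
    ry = residuals(response, control)
    rx = residuals(predictor, control)
    # Extra sum of squares explained by the predictor once the control is in:
    # (sum of rx*ry)^2 / (sum of rx^2).
    ss_predictor = sum(a * b for a, b in zip(rx, ry)) ** 2 / sum(a * a for a in rx)
    return ss_predictor / ss_total

# Hypothetical first-half and second-half figures for six placeholder teams.
home_games_first_half = [8, 9, 7, 10, 8, 9]
xgd_first_half = [5.1, -2.3, 0.4, 3.2, -4.0, 1.0]
gd_second_half = [6, -4, 1, 2, -5, 0]

r2 = sequential_r2(gd_second_half, home_games_first_half, xgd_first_half)
print(round(r2, 3))
```

Swapping in actual goal differential for `xgd_first_half` would give the baseline figure to compare against.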

Early-season Split

Most of the statistics above showed some predictive ability in 17 games, but what about in fewer games? How early do these goal scoring statistics become stable predictors of future goal scoring? I reduced the games played for my predictor variables down to four games, the point in the season that most teams have currently reached, and here are those results.

  Predictor            Response       R-squared   p-value
  xGD (by gamestate)   GD (last 30)   0.247       0.104**
  xGA (first 4)        GA (last 30)   0.236       0.033
  xGD (first 4)        GD (last 30)   0.227       0.028
  xGF (first 4)        GF (last 30)   0.140       0.093
  GF (first 4)         GF (last 30)   0.022       0.538
  GD (first 4)         GD (last 30)   0.015       0.616
  GA (first 4)         GA (last 30)   0.003       0.835

Some information early on is just noise, but we see statistically significant correlations from expected goals on defense (xGA) and in differential (xGD) after only four games! Again, we don't see much improvement, if any at all, from separating xGD into even and non-even gamestates. If we were to look at points in the table as a response variable, or perhaps include information on minutes spent in each gamestate, we might see something different there, but that's for another week!
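The early-versus-late split itself is simple bookkeeping. As a rough sketch, assuming a chronological game log stored as (team, value) pairs (a hypothetical layout, with placeholder team names and invented numbers), the first few games and the remainder could be totaled like this:

```python
from collections import defaultdict

def split_totals(games, n_first):
    """Sum each team's per-game values over its first `n_first` games
    and over all of its later games.

    `games` is a list of (team, value) tuples in chronological order,
    a layout assumed purely for this sketch.
    """
    played = defaultdict(int)
    early = defaultdict(float)
    late = defaultdict(float)
    for team, value in games:
        if played[team] < n_first:
            early[team] += value
        else:
            late[team] += value
        played[team] += 1
    return dict(early), dict(late)

# Made-up per-game expected goal differentials for two placeholder teams.
game_log = [
    ("Team A", 0.5), ("Team B", -0.5),
    ("Team A", 1.0), ("Team B", 0.25),
    ("Team A", -0.25), ("Team B", 0.75),
]
early, late = split_totals(game_log, n_first=2)
```

Calling the helper once on expected goals and once on actual goals, with `n_first=4`, would give the predictor from the first four games and the response from the games that follow.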

Check out the updated 2014 Expected Goals 2.0 tables, which now just might be meaningful in predicting team success for the rest of the season.

*A "home-games-played" variable was used as a control variable to account for those teams whose early schedules were weighted toward one extreme (mostly home or mostly away). R-squared values and p-values were derived from a sequential sum of squares, thus reducing the effects of home games played on the p-value.

**Though the R-squared value was higher, splitting up xGD into even and non-even gamestates seemed to muddle the p-values. Essentially, the regression was unsure how to apportion credit for the explanation between the two.

American Soccer Analysis

Predictive strength of Expected Goals 2.0
