mls soccer stabilization

Does last season matter? by Drew Olsen

We've shown time and time again how helpful a team's shot rates are in projecting how well that team is likely to do going forward. To this point, however, data has always been contained in-season, ignoring what teams did in past seasons. Since most teams keep large percentages of their personnel, it's worth looking into the predictive power of last season. We don't currently have shot locations for previous seasons, but we do have general shot data going back to 2011. This means that I can look at all the 2012 and 2013 teams, and how important their 2011 and 2012 seasons were, respectively. Here goes.

First, I split each season into two halves, calculating stats from each half. Let's start by leaving out the previous season's data. Here is the predictive power of shot rates and finishing rates, where the response variable is second-half goal differential.

Stat                         Coefficient   P-value
Intercept                    -28.36792     0.04%
Attempt Diff (first 17)      0.14244       0.00%
Finishing Diff (first 17)    77.06047      1.18%
Home Remaining               3.37472       0.03%

To summarize, I used total shot attempt differential and finishing rate differential from the first 17 games to predict the goal differential for each team in the final 17 games. Also, I controlled for how many home games each team had remaining. The sample size here is the 56 team-seasons from 2011 through 2013. All three variables are significant in the model, though the individual slopes should be interpreted carefully.*

The residual standard error for this model is high at 6.4 goals of differential. Soccer is random, and predicting exact goal differentials is impossible, but that doesn't mean this regression is worthless. The R-squared value is 0.574, though as James Grayson has pointed out to me, the square root of that figure (0.757) makes more intuitive sense. One might say that we are capable of explaining 57.4 percent of the variance in second-half goal differentials, or 75.7 percent of the standard deviation (sort of). Either way, we're explaining something, and that's cool.
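The arithmetic behind those two readings is short enough to check directly. This sketch uses only the R-squared reported above; the variable names are illustrative.

```python
import math

r_squared = 0.574                   # reported above
r = math.sqrt(r_squared)            # correlation between fitted and actual values
unexplained_share = 1 - r_squared   # share of variance the model leaves unexplained

print(f"R = {r:.2f}, unexplained variance = {unexplained_share:.1%}")
```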

But we're here to talk about the effects of last season, so without further mumbo jumbo, the results of a more-involved linear regression:

Stat                            Coefficient   P-value
Intercept                       -31.3994      1.59%
Attempt Diff (first 17)         0.12426       0.03%
Attempt Diff (last season)      0.02144       28.03%
Finishing Diff (first 17)       93.27359      1.14%
Finishing Diff (last season)    72.69412      12.09%
Home Remaining                  3.71992       1.53%

Now we've added teams' shot and finishing differentials from the previous season. Obviously, I had to cut out the 2011 data (since 2010 is not available to me currently), as well as Montreal's 2012 season (since they made no Impact in 2011**). This left me with a sample size of 37 team-seasons. Though the residual standard error was a little higher at 6.6 goals, the regression now explained 65.2 percent of the variance in second-half goal differential. Neither last-season coefficient reaches conventional significance on its own (p-values of 28.03% and 12.09%), so larger sample sizes would be nice, and I'll work on that. For now it seems that, even halfway through a season, the previous season's data may improve the projection, especially when it comes to finishing rates.

But what about projecting outcomes for, say, a team's fourth game of the season? Using its rates from just three games of the current season would lead to shaky projections at best. I theorize that, as a season progresses, the current season's data get more and more important for the prediction, while the previous season's data become relatively less important.

My results were most assuredly inconclusive, but they leaned in a rather strange direction. The previous season's shot data was seemingly more helpful in predicting outcomes during the second half of the season than it was in the first half (except, of course, for the first few weeks of the season). Specifically, the previous season's shot data was more helpful for predicting games from weeks 21 to 35 than it was for weeks 6 to 20. This was true for finishing rates as well, and it led me to recheck my data. I found no errors, and now I'm left to explain why information from a team's previous season helps project game outcomes in the second half of the current season better than in the first half.

Anybody want to take a look? Here are the results of some logistic regression models. Note that the coefficients represent the estimated change in (natural) log odds of a home victory.

Weeks 6 - 20              Coefficient   P-value
Intercept                 0.052         67.36%
Home Shot Diff            0.139         0.35%
H Shot Diff (previous)    -0.073        29.30%
Away Shot Diff            -0.079        7.61%
A Shot Diff (previous)    -0.052        47.09%

Weeks 21 - 35             Coefficient   P-value
Intercept                 0.036         78.94%
Home Shot Diff            0.087         19.37%
H Shot Diff (previous)    0.181         6.01%
Away Shot Diff            -0.096        15.78%
A Shot Diff (previous)    -0.181        4.85%

Later in the season, during weeks 21 to 35, the previous season's data actually appears to become more important to the prediction than the current season's data, both in statistical significance and in the size of the coefficients. This is despite the current season's shot data being based on an ample sample of at least 19 games (depending on the specific match in the data set). So I guess I'm comfortable saying that last season matters, but I'm still confused, a condition I face daily.
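Because the coefficients are on the log-odds scale, turning them into a win probability takes one extra step. Here is a rough Python sketch using the weeks 21 to 35 coefficients from the table above. The function names and the example shot differentials are hypothetical, and the inputs are assumed to be in whatever units the model was fit on, which the tables do not state.

```python
import math

# Coefficients from the weeks 21 - 35 logistic regression above.
WEEKS_21_35 = {
    "intercept": 0.036,
    "home_shot_diff": 0.087,
    "home_shot_diff_prev": 0.181,
    "away_shot_diff": -0.096,
    "away_shot_diff_prev": -0.181,
}


def home_win_log_odds(coefs, home_cur, home_prev, away_cur, away_prev):
    """Linear predictor: estimated natural log odds of a home victory."""
    return (coefs["intercept"]
            + coefs["home_shot_diff"] * home_cur
            + coefs["home_shot_diff_prev"] * home_prev
            + coefs["away_shot_diff"] * away_cur
            + coefs["away_shot_diff_prev"] * away_prev)


def log_odds_to_probability(log_odds):
    """Inverse logit: maps log odds onto a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical matchup: home side +2 this season and +1 last season,
# away side dead even in both seasons.
log_odds = home_win_log_odds(WEEKS_21_35, 2, 1, 0, 0)
print(f"Home win probability: {log_odds_to_probability(log_odds):.1%}")
```

With every differential set to zero, only the intercept remains, and the implied home win probability is about 51 percent.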

*The model suggests that each additional home game remaining projects a three-goal improvement in differential (3.37, actually). In a vacuum, that makes no sense. However, we are not vacuuming. Teams that have more home games remaining have also played a tougher schedule. Thus the +3.37 coefficient for each additional home game remaining is also adjusting the projection for teams whose shot rates are suffering due to playing on the road more frequently.

**Drew hates me right now.

Signal and Noise in MLS by Drew Olsen

Some Nate Silver guy wrote a whole book about "signal" and "noise" in data, so it must be important, right? Sports produce a lot of statistics, and it turns out that some of those statistics are pretty meaningless, that is, pretty noisy. A pitcher's ERA is sitting below 3.00 after eight starts, but he has more walks than strikeouts. Baseball sabermetricians will tell you that the low ERA is mostly noise, but that the high walk rate is a signal for impending doom. An MLS team leads the league in points per match, but it allows more shots than it earns for itself (note: this team is called "Montreal Impact"). Soccer nerds like me will tell you that its position in the standings is mostly noise, and that its low shots ratio is a signal for impending doom, or something worse than first place, anyway.

The reasoning behind both examples above is basically the same. Pitchers' ERAs, like soccer teams' points earned, are highly variable and unpredictable, while strikeout-to-walk ratios and shots ratios are more consistent. It's better to put your money on something consistent and easy to predict, rather than something variable and hard to predict. Duh, right?

So here's why we like shots data 'round these parts. Below I have provided two charts of MLS data, one from 2012 and one from 2013. I split each season into two parts and then measured the linear predictive power of each stat on itself. Did teams that scored lots of goals early in the season also score lots of goals later in the season? That's the kind of question answered here.

2012 MLS Stat     R-squared   P-value     2013 MLS Stat     R-squared   P-value
Blocked Shots     37.1%       0.6%        Shots off Goal    34.8%       0.8%
Total Attempts    26.1%       2.5%        Total Attempts    34.5%       0.8%
Goals             20.3%       5.3%        Shots on Goal     29.4%       1.7%
Points            20.1%       5.5%        Points            4.1%        40.7%
Shots on Goal     18.2%       6.9%        Blocked Shots     1.7%        60.0%
Shots off Goal    3.6%        43.7%       Goals             1.5%        61.6%

As an example of what this means, let's consider the attempts stat. Remember that an attempt is any effort in the direction of the goal, so basically an attempt is any shot---on target, off target, or blocked. In each of the past two seasons, MLS teams' attempts totals in the first half of the season were able to help predict their attempts totals in the second half, explaining 26.1% and 34.5% of the variability in second-half attempts, respectively. Those might not seem like high percentages of explanation, but the MLS season is short, and statistically significant predictors are hard to find.
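For a single stat, this split-half check amounts to regressing second-half totals on first-half totals and reading off the R-squared, which for a one-predictor linear fit equals the squared correlation. A minimal sketch follows; the function name and the six-team numbers are placeholders for illustration, not real MLS data.

```python
def split_half_r_squared(first_half, second_half):
    """R-squared from regressing second-half values on first-half values.

    For a simple linear regression with one predictor, this equals the
    squared Pearson correlation between the two halves.
    """
    n = len(first_half)
    if n != len(second_half) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 teams")
    mean_x = sum(first_half) / n
    mean_y = sum(second_half) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(first_half, second_half))
    sxx = sum((x - mean_x) ** 2 for x in first_half)
    syy = sum((y - mean_y) ** 2 for y in second_half)
    if sxx == 0 or syy == 0:
        raise ValueError("a half with no variation has no defined R-squared")
    return (sxy * sxy) / (sxx * syy)


# Placeholder attempt totals for six made-up teams (NOT real MLS data).
first_half_attempts = [210, 195, 240, 180, 225, 200]
second_half_attempts = [205, 190, 220, 200, 230, 185]
r2 = split_half_r_squared(first_half_attempts, second_half_attempts)
print(f"R-squared: {r2:.3f}")
```

Running the same function once per stat (goals, points, shots on goal, and so on) would produce one row of a table like the ones above, minus the p-values.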

In baseball, this kind of self-prediction has been referred to as "stabilization." Stabilization is important because, as mentioned above, it means that a stat is consistent and that a team is likely to replicate its results in the future. This MLS season, points earned during the first 10 matches were essentially worthless at predicting points earned in the next 10 matches. Even over the 34 games each team played in 2012, the stabilization for points earned was not as strong as that of attempts or goals scored.*

The next step is figuring out what predicts future points earned, since points do a pretty lame job of predicting themselves. But I'll leave that for another post after I have gathered data going back a few more seasons. The number one takeaway here is that some stats can only tell us what happened, but not what will happen. There is another group of stats that are doubly important because they also stabilize, predicting themselves using smaller sample sizes. Those stabilizing stats (like shot attempts) are the signal amid the sea of noise known most places as "football."

*Seattle has only played 21 games, so I cannot do 11-and-11 splits yet. Also, as for why shots off goal and blocked shots have essentially switched places, I would wager that's more due to how they are (somewhat) subjectively categorized, but who knows.