By Matthias Kullowatz (@mattyanselmo)
We have updated our playoff seeding projections on our web application, which show the probability that each team finishes in each playoff seed position within its conference. We have done this in years past, but dedicated fans of the site will recognize that this is a bit earlier in the season than usual. Some tweaks to the predictive model have allowed us to publish meaningful predictions sooner! I’m here to tell you about those tweaks.
We built two separate models to predict the expected number of goals the home and away teams will score in a match, respectively. We assume that each team’s actual goals follow a Poisson distribution, and further assume that the home and away goal totals are independent of each other. These are certainly simplifying assumptions, but they produce reasonable estimates for match scores. We need these match scores to keep track of not only wins and draws, but also goals scored and allowed, which are key tie-breakers in playoff seeding.
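To make the independent-Poisson assumption concrete, here is a minimal Python sketch (not our production code, which lives in R). It turns a pair of expected-goal values into a probability for every scoreline, then rolls those up into win, draw, and loss probabilities. The function names, the 10-goal cutoff, and the example means of 1.66 and 1.21 are illustrative assumptions, not outputs of the actual models.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k goals when goals are Poisson with this mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def scoreline_probs(home_mean: float, away_mean: float, max_goals: int = 10) -> dict:
    """Probability of each (home goals, away goals) pair.

    Independence lets us multiply the two Poisson probabilities.
    Scorelines above max_goals are ignored (an assumed cutoff).
    """
    return {
        (h, a): poisson_pmf(h, home_mean) * poisson_pmf(a, away_mean)
        for h in range(max_goals + 1)
        for a in range(max_goals + 1)
    }


def outcome_probs(home_mean: float, away_mean: float, max_goals: int = 10) -> tuple:
    """Collapse the scoreline grid into (home win, draw, away win)."""
    grid = scoreline_probs(home_mean, away_mean, max_goals)
    home = sum(p for (h, a), p in grid.items() if h > a)
    draw = sum(p for (h, a), p in grid.items() if h == a)
    away = sum(p for (h, a), p in grid.items() if h < a)
    return home, draw, away


# Illustrative expected goals only; real values would come from the two models.
home_win, draw, away_win = outcome_probs(1.66, 1.21)
print(f"home {home_win:.3f}, draw {draw:.3f}, away {away_win:.3f}")
```

Keeping the full scoreline grid, rather than just the three outcome probabilities, is what makes goals scored and allowed available for tie-breakers.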
On the Team xGoals tab, you now have the option to “home-adjust” the team stats, neutralizing the large home-field advantage MLS teams enjoy in their own venues. We now use these home-adjusted stats in the predictive model, as well, which helps account for imbalanced schedules early in the year (hey, Timbers!). It is effectively a strength of schedule adjustment.
As an example of how this is done, consider our xG adjustments. In 2018 home teams averaged 1.66 xG per game and away teams averaged 1.21. Suppose that in a particular match those averages were the exact outcomes; that is, the home team generated 1.66 xG and the away team generated 1.21. Our home adjustment makes sure that each team in such a match will come away with exactly the same xG figure, because that game represented an average outcome. Neither team outperformed expectations. So in 2018, we multiply the home team’s xG by 0.86 and the away team’s xG by 1.18. Applied to this particular match, the adjustment leaves each team with 1.43 xG, and again, that’s by design. For predictive purposes, this game was a tie.
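The arithmetic in that example can be checked with a few lines of Python. This is only a sketch: the two multipliers are the rounded 2018 values quoted above, and the function and constant names are made up for illustration.

```python
# Rounded 2018 multipliers quoted in the text (names are hypothetical).
HOME_MULT = 0.86
AWAY_MULT = 1.18


def home_adjust(xg: float, is_home: bool) -> float:
    """Scale a team's single-game xG to a venue-neutral figure."""
    return xg * (HOME_MULT if is_home else AWAY_MULT)


# The "average" 2018 match: 1.66 xG at home, 1.21 xG away.
home_adj = home_adjust(1.66, is_home=True)
away_adj = home_adjust(1.21, is_home=False)
print(round(home_adj, 2), round(away_adj, 2))  # 1.43 1.43
```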
Past season statistics
Just one or two games into a season, no team-level statistics are predictive. But it turns out that a team’s past season contains some signal of what may happen this season. The plot below filters data to the first 15 weeks of the season. The x-axis represents the difference between the home and away teams’ xGD from last season, and the y-axis represents the winning percentage of the home team across 1,000 MLS games (~200 in each of five buckets).
There is an obvious relationship between last season’s stats and this season’s results in the first 15 weeks of the season. We work this finding into the model by taking a weighted average of this season’s stats and last season’s stats for each team. The earlier in the season the game takes place, the greater the weight given to last season’s stats.
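Here is a rough Python sketch of that weighted average. The linear fade over 15 weeks is purely an assumption for illustration; the actual weights in the model are not spelled out here, and the function names are hypothetical.

```python
def last_season_weight(week: int, fade_weeks: int = 15) -> float:
    """Weight on last season's stat.

    ASSUMPTION: a linear fade from 1.0 at week 0 to 0.0 at fade_weeks.
    The real weighting scheme may look quite different.
    """
    return max(0.0, 1.0 - week / fade_weeks)


def blended_stat(this_season: float, last_season: float, week: int) -> float:
    """Weighted average of this season's and last season's value of a stat."""
    w = last_season_weight(week)
    return w * last_season + (1.0 - w) * this_season


# Hypothetical team: +0.40 xGD per game last season, -0.20 so far this season.
for week in (0, 5, 10, 15):
    print(week, round(blended_stat(-0.20, 0.40, week), 3))
```

Early in the year the blended value sits close to last season's number, and it slides toward this season's number as more games are played.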
Interestingly, team salary is not a key driver in the predictive model. It does become a little more predictive as the season wears on, perhaps because richer teams are better able to fill holes. But regardless of why, the model pretty clearly finds only weak relationships between team salary and performance once past performance is accounted for.
We get more data every year, and yet these models are still built on relatively few data points. To help produce predictive models that aren’t overfit to so little data, we started using regularization. This is a statistical technique in which the coefficients are penalized, shrinking them toward 0, in order to avoid overfitting (to completely oversimplify the process). In fact, for #rstats nuts, we use the mgcv package with penalized cubic splines in the gam() function. This also makes me feel more comfortable putting the model fitting on autopilot, knowing that it’s not likely to overfit.
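For a feel of what "penalized toward 0" means, here is a toy Python sketch. It is not the penalized cubic splines we fit with mgcv; it swaps in the simplest possible case, a one-variable ridge regression with no intercept, where the penalized slope has a closed form. The data points and function name are invented for illustration.

```python
def ridge_slope(xs, ys, penalty: float) -> float:
    """Slope b minimizing sum((y - b*x)**2) + penalty * b**2.

    With penalty = 0 this is the ordinary least-squares slope;
    larger penalties shrink the slope toward 0.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


# Made-up data: a predictor (say, an xGD difference) and an outcome.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [-0.8, -0.1, 0.1, 0.6, 0.7]

for penalty in (0.0, 2.5, 25.0):
    print(penalty, round(ridge_slope(xs, ys, penalty), 3))
```

The bigger the penalty, the less the fit is allowed to chase noise in a small sample, which is the same idea the spline penalty applies to wiggliness.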
With a predictive model that is more meaningful early in the season, we are able to simulate the remaining games this season to arrive at credible playoff seeds for both conferences. Check out the Predictions tab on the web app!
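For the curious, the simulation step can be sketched in Python along these lines. Everything specific here is an assumption for illustration: the three-team table, the fixture list with its expected goals, the number of simulations, and the simplified tie-break order (points, then goal difference, then goals scored). It is not the code behind the web app.

```python
import math
import random
from collections import Counter, defaultdict


def sample_poisson(mean: float, rng: random.Random) -> int:
    """Draw one Poisson-distributed goal count (Knuth's multiplication method)."""
    limit = math.exp(-mean)
    goals, product = 0, rng.random()
    while product > limit:
        goals += 1
        product *= rng.random()
    return goals


def simulate_seeds(table, fixtures, n_sims=2000, seed=0):
    """Estimate each team's probability of finishing in each seed.

    table:    team -> (points, goal difference, goals scored) so far
    fixtures: list of (home, away, home expected goals, away expected goals)
    """
    rng = random.Random(seed)
    seed_counts = defaultdict(Counter)
    for _ in range(n_sims):
        sim = {team: list(row) for team, row in table.items()}
        for home, away, home_mean, away_mean in fixtures:
            hg = sample_poisson(home_mean, rng)
            ag = sample_poisson(away_mean, rng)
            for team, gf, ga in ((home, hg, ag), (away, ag, hg)):
                sim[team][0] += 3 if gf > ga else 1 if gf == ga else 0
                sim[team][1] += gf - ga
                sim[team][2] += gf
        # ASSUMPTION: simplified tie-breakers (points, goal difference, goals scored).
        ranked = sorted(sim, key=lambda t: tuple(sim[t]), reverse=True)
        for position, team in enumerate(ranked, start=1):
            seed_counts[team][position] += 1
    return {
        team: {pos: n / n_sims for pos, n in sorted(counts.items())}
        for team, counts in seed_counts.items()
    }


# Hypothetical mini-conference and remaining schedule.
table = {"A": (10, 3, 8), "B": (9, 1, 7), "C": (7, -4, 5)}
fixtures = [("A", "B", 1.6, 1.2), ("B", "C", 1.7, 1.0), ("C", "A", 1.4, 1.3)]
seed_probs = simulate_seeds(table, fixtures)
for team, probs in seed_probs.items():
    print(team, probs)
```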