Explaining our 2020 MLS playoffs projections

By Matthias Kullowatz (@mattyanselmo)

Predicting playoff outcomes in MLS has always been particularly difficult. While about 400 regular season games may seem like a lot, it is still not even close to enough of a sample size to home in on fine differences between teams through the data alone. And now, with a COVID-shortened season and fake home games, it’s even more difficult. With that said, here are our model’s predicted probabilities of each team making it to each stage of the MLS Cup Playoffs, along with the implied championship probabilities from the Bovada sportsbook and championship probability differences between the two. I’ll go over what we did to produce these predictions, and what missing information could make them better.

[Table: MLSPlayoffsTable_2020.PNG, each team's probability of reaching each playoff stage, alongside Bovada's implied championship probabilities]

How do we derive these predictions?

First, we need a model that tells us the probability that any given team will advance against any other given team. So if the Portland Timbers were to meet the Philadelphia Union in the championship game, we need a model to tell us that the Timbers would have a 31% probability of beating Philly (that's what it says!) and taking home the hardware. We've chosen to use a combination of two Poisson models, one for the home team and one for the away team.

The home team’s model predicts the mean, or expected, number of goals the home team will score based on team-level metrics like the home team’s goals added (g+) produced during the season, g+ produced last season, and g+ produced in the most recent 10 games. That model also includes the away team’s g+ allowed during all of those aforementioned time frames. That seems like the relevant information to determine how many goals the home team is likely to score. We build a second, mirrored model for the away team with their g+ for and the home team’s g+ against. Once we have the mean predictions for each team, we make the simplifying assumptions that the two goal totals are independent of one another, and that they follow a Poisson distribution. These assumptions, like pretty much all components of any model, are “wrong” in some sense, but less wrong than you might think.
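To make those inputs a bit more concrete, here's a minimal sketch of how the g+ windows for the home side of the model might be assembled from a team-game table. The column names and file path are hypothetical placeholders, not our actual pipeline.

```python
import pandas as pd

# Hypothetical team-game table: one row per team per game, in date order,
# with that team's g+ for (a mirrored table would hold g+ against).
tg = pd.read_csv("team_games.csv")  # placeholder path
tg = tg.sort_values(["team", "date"])

# Season-to-date g+ per game, excluding the current game
tg["gplus_for_season"] = (
    tg.groupby(["team", "season"])["gplus_for"]
      .transform(lambda s: s.shift().expanding().mean())
)

# g+ per game over the most recent 10 games (rolling across seasons)
tg["gplus_for_last10"] = (
    tg.groupby("team")["gplus_for"]
      .transform(lambda s: s.shift().rolling(10, min_periods=5).mean())
)

# Last season's g+ per game would be merged on from a season-level summary.
```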

One perhaps less obvious challenge here is choosing what data to train the model on. Typically the playoffs start after 34 games, and you might imagine that we just use all past playoff games to fine-tune the relationships between a team's historical g+ metrics and its future goals scored. There are at least two major problems with that approach. There just aren't enough playoff games since 2013 to tune a decent model, and even if there were, teams have played about 23 games this season rather than 34. The relative meaningfulness of recent-10-game g+ vs. season-long g+ vs. last-season g+ all changes when you've only played 23 games. A team's recent 10-game stretch, as well as what it did last season, both become relatively more predictive information for the 2020 playoffs.

So instead we tune an XGBoost gradient-boosted trees model on all regular season games since 2013 (more than 2,500 games) and feed a "games played up to this point" variable into the algorithm. In this way, the model is allowed to look for meaningful correlations between the g+ metrics and future goals, and to differentiate just how meaningful those correlations are early in the season vs. late in the season. The model is likely to find that the prior season's g+ totals are a better predictor of future goals scored when it's the 4th game of the season, but that the current season's g+ totals are more important when it's the 34th game of the season. This approach gives us many more data points to work with, and allows us to make predictions of goal totals in future games whether teams have played a few games, 23 games, or 34 games.
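As a rough illustration of that training setup, here's what fitting the home-goals model could look like with the xgboost Python package. The feature names, file path, and hyperparameters here are placeholders for illustration, not our production settings.

```python
import pandas as pd
import xgboost as xgb

# One row per regular-season game since 2013 (hypothetical columns).
games = pd.read_csv("regular_season_games_2013_2020.csv")  # placeholder path

# Home team's attacking g+ over several windows, away team's defensive g+
# over the same windows, plus how far into the season the game is.
features = [
    "home_gplus_for_season", "home_gplus_for_last10", "home_gplus_for_lastyear",
    "away_gplus_against_season", "away_gplus_against_last10",
    "away_gplus_against_lastyear", "games_played_to_date",
]

# A Poisson objective, so the prediction is an expected goal count.
home_model = xgb.XGBRegressor(objective="count:poisson", n_estimators=500,
                              max_depth=4, learning_rate=0.05)
home_model.fit(games[features], games["home_goals"])

# The away-goals model is trained the same way with the mirrored features.
```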

[Table: MLSPlayoffsTable_PHIPOR_2020.PNG, goal-count probabilities for a potential Portland-at-Philadelphia matchup]

For any given game, say that potential Portland-at-Philly matchup, we take the model's expected goal prediction for each team and create two Poisson distributions of outcomes. We specifically use the Poisson distribution to assign a probability to each team scoring 0, 1, 2, 3, etc. goals in the game, based on their respective mean predictions. I've included those probabilities for Portland and Philly in the table above. At this point, we have probabilities of specific score lines for every possible matchup in MLS. The only thing left to do is simulate it. In that example, a 2-1 Philly win would have a 0.270 x 0.361 = 9.7% chance of occurrence, while a 2-1 Portland win would have just a 0.219 x 0.258 = 5.7% chance of occurrence. Makes sense, as Philly is the better team, and they're playing at home.
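If you want to reproduce that arithmetic, here's a short scipy sketch. The two means are roughly what the quoted scoring probabilities imply for Philly at home and Portland away, not the model's exact outputs.

```python
import numpy as np
from scipy.stats import poisson

lam_philly, lam_portland = 2.1, 1.2       # approximate means implied by the table

goals = np.arange(0, 7)                   # 0 through 6 goals
p_philly = poisson.pmf(goals, lam_philly)
p_portland = poisson.pmf(goals, lam_portland)

# Independence assumption: each scoreline's probability is a simple product.
score_matrix = np.outer(p_philly, p_portland)  # rows: Philly goals, cols: Portland goals

p_2_1_philly = score_matrix[2, 1]              # ~0.270 x 0.361 = 9.7%
p_philly_win = np.tril(score_matrix, -1).sum()
p_draw = np.trace(score_matrix)
p_portland_win = np.triu(score_matrix, 1).sum()
# (Truncating at 6 goals per team loses only a negligible tail.)
```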

Because MLS has changed playoff formats far too often, I feel like I write a new simulation every year. But in reality, the basic structure of the playoffs, and thus the simulation, is the same. Rounds are either knockout or home-and-home, and the structure is a typical bracket style where the path is determined ahead of time (as opposed to UEFA Champions League, where they randomly draw the rounds of 16, 8, and 4). Using the goal-scoring outcome probabilities described above, the simulation basically determines the (simulated) outcomes based on a random number generator.

Let’s stop for a minute and answer this question: why simulate? Well, even if we simplify this year’s playoffs to a 16-team, single-elimination bracket, ignoring the play-in games, there are 2^15 = 32,768 distinct outcomes this tournament could realize. With the play-in games, there are 131,072. There are some shortcuts that would allow us to derive “exact” probabilities from the Poisson models, rather than simulated ones, but the computational cost of 10,000 or even 100,000 simulated tournaments is virtually nothing, maybe letting the computer run for an hour at most. I chose 10,000 because the maximum standard error of any given probability is 0.5%, which is like saying we’re reasonably confident the simulation matches those “exact” probabilities to within a percent.
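For the record, that 0.5% is just the worst-case standard error of a proportion estimated from 10,000 independent simulations:

```python
# Standard error of a simulated probability is sqrt(p * (1 - p) / n);
# the worst case is p = 0.5.
n_sims = 10_000
se_max = (0.5 * 0.5 / n_sims) ** 0.5   # = 0.005, i.e. 0.5%
```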

Back to how the simulation is programmed. If a team has a 36.1% chance to score one goal against a particular opponent, then across many simulations it will score one goal about 36.1% of the time; the greater the number of simulations, the closer the observed frequency gets to 36.1%. This is meant to model the randomness and volatility of actual soccer results, which, incidentally, are what make the playoffs fun to watch. The simulation compares the simulated goals scored of the home and away teams, and allows the simulated winner to advance and meet another simulated winner. We repeat that process until we have a champion. We record each simulated result along the way to know which teams made it how far, and then we simulate the whole tournament again, 9,999 additional times. Because we record the simulated results of each game across all 10,000 simulations, we just add up the number of times each thing happened. Philadelphia won the tournament about 2,380 times out of 10,000 simulations, or 23.8%.
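Here's a stripped-down sketch of that game-level step in numpy; it's the idea, not our production simulation, and the two means are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2020)

def simulate_game(lam_home, lam_away):
    """Draw one scoreline from the two Poisson means; the bracket code would
    re-run this (or add extra time and penalties) when a knockout game is tied."""
    return rng.poisson(lam_home), rng.poisson(lam_away)

# Example: 10,000 simulations of one knockout matchup with hypothetical means
lam_home, lam_away = 2.1, 1.2
wins = sum(h > a for h, a in (simulate_game(lam_home, lam_away) for _ in range(10_000)))
print(wins / 10_000)   # share of simulations the home team wins in regulation
```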

Home-and-homes, as well as the potential for extra time, are good reasons for our Poisson approach. The Poisson distribution can easily be scaled for a 180-minute home-and-home marathon or 30 combined minutes of extra time. Again, teams don’t play at exactly the same pace in a 30-minute extra time period as they do across 90 minutes of regulation, but the assumption is probably close enough to reality. It still gives the better team an advantage in extra time, for example, but not as much of an advantage as it had across the first 90 minutes. Additionally, because goals scored and allowed play a role in breaking seeding ties, the Poisson approach is convenient for simulating the regular season as well, since it allows for ties and records simulated goal totals rather than just wins and losses.
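The scaling itself is one line of arithmetic; here's a sketch with made-up means (the sum of independent Poissons is itself Poisson, which is what makes the two-leg aggregate easy):

```python
from scipy.stats import poisson

# Hypothetical per-90 expected goals for one team: in its home leg and its away leg
lam_home_leg, lam_away_leg = 1.6, 1.2

# Two-leg (180-minute) aggregate goals for that team: Poisson with the means added
lam_aggregate = lam_home_leg + lam_away_leg

# Extra time is treated as 30/90 of a game at the same per-90 rate
lam_extra = 1.4 * (30 / 90)                     # hypothetical per-90 mean of 1.4
p_scores_in_extra_time = 1 - poisson.pmf(0, lam_extra)
```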

The ASA model (results at the top) likes Philadelphia, Sporting KC, and Toronto noticeably more than do the bettors at Bovada. But before you go putting down big bucks on these teams, it’s worth reading about why our model is wrong.

So why is our model wrong? Other than some of the reasons I’ve mentioned above…

First, our model does not have a good feel for the magnitude of home-field advantage this year. In years past, the average home goal differential was about +0.60, with a statistical margin of error of only about 0.07 goals (95% confidence). So we’re pretty confident we know what home-field advantage looks like in MLS in normal years. This season, in games played in home stadiums, the average home goal differential was about +0.5, but with a margin of error of 0.25 goals. That means that true home-field advantage, the theoretical advantage that will be there underlying the playoffs, could reasonably be as low as +0.25 goals. That’s a huge difference from +0.60, and it largely helps to explain why our model is so much higher than Bovada on some of those top teams in each conference. Our model weights prior seasons against current seasons, and basically assumes about a +0.55 advantage for home teams. In aggregate, bettors are almost certainly assuming a lower home-field advantage than that.
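That margin-of-error comparison is just a mean and a 95% confidence interval on per-game home goal differential; here's a sketch of the calculation, with a placeholder data file:

```python
import numpy as np
import pandas as pd

games = pd.read_csv("mls_2020_home_stadium_games.csv")   # placeholder path
diff = games["home_goals"] - games["away_goals"]

mean_hfa = diff.mean()
se = diff.std(ddof=1) / np.sqrt(len(diff))
ci_low, ci_high = mean_hfa - 1.96 * se, mean_hfa + 1.96 * se
# 2020's small in-stadium sample gives roughly 0.5 +/- 0.25; full normal
# seasons pin it down to roughly 0.60 +/- 0.07.
```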

Another shortcoming of the model is that we didn’t take strength of schedule into account in this version. The schedule this year pitted many teams against each other far too many times. New England, for example, played Philadelphia four times. Using a pretty simple strength of schedule rating, in which I calculated each team’s opponents’ average g+ in games not played against that team, New England played the toughest schedule. I want to explore and test different strength of schedule metrics before building them into the playoffs simulation.
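For the curious, here's a sketch of that simple rating: for each team, average its opponents' per-game g+, where each opponent's g+ is measured only in games not played against that team. Column names and the file path are hypothetical.

```python
import pandas as pd

# One row per team per game: team, opponent, and the team's g+ in that game
tg = pd.read_csv("team_games_2020.csv")   # placeholder path

def strength_of_schedule(team: str) -> float:
    """Average opponents' g+ per game, excluding their games against `team`.
    Opponents faced multiple times count once per meeting."""
    ratings = []
    for opp in tg.loc[tg["team"] == team, "opponent"]:
        not_vs_team = (tg["team"] == opp) & (tg["opponent"] != team)
        ratings.append(tg.loc[not_vs_team, "gplus"].mean())
    return sum(ratings) / len(ratings)

sos = {team: strength_of_schedule(team) for team in tg["team"].unique()}
# Higher values indicate a tougher schedule.
```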

This model only uses various summations of goals added (g+). It’s likely that including information about expected goals (xG) earned and actual goals scored could improve the model by “finding blind spots in g+”, as ASA contributor @JmooreQuakes put it. A team that accrues a lot of g+, but doesn’t get good shots off, might actually lack the right personnel to consistently finish off their penetrating possessions. Furthermore, we could use combinations of g+ components and zones to better predict future goal scoring output, such as “receiving value accumulated per 96 in zone 23” (check out team g+ by zones and components here). Why didn’t we include these additional drivers? Basically that takes time, and 2020 isn’t worth it. We’re looking forward to a lot of improvements to these models in 2021.

Finally, similar to the strength of schedule issue, we did not explicitly take into account whether teams were at full strength themselves (or whether they will be for the playoffs). The key drivers in the model are team-level g+ metrics, rather than a sum of player-level g+ metrics for those players available on game day. LAFC, for example, played more than 75% of their minutes without Carlos Vela, who has averaged 0.5 g+ above replacement per game since 2019. That half-a-goal per game could be enough to explain such a large discrepancy in LAFC’s chances between our model and Bovada’s sportsbook.

In addition to the admitted shortcomings of the model, probability projections often just seem bad. The underlying issue is the binary nature of playoff outcomes. A team wins or it loses; it doesn’t win 50% of a game or 24% of a championship. MLS playoffs specifically are quite volatile, due to the nature of the sport as well as all the single-elimination rounds. In a sport where a one-goal differential can win the game, the worse team finds a way to win a single game relatively often. In past seasons, some of the most favorable weekly predictions from our model only gave the favorite team a 65% or 70% probability to win at home. It’s not hard to find examples this year when a heavy underdog won. On September 27, San Jose played at LAFC as a 1.0-goal underdog according to our model, and of course they won by a goal.

That two-goal shift from the expected differential to the actual differential in the LAFC-SJ game is well within the normal variance of a soccer game. If, for example, LAFC were to host SJE in the playoffs (it would be in the Western Conference Finals, by the way), our model suggests that LAFC has a 2% chance to win by more than 5, and SJE has a 2% chance to win by more than 2. In effect, the 95% prediction interval covers a 7-goal range of outcomes. You can see something similar in the goal probabilities from the potential Portland-Philly matchup shown earlier. Soccer is wild.
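That interval falls straight out of the Poisson setup: the difference of two independent Poisson goal counts follows a Skellam distribution. Here's a sketch with hypothetical means for an LAFC-hosting-San Jose game, not the model's actual predictions.

```python
from scipy.stats import skellam

# Hypothetical expected goals: LAFC at home vs. San Jose away
lam_lafc, lam_sj = 2.2, 1.2

# Goal differential (LAFC minus SJ) of two independent Poissons is Skellam-distributed
p_lafc_by_6_plus = 1 - skellam.cdf(5, lam_lafc, lam_sj)    # LAFC wins by more than 5
p_sj_by_3_plus = skellam.cdf(-3, lam_lafc, lam_sj)         # SJ wins by more than 2
low, high = skellam.ppf([0.025, 0.975], lam_lafc, lam_sj)  # ~95% interval on the margin
```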

So, yeah, the favorite team in the tournament is not going to be very favored. And that often leads to models looking very “wrong” to casual fans. We’ve told you that Philadelphia is favored to win MLS Cup—favored in that they have the highest probability of any team to do so—but you would take the field in an even odds bet every day of the week. Welcome to the MLS Cup Playoffs.