We Have A New Win Probability Model

By Tyler Richardett

Across recent weeks, we’ve set out to improve the performance of our in-game win probability model, while: 

  1. starting to take each team’s strength into account, based on its performance in prior games; and 

  2. introducing more fluctuation between goal-scoring events, to better reflect teams’ chance creation throughout the game.

In this article, we’ll cover our methods for accomplishing those goals, how we plan to use this new and improved model, and how deconstructing that model can teach us more about the conditions under which goal-scoring events occur.

How the model works

Intuition tells us that at any given moment during a game, some of the greatest factors influencing the outcome include: the amount of time remaining, the current scoreline, each team’s relative strength, and home-field advantage. However, prior work on this subject tells us that time has an outsized impact, resulting in non-linear dependencies among the other variables. (Early on, to predict college basketball results, independent logistic regression models were trained for each time interval, and their predictions strung together chronologically.)

In soccer, draws add another layer of complexity. Previously, we built and maintained two different sets of models: one to predict wins and the other to predict draws.

Our new approach collapses these into a single model, with a few key distinctions. Full credit for this framework goes to our friends at KU Leuven, whose work is available here.

First, rather than modeling the outcome at each time interval, we’re instead trying to predict the probability of each team scoring a goal in each of the following time intervals. Second, we’re using those predicted likelihoods to run 10,000 simulations of goal-scoring activity over the remainder of the game. Those simulated goals are combined with any goals already scored, and the resulting scorelines give us our win, draw, and loss probabilities. Finally, to handle the variable length of games (including stoppage time), one time frame represents one percent of the game played, relative to that game’s total length. For the sake of simplicity, we’ll assume for the remainder of this article that each time frame corresponds to about one minute of gameplay.
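As a rough illustration of that normalization, here is a small Python sketch. The function name `time_frame`, the use of seconds as the unit, and the convention that kickoff falls in frame 1 are all assumptions made for this example; the article does not say how frame boundaries are actually handled.

```python
import math


def time_frame(elapsed_seconds, total_seconds, n_frames=100):
    """Map a moment in a game to a 1-based time frame (hypothetical helper).

    Each frame covers 1/n_frames of that game's total length, stoppage
    time included, so games of different lengths share the same scale.
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= elapsed_seconds <= total_seconds:
        raise ValueError("elapsed_seconds must lie within the game")
    # Frame k covers the interval ((k - 1) / n, k / n] of the game;
    # kickoff itself (elapsed == 0) is assigned to frame 1 by assumption.
    return max(1, math.ceil(elapsed_seconds * n_frames / total_seconds))


if __name__ == "__main__":
    # A 98-minute game (with stoppage time) and a 95-minute game both
    # reach frame 50 at their own halfway points.
    print(time_frame(49 * 60, 98 * 60))
    print(time_frame(2850, 5700))
```

With 100 frames and games that run a little over 90 minutes, each frame lasts just under a minute, which is why treating a frame as "about one minute" is a reasonable shorthand.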

To provide an illustrative example, let’s say FC Tucson leads the Richmond Kickers 2–1 with 40 minutes of play remaining. And based on a number of features detailed below, our model tells us that Tucson has a constant 0.7% probability of scoring across each of those 40 minutes, and that Richmond has a 3.8% probability. So, we flip a weighted coin for 40 trials over 10,000 simulated games, tally those respective results, and find that Tucson wins 2,394 times, the two teams draw another 2,701 times, and Richmond wins 4,905 times. In turn, we predict that the Kickers have a 49% chance of overcoming the 2–1 deficit and winning the game.
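The coin-flipping step can be sketched in a few lines of Python. This is an illustration written for this article, not the model's actual code: the function name `simulate_remaining` and its parameters are hypothetical, and it assumes constant, independent per-minute scoring probabilities with at most one goal per team per minute. Because of those simplifications and the random seed, its tallies should not be expected to reproduce the exact figures quoted above.

```python
import random


def simulate_remaining(goals_a, goals_b, p_a, p_b, minutes_left,
                       n_sims=10_000, seed=None):
    """Estimate win/draw/loss probabilities for team A by simulation.

    goals_a, goals_b: goals already scored by each team.
    p_a, p_b: assumed constant per-minute scoring probabilities.
    """
    rng = random.Random(seed)
    wins_a = draws = wins_b = 0
    for _ in range(n_sims):
        # One weighted coin flip per team per remaining minute.
        final_a = goals_a + sum(rng.random() < p_a for _ in range(minutes_left))
        final_b = goals_b + sum(rng.random() < p_b for _ in range(minutes_left))
        if final_a > final_b:
            wins_a += 1
        elif final_a == final_b:
            draws += 1
        else:
            wins_b += 1
    return {
        "a_win": wins_a / n_sims,
        "draw": draws / n_sims,
        "b_win": wins_b / n_sims,
    }


if __name__ == "__main__":
    # Team A leads 2-1 with 40 minutes left, using the rates from the example.
    print(simulate_remaining(2, 1, 0.007, 0.038, 40, seed=42))
```

In the real model the scoring probabilities are re-predicted for each time frame rather than held constant, so this sketch captures only the simulation mechanics.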

This framework accounts for two of the aforementioned factors — the amount of time remaining and the current scoreline — in a heavy-handed, yet effective way. It’s also worth noting this approach carries the assumption that each team can only score up to one goal per minute. And although it’s highly unlikely a single team will score two or more goals in a single minute — we only encountered this once among more than 1.5 million observations, or 0.00006% — it’s not impossible.

As for the model itself, that’s trained using a popular gradient boosting algorithm and contains the following features:

  • Game state: Score differential, player differential, number of goals scored, number of yellow and red cards issued

  • Team attributes: Home team and expansion season indicators

  • Team strengths: Offensive and defensive measures of strength for each team and its opponent; calculated using goals added (g+) for and against over a set of recent games

  • Chance creation: Difference in offensive g+ earned in recent minutes
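The acknowledgments mention exponential moving averages, so one plausible way to summarize g+ "over a set of recent games" is to weight recent games more heavily than older ones. The sketch below is a guess at the general idea, not the actual feature pipeline: the function name, the `span` parameter, the smoothing convention `alpha = 2 / (span + 1)`, and the sample g+ values are all assumptions made for illustration.

```python
def exponential_moving_average(values, span):
    """Return the running exponential moving average of a sequence.

    Uses the common convention alpha = 2 / (span + 1) and seeds the
    average with the first value. Both choices are assumptions here.
    """
    if span < 1:
        raise ValueError("span must be at least 1")
    alpha = 2 / (span + 1)
    averages = []
    for value in values:
        if not averages:
            averages.append(value)
        else:
            # Blend the newest value with the previous running average.
            averages.append(alpha * value + (1 - alpha) * averages[-1])
    return averages


if __name__ == "__main__":
    # Made-up per-game offensive g+ totals for one team, oldest first.
    recent_gplus = [0.8, 1.4, 0.3, 1.1, 1.9]
    strength = exponential_moving_average(recent_gplus, span=3)
    print(strength[-1])  # latest value would serve as the strength feature
```

A smaller `span` makes the measure react faster to the most recent games, while a larger one smooths out single-game noise.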

How the model performs

Borrowing once more from our Benelux-based peers, we evaluated our model using a measure known as ranked probability score, or RPS. Its calculation is as follows:

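$$RPS = \frac{1}{r-1} \sum_{i=1}^{r} \left( \sum_{j=1}^{i} p_j - \sum_{j=1}^{i} e_j \right)^2$$

Here, \(r\) is the number of possible outcomes (three in our case: win, draw, and loss, in that order), \(p_j\) is the predicted probability of the \(j\)th outcome, and \(e_j\) is 1 if that outcome occurred and 0 otherwise.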

In simpler terms, RPS operates as a loss function: Values closer to zero are favorable, and those closer to one are not. When compared against a measure such as prediction accuracy, RPS has two distinct advantages. First, it considers the ordinal nature of possible outcomes. In other words, predicting a win and observing a draw is penalized less than predicting a win and observing a loss. Second, RPS considers the actual probabilities predicted for each outcome, whereas accuracy only considers which outcome has the greatest predicted likelihood. For instance, if the model declared that the losing team would win 45% of the time, RPS would penalize that incorrect prediction less harshly than, say, a 60% prediction. By contrast, prediction accuracy would penalize both the same — as simply an incorrect prediction.
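For readers who prefer code to notation, here is a minimal Python sketch of the formula. We relied on the {verification} package in R for the real calculation, so the function name `ranked_probability_score` and the example probabilities below are illustrative assumptions only. Outcomes are assumed to be ordered as win, draw, loss.

```python
from itertools import accumulate


def ranked_probability_score(predicted, observed):
    """Compute RPS for one prediction over ordered outcomes.

    predicted: probabilities for each outcome, summing to 1.
    observed: 1 for the outcome that occurred, 0 for the others.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    r = len(predicted)
    if r < 2:
        raise ValueError("need at least two possible outcomes")
    # Compare cumulative predicted and cumulative observed distributions.
    cumulative_p = accumulate(predicted)
    cumulative_e = accumulate(observed)
    squared_gaps = ((p - e) ** 2 for p, e in zip(cumulative_p, cumulative_e))
    return sum(squared_gaps) / (r - 1)


if __name__ == "__main__":
    # Ordinal penalty: a confident "win" call is punished less when the
    # result is a draw than when it is a loss.
    print(ranked_probability_score([1, 0, 0], [0, 1, 0]))
    print(ranked_probability_score([1, 0, 0], [0, 0, 1]))

    # Hypothetical forecasts for a team that went on to lose.
    mild = ranked_probability_score([0.45, 0.25, 0.30], [0, 0, 1])
    bold = ranked_probability_score([0.60, 0.20, 0.20], [0, 0, 1])
    print(mild < bold)
```

The last comparison mirrors the 45% versus 60% example: both forecasts pick the wrong winner, but the less confident one earns the lower (better) score.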

For side-by-side comparisons, both measures are included in the graphs below.

The following section covers this in greater detail, but at first glance it was clear that the shift in objective had the greatest positive impact, with the newly added measures of team strength a close second.

As the figure heading indicates, this new model produced a 15% performance improvement toward the beginning of games, with diminishing returns over time. (That latter effect is to be expected: Outcomes become more and more likely to hold constant as time expires.) 

What’s more, we can boast a 56% pre-game prediction accuracy rate. That’s pretty impressive in any sporting context, and all the more so for low-scoring games played in leagues with a relatively high amount of parity.

What the model tells us about goal-scoring events

Taking advantage of common model interpretability techniques, we can peel back the curtain and observe some of the patterns our model picked up on. Here, we’ll use SHAP, which is a self-described “game theoretic approach to explain the output of any machine learning model.”

The core idea behind SHAP is that feature values must together have additive effects on an individual observation’s prediction, in much the same way you might interpret the global coefficients produced by a simple regression model. To estimate these effects, the algorithm iterates through each feature, comparing predictions across the possible combinations of the remaining features, with and without the \(i\)th feature.
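To make the additive idea concrete, the sketch below computes exact Shapley values by brute force for a tiny, made-up model. It is not how the SHAP library works internally (tree models use a much faster specialized algorithm, and "absent" features are normally averaged over background data rather than set to one baseline). The toy model, its coefficients, and the function names are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one observation, by enumerating subsets.

    Features outside a subset are replaced by their baseline value.
    """
    n = len(x)

    def value(present):
        inputs = [x[i] if i in present else baseline[i] for i in range(n)]
        return predict(inputs)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                # Marginal effect of adding feature i to this subset.
                total += weight * (value(present | {i}) - value(present))
        contributions.append(total)
    return contributions


def toy_model(features):
    """Made-up per-minute scoring likelihood, for illustration only."""
    score_diff, player_diff, is_home = features
    return (0.012 - 0.002 * score_diff + 0.004 * player_diff
            + 0.003 * is_home + 0.001 * player_diff * is_home)


if __name__ == "__main__":
    x = [1, 1, 1]         # leading by one, up a player, at home
    baseline = [0, 0, 0]  # level score, even numbers, away
    phi = shapley_values(toy_model, x, baseline)
    # Additivity: contributions sum to the gap between the two predictions.
    print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

The final line shows the property that makes these values easy to read: the per-feature contributions add up to the difference between this observation's prediction and the baseline prediction.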

Below, we’ve taken 10,000 observations at random and applied this algorithm. The SHAP values along the x-axis represent the positive, negative, or neutral effect each feature value had on the prediction (goal-scoring likelihood) for the respective observation. And the blue-red color scale represents the relative values of the features used to make the prediction. For example, if you look at the player difference feature toward the center, the lightest blue represents occurrences in which that team had a two-person deficit, whereas the deepest red represents occurrences in which that team had a two-person advantage.

As we surmised, team measures of strength had the greatest importance in this model. This tracks with an earlier finding that prior goals added (g+) tallies are a good predictor of future success. Beyond that, this exercise turned up a number of expected, and thus reassuring, patterns:

  • Generally, the greater the lead a winning team is enjoying, the less likely they are to push for more goals.

  • Home-field advantage yields more goals.

  • Extended spells with more valuable possessions than the opponent increase a team’s chances of scoring.

  • Having a greater number of players on the field gives a team a greater chance of scoring.

Disappointingly, there were no distinct, league-specific effects that we could find.

How we plan to use this model

This newer model and framework replace the current methodology we use to calculate the points added (PA) and expected points added (xPA) values in our xG tables. In the not-quite-near future, we’ll use this to predict outcomes for the coming week’s slate of games and add that feature back into the app. And in the not-too-distant future, we’d like to build out game-level profiles in the app, of which in-game win probability charts (like the ones below) will be a part.
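As a loose illustration of how win, draw, and loss probabilities translate into points, the sketch below converts them into expected points under the standard three-for-a-win, one-for-a-draw system. The exact definitions of PA and xPA are not given in this article, so treating "points added" as a change in expected points is an assumption, as is the function name.

```python
def expected_points(p_win, p_draw, win_points=3, draw_points=1):
    """Expected league points implied by win and draw probabilities."""
    if not (0 <= p_win <= 1 and 0 <= p_draw <= 1 and p_win + p_draw <= 1 + 1e-9):
        raise ValueError("probabilities must be valid and sum to at most 1")
    # A loss is worth zero points, so it does not appear in the sum.
    return win_points * p_win + draw_points * p_draw


if __name__ == "__main__":
    # Tallies from the FC Tucson vs. Richmond Kickers example above:
    # Richmond wins 4,905 and Tucson wins 2,394 of 10,000 simulations,
    # with 2,701 draws.
    richmond = expected_points(0.4905, 0.2701)
    tucson = expected_points(0.2394, 0.2701)
    print(round(richmond, 4), round(tucson, 4))
```

Under that assumption, comparing expected points just before and just after an event such as a goal would give the points that event added.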

Beyond these more practical applications, this methodology can be used to take a trip down memory lane — identifying some of the most memorable games and/or unexpected outcomes from our data set.

Using a couple of quick-and-dirty measures for volatility, this high-scoring, back-and-forth affair between D.C. United and Real Salt Lake in 2015 (unsurprisingly) stood out:

And three goals in quick succession from Reno 1868 (RIP) in the second half of their 2018 matchup with Sacramento Republic led to one of the more noteworthy turnarounds:

Finally, with the Chicago Red Stars coming off a three-game losing skid in 2016, the model expected them to again drop points against newcomers Orlando Pride, only to be proven wrong by a decisive Taylor Comeau goal, delivered in just her second start for Chicago:

Acknowledgments

  1. It’s worth stating once more that credit for reframing these in-game win probability models goes to the brilliant minds at KU Leuven. Their work on this topic is available here. They also regularly share findings from other ongoing projects on their blog.

  2. I’ve almost always got Brad Boehmke and Brandon Greenwell’s Hands-On Machine Learning with R book open in one tab and Christoph Molnar’s Interpretable Machine Learning book in another. Both are excellent resources, independent of skill level. Boehmke and Greenwell cover the type of gradient boosting used here in Chapter 12, and Molnar covers SHAP values (among other model-agnostic interpretability methods) in Chapter 5.

  3. For an 18-minute introduction to SHAP values, I’d recommend this 2019 presentation from one of the paper’s authors. Although the framework and its complementary parts are maintained primarily as a Python library, they’re pretty well-integrated into later versions of the xgboost library in R. I adapted this source code, plus this revamped version from Pablo Casas, to create the beeswarm plots above.

  4. Thanks to the authors and maintainers of the {QuantTools} package for their handy method for calculating exponential moving averages. And thanks to the authors and maintainers of the {verification} package for their handy method for calculating ranked probability scores (RPS).