By Matthias Kullowatz (@mattyanselmo)
In the offseason we upgraded our passing model, and its outputs are now featured in our xPassing tables (both interactive and static). After a few minor tweaks this week, now is as good a time as any to explain how it works.
Much like our Expected Goals (xG) models, the purpose of this model is to estimate the probability of success. Only, in this case, a success is a pass that is completed rather than a shot that is scored. For example, if Player A is passing the ball towards Player B, we can assign a likelihood of that pass being completed. We do this based on a variety of factors, such as the circumstances and player position on the field, pass type, and the direction of the pass. And in this case, we opted to use a gradient-boosted ensemble of decision trees (GBM) rather than a logistic regression model (GLM).
There are two key advantages to using a GBM. First, it can naturally model non-linear relationships. An easy example of such a relationship is the one between the angle of a pass and its probability of success. Success probabilities are about average for lateral passes to the right or left (+90 or -90 degrees in the plot at right), higher for backward passes (+180 or -180 degrees), and lower for forward passes (0 degrees).
The second advantage is the GBM’s ability to naturally model interactions between predictor (or “independent”) variables. A simple example of that would be the interaction between the x- and y-coordinates of the origin of a pass. The plot below shows that it’s generally harder to complete passes along the sideline, but it’s especially hard to do so deep in the opponent’s third.
Passing angle and the coordinates of the origin are certainly important variables, but there are many others that inform the model on the probability of a successful pass. I have reviewed each of them for the interested reader below, often along with relevant odds ratios. An odds ratio is one measure of how much more or less likely a pass is to be completed, all else held equal. I have presented odds ratios below such that they are all greater than 1.0 so that the magnitudes of the various effects are more easily comparable.
I have lumped a number of indicators about a pass into “pass type.” These are included in the table with their approximate corresponding odds ratios. Odds ratios should be seen as a simplification of the model, and not the full story of what the model is doing. For example, most corners are crosses. The odds ratio for a corner that is crossed, however, is not 1.5 x 5.75 = 8.625, as the table below would suggest. In fact, a corner kick cross is more likely to be completed than a cross from near the corner flag in the run of play.
Circumstances and Player Position
Circumstances include whether the passing team was playing at home, the player differential between the passing team and defending team (due to red cards), and whether the pass is a kickoff (which is virtually always completed). These effects should be fairly intuitive. What’s not intuitive is how to deal with player position. Though knowing that a player is a midfielder or a forward does help the model to better estimate the probability of a successful pass, there are caveats that led to our decision to lump all field players together and all goalkeepers together.
First, do we even want the model to consider position? If so, it would mean that the model would predict different probabilities of success for the exact same pass attempt just because one of the players nominally plays a different position. We prefer to summarize by position after fitting the model, not in the model itself.
Second, determining player position is not straightforward. If we want to use this model on new data going forward, then we would need to establish rules for how many games or minutes played at a certain position it takes before we label a player. And in the meantime, how do we label players while we’re waiting for them to accrue enough service time?
These are surmountable obstacles, but I’m not wholly convinced the model would be more useful if we spent time working position in. The odds ratios are to the right.
Passing distance is obviously a key determinant of whether a pass will be successful. The problem is this: for most incomplete passes we don’t get a very exact estimate for the distance the player was trying to pass the ball. The end position of the ball in our dataset is where it ended up, whether that was at the feet of a friend or foe, or in the second deck.
If a long ball is blocked, for example, it will look like a very short attempt in the dataset, and the model will score it with a high probability of being completed. Whenever biases in a predictor variable are associated with what you’re trying to predict, it’s not a good idea to use that variable. Fortunately, we have a long ball indicator. This indicator was presumably recorded based on the passer’s intent, and is therefore a valid predictor variable. The odds of a long ball being completed are about 4.00 times lower than those of a short ball.
Like player position, gamestate plays a role in pass completion rates. And for similar reasons, we’re choosing not to model it. We’d rather assess passing performance within gamestate segments after the model is fit than infuse its effect into the model. This matches how we built the xGoals models, too, for consistency.
Quick Primer on Odds and Odds Ratios
Odds are not the same as probability in this context, though colloquially they can be used interchangeably. Odds are technically equal to probability/(1 – probability). So here’s how you apply an odds ratio.
Suppose the typical pass is completed at a 75% success rate (it is, approximately). Then the baseline odds are 75/25 = 3.
Suppose a player is attempting a pass from a fairly typical location, say middle of the field, but he’s trying to play a long ball into the corner. The odds ratio is about 4.00 on long balls (in the less likely direction).
Take the odds of a typical pass and divide by the odds ratio to get 3/4 = 0.75. We divide because the effect is a lower probability. These are the approximate odds of completing such a long ball.
Convert back to probability. Probability = Odds/(1 + Odds), or 0.75/1.75 = 43%.
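The four steps above can be sketched in a few lines of Python. This is only a minimal illustration of the arithmetic, not the code behind our tables; the function names (`prob_to_odds`, `odds_to_prob`, `apply_odds_ratio`) and the `less_likely` flag are made up for this example.

```python
def prob_to_odds(p):
    """Convert a probability (0 <= p < 1) to odds."""
    return p / (1.0 - p)


def odds_to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)


def apply_odds_ratio(baseline_prob, odds_ratio, less_likely=True):
    """Shift a baseline probability by an odds ratio quoted as >= 1.0.

    Because the ratios are all presented as greater than 1.0, the caller
    supplies the direction: divide the odds when the effect makes
    completion less likely, multiply when it makes completion more likely.
    """
    odds = prob_to_odds(baseline_prob)
    odds = odds / odds_ratio if less_likely else odds * odds_ratio
    return odds_to_prob(odds)


# Typical pass at 75%, long ball odds ratio of about 4.00 (less likely).
long_ball_prob = apply_odds_ratio(0.75, 4.0, less_likely=True)
print(f"{long_ball_prob:.0%}")  # 43%
```

Keep in mind the earlier caveat: odds ratios are a simplification of the GBM, so chaining several of them together this way will not necessarily reproduce what the model predicts.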
I won’t show validation across every possible model variable, but I’ve selected three that I think are most interesting. This validation is done on a holdout dataset, meaning that the model was fit to 2015-2016 data, used to score passes for 2017-2018 (through April 15), and then those scored probabilities were compared to actual successful proportions in a few segments of the data.
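For readers who want to try this kind of check themselves, here is a rough sketch of the comparison: group the holdout passes by some segment, then subtract the average predicted probability from the actual completion rate. The field names (`zone`, `completed`, `xpass`) and the four rows of data are invented for illustration and are not from our dataset.

```python
from collections import defaultdict


def actual_minus_expected(passes, segment_key):
    """Summarize scored passes by segment.

    Returns {segment: (attempts, actual_rate, expected_rate, difference)},
    where difference = actual_rate - expected_rate.
    """
    # Per segment: [attempts, completions, sum of predicted probabilities]
    totals = defaultdict(lambda: [0, 0, 0.0])
    for p in passes:
        t = totals[p[segment_key]]
        t[0] += 1
        t[1] += p["completed"]
        t[2] += p["xpass"]

    summary = {}
    for segment, (n, completed, expected) in totals.items():
        actual_rate = completed / n
        expected_rate = expected / n
        summary[segment] = (n, actual_rate, expected_rate,
                            actual_rate - expected_rate)
    return summary


# Made-up holdout rows, for illustration only.
holdout = [
    {"zone": "A", "completed": 1, "xpass": 0.80},
    {"zone": "A", "completed": 0, "xpass": 0.60},
    {"zone": "B", "completed": 1, "xpass": 0.90},
    {"zone": "B", "completed": 1, "xpass": 0.70},
]
summary = actual_minus_expected(holdout, "zone")
for zone, (n, actual, expected, diff) in sorted(summary.items()):
    print(zone, n, f"{actual:.3f}", f"{expected:.3f}", f"{diff:+.3f}")
```

The same function works for any segment you can attach to a pass, such as player position or gamestate, by changing `segment_key`.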
First, here are the actual completion rates minus the estimated rates by zone of the field:
Red zones show the greatest absolute deviation from the passing model. The largest raw error is deep in the attacking half, left of center, but even that is only a 1.0% difference. Most of the zones show a difference of less than 0.5%. The most statistically significant difference is in the zone just on the defensive side of the halfway line, right of center, where players were expected to complete 84.8% of their 38,105 pass attempts but have actually completed 85.6%. The model isn't so far from actuality that we should want to go do it all over again, and in practice we fit the model all the way through 2017 to score passes in 2018, so there is even more data on which to tune it.
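To get a feel for why a difference of less than one percentage point can still be statistically significant, here is a back-of-the-envelope z-score for that zone. It assumes the passes are independent and treats the expected rate as a single binomial proportion, which is a simplification and not necessarily the test used for the plot. The function name `zone_z_score` is made up for this sketch, and the inputs are the rounded figures quoted above, so the result is approximate.

```python
import math


def zone_z_score(attempts, expected_rate, actual_rate):
    """Approximate z-score for actual vs. expected completion rate.

    Assumes independent passes and a binomial standard error based on
    the expected rate (a simplifying assumption for illustration).
    """
    standard_error = math.sqrt(expected_rate * (1.0 - expected_rate) / attempts)
    return (actual_rate - expected_rate) / standard_error


# Figures quoted in the text: 38,105 attempts, 84.8% expected, 85.6% actual.
z = zone_z_score(38105, 0.848, 0.856)
print(round(z, 2))
```

Under these assumptions the z-score comes out a little above 4. With tens of thousands of attempts in a zone the standard error is tiny, so even a 0.8-point gap stands out statistically while remaining small in practical terms.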
The other two variables I wanted to look at were the ones most notably missing from the model: player position and gamestate. We purposefully left these out of the model for reasons given above, but it’s interesting to see how we should adjust our expectations across these two variables. Below I show actual completion rates minus estimated rates by player position:
Players who have made the majority of their passes from a forward position (F), which is how we define a player’s position, tend to underperform the model on average. In other words, a typical forward will complete about 5% fewer passes than the model suggests. On the flip side, an average central defender (D) should complete about 3% more of his passes than the model expects. These discrepancies are due to things the model doesn’t capture, such as the level of defensive pressure and the intent of the player. A forward attempting a pass in the defensive third is more likely to be trying to get the ball up the field and out of danger, whereas the majority of passes in the defensive third made by defenders aim for conservative possession retention.
We see an interesting relationship between pass completion rates and gamestate. All else equal, a team that is either losing or winning handily tends to over-perform the model’s expectations. Teams that are ahead by a lot are perhaps content to sit on the ball, and teams that are losing may face less pressure.
There are probably many other valid theories that you’ve already thought of to explain these discrepancies. But the bottom line is that you can filter by player position and third of the field in our interactive tables, so you can control for (some of) these things that are missing from the model as you sort the most efficient passers.