By Matthias Kullowatz (@mattyanselmo)
It was more than two years ago that we built the current model for determining the expected goals of each shot, so let’s go back and see how it’s doing. For those interested, I've included some R code for fitting our generalized linear model (GLM), as well as a gradient-boosted tree model (GBM) for comparison. I selected the training dataset to be shots from 2011 - 2014, and the validation dataset to be shots from 2015 and 2016. Actual and predicted goals per shot are shown across each variable of the model. Here's the punchline: our model is doing pretty well.
First, I fit the original model as seen on the ASA website. This is a logistic generalized linear model, which is designed to predict the probability of a binary outcome, in this case whether a shot becomes a goal or not. The coefficients will differ somewhat from what we posted long ago because this model is fit to a different training dataset.
[Table: GLM coefficient summary, with columns Estimate, Std. Error, z value, and Pr(>|z|)]
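To make the model form concrete, here is a minimal sketch of how a logistic GLM turns a linear predictor into a scoring probability. It is written in Python rather than the R used for the actual fitting. The coefficient values and the feature names `log_distance` and `is_penalty` are invented for illustration; they are not the fitted ASA coefficients, and the real model uses more variables.

```python
import math

def inverse_logit(eta):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients, for illustration only (NOT the fitted ASA values).
COEFS = {"intercept": 0.5, "log_distance": -1.2, "is_penalty": 1.0}

def expected_goal(distance_yards, is_penalty=False):
    """Return the modeled scoring probability for a single shot."""
    # The linear predictor lives on the log-odds scale; distance enters logged.
    eta = (COEFS["intercept"]
           + COEFS["log_distance"] * math.log(distance_yards)
           + COEFS["is_penalty"] * (1.0 if is_penalty else 0.0))
    return inverse_logit(eta)
```

With a negative coefficient on logged distance, `expected_goal(6)` comes out higher than `expected_goal(25)`, which is the direction you would expect for shot distance.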
Next, I fit the GBM, a tree-based model that isn’t constrained by a linear formula. The parameters include the number of decision trees (n.trees), the number of splits on each tree (interaction.depth), the minimum number of observations per final branch (n.minobsinnode), and the learning rate (shrinkage). I used the caret package to tune these four parameters. The most important variables in improving the fit of this model were distance and pattern of play, which makes sense, especially when you remember that pattern of play includes penalty kicks.
Here I predicted the scoring rate, or expected goals, for each shot in the entire dataset, including training and validation. I appended those columns of predictions from each model onto the original shots dataset.
In this section, I produced validation plots for each variable in the model. I show the actual goal scoring rates with a solid blue line and the predicted goal scoring rates (xGoals) with a dashed green line, with labels on the right y-axis. In the background, the grey bars show the sample size (number of shots) in each bucket, labeled on the left y-axis. The results of both the original ASA model and the GBM are shown.
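The numbers behind each plot can be sketched roughly as follows: group shots into buckets of one variable, then compute the shot count, the actual scoring rate, and the average predicted xGoals per bucket. This is an illustrative Python sketch, not the plotting function described here. The field names `distance_bucket`, `goal`, and `xg` are assumptions, and the toy rows are invented.

```python
from collections import defaultdict

def bucket_summary(shots, key):
    """Group shots by `key`; return shot count, actual goal rate, and mean xG per bucket."""
    totals = defaultdict(lambda: [0, 0, 0.0])  # [shots, goals, summed xG]
    for shot in shots:
        t = totals[shot[key]]
        t[0] += 1
        t[1] += shot["goal"]
        t[2] += shot["xg"]
    return {
        bucket: {"shots": n, "actual": goals / n, "predicted": xg_sum / n}
        for bucket, (n, goals, xg_sum) in sorted(totals.items())
    }

# Toy rows, invented for illustration (not ASA data).
toy_shots = [
    {"distance_bucket": "0-10", "goal": 1, "xg": 0.30},
    {"distance_bucket": "0-10", "goal": 0, "xg": 0.20},
    {"distance_bucket": "10-20", "goal": 0, "xg": 0.10},
    {"distance_bucket": "10-20", "goal": 0, "xg": 0.06},
]
summary = bucket_summary(toy_shots, "distance_bucket")
```

In a plot, `shots` would drive the grey bars, `actual` the solid line, and `predicted` the dashed line.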
The ASA model fit to 2011 - 2014 data predicts “future” goal scoring rates pretty well. That is, the dashed lines stay close to the solid lines across most buckets of each variable, and only diverge noticeably where very few shots were taken. Additionally, the linear model appears to keep pace with the GBM based on the graphs, which is somewhat surprising given that GBMs are known for their ability to handle non-linear relationships and complex interactions naturally. More objectively, the holdout-sample log-likelihood error is only slightly greater for the linear model (4620.1) than it is for the GBM (4614.2). In this case, it seems that the linear patterns in the log-odds of goal scoring held pretty well, and our sneaky non-linear adjustments, like logged distance and quadratic goal mouth available, helped the linear model match the fancy-pants model.
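For readers who want to compute a comparable error metric, here is a minimal Python sketch. It assumes the figure reported is the summed negative Bernoulli log-likelihood over holdout shots; that reading is an assumption, and the function name is my own. Lower is better, and a model whose probabilities sit closer to the actual outcomes scores lower.

```python
import math

def negative_log_likelihood(outcomes, probabilities, eps=1e-15):
    """Sum of -log P(outcome) over shots, where outcome is 1 (goal) or 0 (no goal)."""
    total = 0.0
    for y, p in zip(outcomes, probabilities):
        # Clip to avoid log(0) when a model predicts exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

To compare two models, call it once per model on the same holdout outcomes, passing each model's predicted probabilities.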
*I can't share code for the validation plots because I wrote that function for my company, but basically they are line graphs laid over bar plots, with some trickery to line up the axes. I got a lot of help from StackExchange in making those functions.