Do you ever find yourself yelling “JUST SHOOT THE BALL!” at the TV screen? Of course you do, you watch soccer! Sometimes it can be maddening to see your star striker make his/her way into the box, only to futz around with a pass or dribble. At times it doesn’t even matter whether that pass or dribble was successful. Does it seem like your team is particularly bad about it? You’re probably not alone.
Psychologists will be quick to point out a thing called negativity bias. Basically, we probably all think our team dilly-dallies in the box more than others because we remember it better. The existence of this bias, by the way, is supported by a convincing amount of experimental evidence. But it raises the question: who is empirically more likely to shoot when they can?
I have to admit, this is a really hard question to tackle. But when there’s data, there’s a way. And fortunately my friends here at ASA have some awesome data, so I gave it a go. To answer this question, I examined field position and the three primary offensive actions: pass, dribble or shoot. The data span all MLS teams and players from the 2017 season (up to week 9).
Here’s what I did. First, I computed the odds of a shot going in from everywhere on the field. It’s basically expected goals without all the frills, so let’s call it “pseudo expected goals” (nerds, see Methodological Note 1, at bottom).
Then, I took data on all shots, passes and dribbles so far in 2017, and measured the probability of a team choosing to do each, dependent on the location on the field. By “location on the field”, here I’m referring to that “pseudo expected goals” metric I mentioned above. To measure these probabilities, I used a statistical technique called multinomial logistic regression, which does exactly what I’m trying to accomplish (see Methodological Note 2).
The result is a really interesting set of probability distributions. Figure 1 shows these distributions for three example teams:
The x-axis here is the quality of goal-scoring position on the field (0 = nobody should ever shoot from here, 1 = right in front of the goal). The heights of the lines show how likely each team is to do those three things at each place on the field. For example, everyone is super-likely to pass when they have the ball near their defending end, as shown by the dark blue line when it’s really high off to the left. Similarly, everyone has a really high probability of shooting (green line) when they’re in a good shooting position, obviously.
What’s interesting is the differences. You can see that Chicago (the first panel) is a team that loves to shoot. They reach a near-100% chance of shooting even in only moderately good field positions. Houston, on the other hand, loves to dribble. Even though dribbling (light blue line) never reaches the same heights as shots or passes, Houston is more likely to dribble basically anywhere on the field than the other two teams. They’re also relatively likely to dribble even when they’re in a great shooting position, as shown by the low-ish shot probability and high-ish dribble probability at the right end of their panel. Philadelphia is an example of a team that loves to pass the ball. Their dark blue line stays high much further to the right (into the “shoot the dang ball” territory), being overtaken by shot probability later than the other teams.
I can use these estimates to produce a ranking of teams too. There are a number of ways to do this, but one good one is to simply take the area under the curve of each line:
| Team | Total Probability of Shooting | Total Probability of Passing | Total Probability of Dribbling |
You can see that I chose those three teams because they’re the highest ranked in each category. Chicago has the highest total probability of shooting by only a small margin, followed by Orlando and Seattle. Philadelphia has the highest total probability of passing by a lot, with Colorado and New England next behind them. Houston, LA and Philadelphia have the highest total probabilities of dribbling, in that order.
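For anyone curious how an area-under-the-curve ranking like this can be computed, here is a minimal Python sketch. It is only an illustration: the team names and curve values are made up, the function name `area_under_curve` is hypothetical, and the trapezoidal rule is one reasonable way to approximate the area, not necessarily the exact method behind the table.

```python
def area_under_curve(xs, ys):
    """Approximate the area under a sampled curve with the trapezoidal rule."""
    total = 0.0
    for i in range(1, len(xs)):
        total += 0.5 * (ys[i] + ys[i - 1]) * (xs[i] - xs[i - 1])
    return total

# Made-up shot-probability curves, sampled on a grid of pseudo-xG values (0 to 1).
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
shot_curves = {
    "Team A": [0.0, 0.2, 0.6, 0.9, 1.0],
    "Team B": [0.0, 0.1, 0.3, 0.7, 1.0],
}

# A larger area means the team is more inclined to shoot across field positions.
areas = {team: area_under_curve(grid, ys) for team, ys in shot_curves.items()}
ranking = sorted(areas, key=areas.get, reverse=True)
for team in ranking:
    print(f"{team}: {areas[team]:.2f}")
```

The same function would be applied to the pass and dribble curves to fill in the other two columns.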
It’s also interesting to see how these probability distributions compare in their entirety. In the following graphs, I’ve taken every team and just stacked their graphs on top of each other (separately by conference so it’s easier to see). You can see that there’s a lot of variance in dribbling. Houston (as I’ve already covered) dribbles a lot under a lot of circumstances, but Seattle has a very low and very narrow distribution. On the other hand, the probability distribution for passing is very consistent across teams. Shots are in between.
So what does this mean? I think there are a few things about this analysis that are useful. First, it’s just an objective way to describe team patterns, which is thought-provoking in its own right. It’s also useful to put other statistics into context though. For example, Chicago are actually second-to-last in shots taken so far this year, yet at the top of my table above. If Chicago has a problem, it isn't that they choose not to shoot, it’s that they don’t seem to get the ball in good enough shooting positions. New England is the opposite. They’re near the top of the league in shots taken, despite (according to the above analysis) being rather stingy in their shooting choices. This means they must be getting the ball in goal-dangerous places pretty frequently.
There are, of course, limitations to the approach I’ve taken to explore this topic. First of all, this doesn’t really say anything about where the opposing defenders are. Maybe a team chooses to shoot liberally because the opposing teams tend to leave them the space to do it. Maybe another team chooses to dribble because it simply isn’t an option to shoot. To put it another way, I’m assuming that all three offensive actions are possible, which isn’t always true (though some players do clearly have a knack for just “finding a way to shoot”). Another is that I’m simplifying some of the complexity of offensive actions by just looking at this “pseudo expected goals” metric. A more complete analysis might look at multi-dimensional patterns (like a heatmap or something…stay tuned for my next article) rather than relying on a data-reduction technique like I’ve done here. Then again, univariate analyses are simpler to interpret. Lastly, there are a handful of statistical assumptions made through the use of logistic and multinomial regressions. I won’t get into those, but they mostly make it hard to do things like quantify uncertainty, compare p-values and so on. I didn’t even go there.
Despite the limitations, I think this is a telling analysis of how teams play. Over time, these figures could be revisited to visualize changing patterns or even evaluate something like the effect a new hire has on team “style”. Maybe most of all though, it gives me (a Sounders fan) pause before I annoyingly yell something like “THEY NEVER SHOOT THE BALL WHEN THEY CAN”. That’s just my bias speaking.
Methodological Note 1: What I mean by “pseudo-expected goals”
The basic principle of this analysis is that a player’s location on the field is the main determinant of which type of action a player takes, but that can vary by team. So, the goal is to measure the probability of each action as a function of how good of a shooting location it was taken at. In other words, if a given team has the ball at the top of the 18 (a relatively good shooting location), what’s the probability that they will shoot, pass or dribble? And how does that compare to the top of the 6 (an even better shooting location)? How does that compare to the corner of the box (worse)? What are these probabilities at any arbitrary location in the defensive half (a very bad shooting location)? Key to answering the research question above is to do this for each team.
First, to assess field position. I computed what could be called “pseudo-expected goals”, or expected goals (xG) where the location and angle are the only factors in the metric. Specifically, I estimated the odds of a shot turning into a goal for everywhere on the field using all observed shots (and whether or not they were successful) so far this season. That’s a sample size of 1,244 shots from various locations, 130 of which found the back of the net. This was done using a simple logistic regression (per usual for xG-like analyses). I limited this to shots from open play and shots with the ball at the players’ feet for comparability with passes and dribbles.
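To make the idea concrete, here is a rough Python sketch of a location-only logistic regression. Everything in it is an assumption for illustration: the shots are synthetic (not the 1,244 real ones), the two features are distance from the attacking goal line and distance from the center line (taken here to mean the lengthwise middle of the field), distances are in tens of meters, and the fit uses plain gradient ascent rather than whatever statistical package was actually used.

```python
import math
import random

def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.2, steps=1000):
    """Fit [intercept, slope_1, slope_2] by gradient ascent on the mean log-likelihood."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
            err = y - sigmoid(z)  # observed outcome minus predicted probability
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w + lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def pseudo_xg(weights, from_goal_line, from_center_line):
    """Predicted scoring probability at a location (distances in tens of meters)."""
    return sigmoid(weights[0] + weights[1] * from_goal_line + weights[2] * from_center_line)

# Synthetic shots, NOT real MLS data: scoring gets less likely farther from goal.
random.seed(0)
shots, goals = [], []
for _ in range(300):
    d = random.uniform(0.0, 4.0)  # distance from the attacking goal line
    c = random.uniform(0.0, 3.0)  # distance from the center line
    shots.append((d, c))
    goals.append(1 if random.random() < sigmoid(1.0 - 1.0 * d - 0.5 * c) else 0)

weights = fit_logistic(shots, goals)
near = pseudo_xg(weights, 0.5, 0.0)  # close to goal and central
far = pseudo_xg(weights, 3.5, 0.0)   # far from goal and central
print(f"near: {near:.2f}, far: {far:.2f}")
```

The fitted slope on distance from the goal line comes out negative, so the predicted scoring probability drops as the shot location moves away from goal.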
Next, to connect this to the choice of actions. Like the shots data, passes and dribbles can be pinpointed to exactly where on the field they happened. So I computed the odds of a shot going in at every location in the data, regardless of type of action. In other words it’s like asking “if a player somehow found a way to shoot from where he passed/dribbled, what would be the odds of scoring?”
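In code, that scoring step might look something like the sketch below. The coefficients are hypothetical stand-ins for a fitted shot model (they are not the estimates from this analysis), and the events are made up; the point is only that the same function is applied to every action, whether or not it was a shot.

```python
import math

# Hypothetical coefficients standing in for a fitted shot model (illustration only).
INTERCEPT = -1.0
PER_METER_FROM_GOAL_LINE = -0.10
PER_METER_FROM_CENTER_LINE = -0.05

def pseudo_xg(from_goal_line_m, from_center_line_m):
    """Scoring probability if a shot were taken from this location."""
    log_odds = (INTERCEPT
                + PER_METER_FROM_GOAL_LINE * from_goal_line_m
                + PER_METER_FROM_CENTER_LINE * from_center_line_m)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Made-up events: (team, action, meters from goal line, meters from center line).
events = [
    ("Team A", "pass", 60.0, 10.0),
    ("Team A", "dribble", 25.0, 15.0),
    ("Team B", "shot", 10.0, 2.0),
]

# Every action gets a pseudo-xG value based purely on where it happened.
scored = [(team, action, pseudo_xg(d, c)) for team, action, d, c in events]
for team, action, xg in scored:
    print(f"{team} {action}: pseudo-xG {xg:.4f}")
```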
So far, this basically just takes the locations of all the observed offensive actions and quantifies how goal-dangerous that position was. The hard part is using it to determine the tendencies of different teams. For this, I used a technique called multinomial logistic regression (see Methodological Note 2). The basic idea is that in theory a player could try to shoot, pass or dribble from anywhere, and the choice is up to him. As the ball gets closer and closer to the goal, it becomes more likely that he will shoot, but it’s not the same for every team. Multinomial logistic regression allows me to measure the probability of each of the three actions (pass/shoot/dribble) happening as a function of field position (measured as “pseudo-expected goals” defined above) and do it for each team separately. In (even more) statistical speak, that amounts to an interaction term between team fixed effects and pseudo-expected goals.
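Here is a small Python sketch of what that interaction means in practice: each team gets its own intercept and its own slope on pseudo-expected goals for each action, with passing as the reference action. The team names and coefficient values are invented for illustration, and nothing is being fitted here; the sketch only shows how a set of coefficients turns a field position into three probabilities.

```python
import math

# Hypothetical (intercept, slope on pseudo-xG) per team and action.
# Passing is the reference action, so its linear predictor is fixed at 0.
COEFS = {
    "Shoot-happy FC": {"shoot": (-4.0, 12.0), "dribble": (-2.0, 2.0)},
    "Pass-first SC": {"shoot": (-6.0, 10.0), "dribble": (-2.0, 2.0)},
}

def action_probabilities(team, xg):
    """Return P(pass), P(shoot), P(dribble) for one team at one pseudo-xG value."""
    scores = {"pass": 0.0}
    for action, (intercept, slope) in COEFS[team].items():
        scores[action] = intercept + slope * xg  # team-specific intercept and slope
    denom = sum(math.exp(s) for s in scores.values())
    return {action: math.exp(s) / denom for action, s in scores.items()}

for team in COEFS:
    probs = action_probabilities(team, 0.5)
    print(team, {action: round(p, 2) for action, p in probs.items()})
```

Evaluating `action_probabilities` over a grid of pseudo-xG values from 0 to 1 would trace out curves of the kind shown in the team panels.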
Methodological Note 2: Further Details on Regressions
Multinomial logistic regression is just a handy statistical method for measuring the odds of one of a few options happening. If regular logistic regression measures the odds of “heads” in a coin toss, multinomial logistic regression measures the odds of each number in a dice roll (with any number of sides to the die).
The important thing is that it estimates those odds in tandem with each other so that they perfectly sum to 100%. In the first graphs above, you can literally sum the height of the three lines at any cross section along the x-axis and get 100%. Based on the principles in Methodological Note 1, the task is to do exactly that between the three offensive choices.
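For readers who want the algebra behind the sum-to-100% property, here is the standard multinomial logit form with passing as the reference action. The notation is assumed shorthand rather than anything taken from the model output: x is pseudo-expected goals, t is the team, and the alphas and betas are team-specific intercepts and slopes for shooting (s) and dribbling (d).

```latex
\begin{aligned}
D(x, t) &= 1 + e^{\alpha_t^{s} + \beta_t^{s} x} + e^{\alpha_t^{d} + \beta_t^{d} x} \\
P(\text{pass} \mid x, t) &= \frac{1}{D(x, t)} \\
P(\text{shoot} \mid x, t) &= \frac{e^{\alpha_t^{s} + \beta_t^{s} x}}{D(x, t)} \\
P(\text{dribble} \mid x, t) &= \frac{e^{\alpha_t^{d} + \beta_t^{d} x}}{D(x, t)} \\
P(\text{pass} \mid x, t) + P(\text{shoot} \mid x, t) + P(\text{dribble} \mid x, t) &= \frac{D(x, t)}{D(x, t)} = 1
\end{aligned}
```

Because all three probabilities share the same denominator, and the numerators add up to exactly that denominator, the three lines always sum to 100% at any point on the x-axis.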
Back to the pseudo expected goals regression, here’s the actual regression formula:
The results are really straightforward: the closer a player is to the goal mouth, the more likely it is a shot will go in:
In other words, if the player finds a way to shoot from 1 meter closer to the attacking goal line, the odds of scoring will go up by 0.12. If he shoots from 1 meter farther away from the center line (all else equal), the odds of scoring will go down by 0.07. These may seem like small differences, but we’re talking about the whole field. I should point out that these estimates are highly significant based on p-values.
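A quick worked conversion may help put those numbers in context. This assumes the 0.12 and 0.07 are log-odds coefficients, which is the usual way logistic regression output is reported; the text does not say so explicitly, so treat this as an illustration rather than a restatement of the results.

```latex
\begin{aligned}
e^{0.12} &\approx 1.13 && \text{(odds multiplier per meter closer to the attacking goal line)} \\
e^{-0.07} &\approx 0.93 && \text{(odds multiplier per meter farther from the center line)} \\
e^{0.12 \times 20} = e^{2.4} &\approx 11.0 && \text{(odds multiplier for being 20 meters closer)}
\end{aligned}
```

Under that assumption, moving 20 meters closer to goal multiplies the odds of scoring roughly elevenfold, which is why small per-meter effects add up over the whole field.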