Editor's note: This is the first in a multi-part series on how to measure defenders using data. Check back for the next installment in a couple of weeks.
By Kareem Williams (@kareemw9)
As most of us know, Major League Soccer has taken tremendous strides to improve the game over the past decade or so. There has been a strong drive to bring in aging stars such as David Beckham (31 years old), David Villa (32 years old), Thierry Henry (33 years old), and Kaka (32 years old). The premise of this strategy is that these stars, who are slightly past their prime, will grow the fan base and the overall appeal of Major League Soccer. Most of these players are attacking superstars, mainly because for fans the most exciting part of the game is the scoring of goals. However, games are won not necessarily by scoring a ton of goals but, to loosely quote Louis van Gaal, by managing to “score one more goal than your opponent.”
An often-neglected area of the game is improving the defensive quality of a team. Below is a table that shows the number of players and the salary information for each position.
As you can see, defenders’ salaries pale in comparison to those of their attacking teammates. As a result, I have decided to dive into the defensive statistics of the top MLS teams (those that reached the Conference Semifinals). Unsurprisingly, all of the Western Conference semifinalists and all but one of the Eastern Conference semifinalists were among the top defensive teams in the league:
Seattle Sounders – 36 goals conceded
Vancouver Whitecaps – 36 goals conceded
FC Dallas – 39 goals conceded
Portland Timbers – 39 goals conceded
Montreal Impact – 44 goals conceded
DC United – 45 goals conceded
New England Revolution – 47 goals conceded
Columbus Crew – 53 goals conceded (2nd highest goals scored with 58)
This series is a statistical journey, using SAS and Processing, toward an expected non-goals model that incorporates key defensive statistics and serves as the defensive counterpart to Expected Goals.
For this exploratory analysis I first ran a Principal Component Analysis and then a Principal Component Multiple Linear Regression, using 14 variables to predict away goals.
My original 14 variables were:
Homegoals: The goals scored by the analyzed team. As mentioned above, this matters because a team must outscore its opponent to win.
AwayAttempts: The number of shots the opposing team attempts. These matter because any attempt can lead to a goal.
AwaySOG: The number of opposition shots on goal/target. A shot on goal can result in a goal, whereas a shot off target has a 0% chance of scoring.
HomeBlocks: The number of shot attempts blocked by the analyzed team.
Away Corners: The number of corners taken by the opponents, each of which can lead to a goal.
Away Cross: The number of crosses by the opponents, each of which can lead to a goal.
Away Offsides: The number of times the analyzed team's defensive unit catches the opponents offside, which eliminates an opportunity to score.
Home Fouls: The number of fouls committed by the analyzed team. This matters because a foul can give the opponent a free kick and a chance to score, so a team will generally want to minimize fouls.
Home Yellow: The number of yellow cards shown to the analyzed team.
Home Red: The number of red cards shown to the analyzed team. Each red card leaves the team with one fewer player on the field, so it becomes harder to both defend and attack.
Home duels/away duels: A coefficient that compares the number of duels won by the analyzed team with the number won by the opposition.
Home Tackles: The number of tackles won by the analyzed team.
Home Clearances: The number of times the analyzed team successfully clears the ball out of its danger zone (the 18-yard box).
Away PassPct: The percentage of passes the opponent completes. A lower number is desirable, as it means the analyzed team is intercepting more of the opponent’s passes.
The correlation matrix was not promising, as strong correlations among the variables were few and far between. Nevertheless, there were a few decent ones, such as away crosses with home clearances (63%) and away attempts with away corners (44%).
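For readers who want to try this step outside of SAS, here is a minimal Python sketch of a pairwise Pearson correlation matrix. The column names and the match numbers are made up purely for illustration; they are not the data behind the figures quoted above.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return a nested dict of pairwise correlations for every column pair."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical per-match totals (illustrative only, not real MLS data).
matches = {
    "away_cross": [12, 18, 9, 22, 15, 20],
    "home_clearances": [20, 27, 15, 35, 22, 30],
    "away_corners": [3, 6, 2, 7, 5, 4],
}

corr = correlation_matrix(matches)
for name, row in corr.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Each entry is between -1 and 1, so a figure such as "63%" corresponds to a coefficient of 0.63.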
Moving forward, the first five principal components explained over 60% of the variance. Using those five components, I ran the Principal Component Multiple Linear Regression, which showed some very interesting results.
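As a rough illustration of how the "share of variance explained" is obtained, the sketch below runs a PCA with NumPy on standardized variables and counts how many components are needed to pass 60%. Standardizing first is an assumption about how the original analysis was set up, and the 14 columns here are random numbers rather than the match data.

```python
import numpy as np

def pca_explained_variance(X):
    """PCA on standardized columns; returns variance ratios and loadings."""
    X = np.asarray(X, dtype=float)
    # Standardize so every variable contributes on the same scale.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # The covariance of standardized data is the correlation matrix.
    corr = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    # eigh returns ascending eigenvalues; reorder so component 1 is largest.
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    return eigvals / eigvals.sum(), eigvecs

# Synthetic stand-in for a matches-by-14-variables table (not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 14))
X[:, 1] += 0.8 * X[:, 0]  # make two columns correlated, for illustration

ratio, loadings = pca_explained_variance(X)
cumulative = np.cumsum(ratio)
# Smallest number of components whose cumulative share reaches 60%.
k = int(np.searchsorted(cumulative, 0.60) + 1)
print("components needed for 60% of variance:", k)
```

The columns of `loadings` show how heavily each original variable weighs on each component, which is how one would read off which statistics dominate a given component.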
At first glance, the low R squared (29%) is off-putting; however, the model as a whole has an acceptable p-value (Pr > F = 0.0004). Additionally, the regression rejects principal components 1, 2, and 5, leaving only principal components 3 and 4. As a result, we have a rough regression equation that reads:
Expected Away Goals (xAG) = Intercept + y3 × Prin3 + y4 × Prin4
Prin3 is driven essentially by “Home Goals”, “Away Shots on Goal”, “Home Blocks”, and “Away Offsides”.
Prin4 is driven essentially by “Home Red Cards” and “Away Pass Percentage”.
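To make the shape of that regression concrete, here is a hedged Python sketch of a principal component regression that keeps only the third and fourth components. It is not the SAS code used for this article: the function name, the synthetic inputs, and the fake "away goals" column are all assumptions for illustration, so the coefficients and R squared it prints say nothing about MLS.

```python
import numpy as np

def principal_component_regression(X, y, components):
    """Regress y on selected principal component scores of standardized X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]      # largest-variance component first
    scores = Z @ eigvecs[:, order]         # one column of scores per component
    # Design matrix: intercept column plus the chosen component scores.
    design = np.column_stack([np.ones(len(y)), scores[:, components]])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coefs
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return coefs, 1.0 - ss_res / ss_tot

# Synthetic stand-ins (not real data): 14 match statistics and away goals.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 14))
y = rng.poisson(1.2, size=150)

# Zero-based indices 2 and 3 correspond to Prin3 and Prin4.
coefs, r2 = principal_component_regression(X, y, components=[2, 3])
intercept, y3, y4 = coefs
print(f"xAG = {intercept:.3f} + {y3:.3f} * Prin3 + {y4:.3f} * Prin4 (R^2 = {r2:.3f})")
```

Because the component scores are centered, the fitted intercept equals the average number of away goals, and the two slopes play the role of y3 and y4 in the equation.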
As a result, our first foray into explaining the key defensive factors behind an expected opponent goals model has produced a promising first model, although it isn’t the prettiest and has a few holes. We now know that, of the original 14 variables, only six are needed for further analysis.
Be on the lookout for the next article, where I’ll continue developing this model and will hopefully reach a stronger R squared value. We will also look into the exact scenarios in which goals were scored, such as from corners, open play, and free kicks.