Year-to-year Correlations with pretty plots

By Matthias Kullowatz (@mattyanselmo)

As I began constructing prediction models for this season, I was faced with the obvious problem of dealing with small sample sizes. Teams have played three or four games to this point, which isn't much to go on when trying to forecast their futures. Portland, for example, has produced the fifth-best expected goal differential in the league (xGD of +0.22), but is missing its two best midfielders. I'm skeptical that the Timbers will be able to maintain that in the coming weeks. So I'm looking to last season to help me out with the beginning of this season.

Below are some heat plots depicting the correlation of six metrics to themselves. For example, if we sum each team's goals scored in its last 10 games of the past season and correlate that to its goals scored in the first 10 games of this season, we get a correlation coefficient of 0.195. The highest correlations never breached 0.60, so a "red hot" correlation in the plots is about 0.60. Each of these correlations comes from a sample of 56 teams (18 across 2011-12, and 19 each across 2012-13 and 2013-14).
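For anyone who wants to reproduce a cell of these heat plots, here's a minimal sketch in Python, assuming a table of per-team, per-game totals. The file and column names are hypothetical, and 34-game seasons are assumed:

```python
import pandas as pd

# One row per team per game; hypothetical columns:
# 'team', 'season', 'game_num', 'goals' (swap in xG, shots, etc.).
games = pd.read_csv("team_games.csv")

def window_total(df, season, first, last, col="goals"):
    """Sum a metric over games first..last of a season, by team."""
    in_window = (df["season"] == season) & df["game_num"].between(first, last)
    return df[in_window].groupby("team")[col].sum()

# One heat-plot cell: last 10 games of 2012 vs. first 10 games of 2013.
late = window_total(games, season=2012, first=25, last=34)
early = window_total(games, season=2013, first=1, last=10)
pair = pd.concat([late, early], axis=1, keys=["late", "early"]).dropna()
print(pair["late"].corr(pair["early"]))  # ~0.195 for goals scored
```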

Notes

For the most part, expected goals stabilize to a greater degree than raw goals across the off-season. 

Goals Allowed is a strange metric in that the number of goals a team allows in its last game of one season--a single game!--correlates strongly to its goals allowed during the next season. My theory is that teams that have thrown in the towel by season's end tend to play more openly and are likely to allow more goals toward the end of a season. Those same teams tend not to be good--that's why they're not in the playoffs--and they continue to suck in the following season.

Expected Goal Differential shows a very strong correlation across the off-season, and I'm eager to employ some previous-season xGD data in the prediction models.

Next up, I'll look at the xGD in even gamestates across the off-season, and I'm hoping to publish those prediction models by Even Better Monday (the one after Good Friday). So be on the lookout!

Do expected goals models lack style?

By Jared Young (@JaredEYoung)

Expected goals models are hip in the land of soccer statistics. If you have developed one, you are no doubt sporting some serious soccer knowledge. But it seems to be consistent across time and geography that the smart kids always lack a bit of style.

If you are reading this post you are probably at least reasonably aware of what an expected goals model is. It tells you how many goals a team should have scored given the shots they took. Analysts can then compare the goals actually scored with the goals a team was expected to score and use that insight to better understand players and teams and their abilities.

The best expected goals models incorporate almost everything imaginable about the shot. What body part did the shooter connect with? What were the exact X,Y coordinates of the shooter? What was the position of the goalie? Did the player receive a pass beforehand? Was it a set piece? All of these factors are part of the model. Like I said, they are really cool.

But as with all models of the real world, there is room for improvement. For example, expected goals models aren't great at factoring in the number of defenders between the shooter and the goal. A crowd of defenders could force a higher number of blocked shots, or simply push the shooter into a more difficult attempt than he would like. On the opposite end of that spectrum, if a shooter was wide open on a counterattack, the models would likely not recognize that situation and would undervalue the likelihood of a goal being scored. But I may have found something that will help in these instances.

I recently created a score that attempted to numerically define extreme styles of play. On the one end of the score are extreme counterattacking teams (score of 1) and on the other end are extreme possession-oriented teams (score of 7). The question is, if I overlay this score on top of expected goals models, will I find any opportunities like those mentioned above? It appears there are indeed places where looking at style will help.

I have only scored one full MLS season with the Proactive Score (PScore) so I’ll start with MLS in 2014, where I found two expected goals models with sufficient data. There is the model managed here by the American Soccer Analysis team (us!) and there is the publicly available data compiled by Michael Caley (@MC_of_A). Here is a chart of the full season’s average PScore and the difference between goals scored and expected goals scored for the ASA model and Michael Caley’s model.

Both models are pretty similar. If you were to draw a straight-line regression through this data you would find nothing in particular. But allowing a polynomial curve to find a best fit reveals an interesting pattern in both charts. When the PScores are below 3, indicating strong counterattacking play, the two models consistently underpredict the number of goals scored. This makes sense given what I mentioned above: teams committed to the counterattack should find more space when shooting and should have a better chance of making their shots. Michael Caley's model does a better job handling it, but there is still room for improvement.
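For the curious, the best-fit curves above are nothing exotic. Here's a minimal numpy sketch of the same idea, using made-up toy numbers in place of the real team values:

```python
import numpy as np

# Toy numbers for illustration only; replace with each team's
# season-average PScore (1-7) and its goals minus expected goals.
pscore = np.array([1.5, 2.0, 2.5, 3.0, 3.3, 3.6, 4.0, 4.5, 5.0, 5.5])
g_minus_xg = np.array([5.0, 3.0, 1.0, -2.0, -3.0, -2.5, -1.0, 1.0, 2.0, 3.5])

# A straight line finds nothing in particular...
print("linear:", np.poly1d(np.polyfit(pscore, g_minus_xg, 1)))

# ...but a second-degree polynomial surfaces the U-shape described above:
# underprediction at the counterattacking (<3) and possession (>4) extremes.
print("quadratic:", np.poly1d(np.polyfit(pscore, g_minus_xg, 2)))
```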

It’s worth pointing out that teams that rely on the counterattack tend to be teams that consider themselves less talented (I repeat, tend to be). But you would think that less-talented teams would also have below-average shooters. The fact that counterattacking teams outperform the model indicates they might also be overcoming a talent gap to do so.

On the other hand, when the PScore is greater than 4, the models also underpredict the actual performance. This, however, might be for a different reason. Usually possession-oriented teams are facing more defenders when shooting. The bias here may be a result of the fact that teams that can outpossess their opponent to that level may also have the shooting talent to outperform the model.

Notice also where most teams reside, between 3 and 4. This appears to be no man’s land; a place where the uncommitted or incapable teams underperform.

Looking at teams in aggregate, however, comes with its share of bias, most notably the hypothesis I suggested for possession-oriented teams. To remove that bias, I looked at each game played in MLS in 2014, home and away, and plotted those same metrics. I did not have Michael Caley’s data by game, so I only looked at the ASA model.

For both home and away games there does appear to be a consistent bias against counterattacking teams. In games where teams produce strong counterattacking PScores of 1 or 2, we see them also typically outperforming expected goals (G - xG). Given that xG models are somewhat blind to defensive density, it would make perfect sense that counterattacking teams shoot better than expected. By design they should have more open shots than teams that play possession soccer. It definitely appears to me that xG models should somehow factor in teams that are playing counterattacking soccer, or they will underestimate goals for those teams.

What’s interesting is that the same bias does not reveal itself as clearly at the other end of the spectrum, like we saw in the first graph. When looking at the high-possession teams -- the sixes and sevens -- the teams' efficiencies become murkier. If anything, it appears that being more proactive to an extreme is detrimental to efficiency (G - xG), especially for away teams. The best-fit line doesn’t quite do the situation justice. When away teams are very possession-oriented, with a PScore of 6 or 7, they actually underperform the ASA xG model by an average of 0.3 goals per game. That seems meaningful, and might suggest that gamestates are playing a role in confusing us. With larger sample sizes this phenomenon could be explored further, but for now it's safe to say that when a team plays a counterattacking game, it tends to outperform its expected goals.

Focusing on home teams with high possession over the course of the season, we saw an uptick in goals minus expected goals. But it doesn’t appear to be the case that possession-oriented teams shoot better due to possession itself, based on the trends we saw from game to game. It seems that possession-oriented teams play that way because they have the talent to, and it’s the talent on the team that is driving them to outperform their expected goals.

So should xG models make adjustments for styles of play? It really depends on the goal of the model. If the goal is to be supremely accurate, then I would say that xG models should look at the style of play and make adjustments. However, style is not specific to one shot; it plays out over an entire game. Will modelers want to overlay macro conditions on their models rather than solely focus on the unique conditions of each shot?

Perhaps the model should allow this bias to continue. After all, it could reveal that counterattacking teams have an advantage in scoring as one would expect.

If the xG models look to isolate shots based on certain characteristics, perhaps they should strive to add data to each particular moment. Perhaps an aggregate overlay on counterattacks would be counterproductive as it would take the foot off the pedal of collecting better data for each shot taken. Perhaps this serves as inspiration to keep digging, keep mining for the data that helps fix this apparent bias. Perhaps it’s the impetus to shed the sweater vest and find an old worn-in pair of boots. Something a little more hip to match the intellect.

World Cup Statistics

We have begun rolling out World Cup statistics in the same format as those we provide for MLS. Scroll over "World Cup 2014" along the top bar to check it out! In the Team Stats Tables, one may observe that the recently eliminated Spain outshot its opponents, and a much higher proportion of its possession occurred in the attacking third than that of its opponents.

Our team-by-team Expected Goals data shows that England played better than its results would suggest, earning more dangerous opportunities than its opponents. It was a matter of inches for Wayne Rooney a few times there...


Finishing data suggests that James Rodriguez has made the most of his opportunities---surprise, surprise---but did you know that none of Thomas Muller's first seven shots were assisted?

And despite giving up a tournament-high seven goals in the group stages, our Goalkeeping Data actually suggests that Honduran goalkeeper Noel Valladares performed admirably---especially considering the onslaught of shots he faced that were worth a tournament-most 0.4 goals per shot on target.

USA versus Ghana: Gamestates Analysis

In analyzing MLS shot data, I have learned that---with small sample sizes---how a team plays when the game is tied is a strong indication of how well it will do in future games. The US Men's National Team spent just four-and-a-half minutes tied Monday evening, the epitome of small sample sizes. In case you were curious, the US generated two shots during that time worth about 0.13 goals. Ghana did not generate a shot over those 4.5 minutes. The next most-important gamestate for a team is being ahead. With at least 17 games of data in MLS, knowing how well a team did when it was leading becomes an important piece of information for predicting that team's future success. Almost 95 minutes were spent with the US in the lead, a time in which the USMNT took six shots worth 0.5 goals to Ghana's 21 shots worth 1.7 goals.

Though MLS is definitely far below the level of even a USA-versus-Ghana match, I think a lot of the statistics from our MLS database still apply. I wrote a few weeks back about how away teams that were satisfied with the current gamestate went overboard with their conservative play. I think that could apply to the World Cup, as well. By most statistical accounts, USA versus Ghana was a fairly even matchup going in, yet the US played an annoyingly conservative style after going up a goal early. It gave up a majority of possession to Ghana in the attacking third, completing just 81 passes to Ghana's 171 in that zone---not to mention the US being tripled up in Expected Goals when it was ahead.

Granted, Expected Goals likely overestimates the losing team's chances of scoring. But not by much. In even gamestates in MLS, we see that teams are expected to score 1.29 goals per game, and they actually score 1.30 goals per game. Virtually no difference. However, when teams are behind they are expected to score 1.79 goals per game, yet they only score about 1.60---an 11-percent drop. This discrepancy is likely due in large part to defenses being more packed in and capable of blocking shots. Indeed, teams that are losing have their shots blocked 27 percent of the time, while teams that are winning only have their shots blocked 22 percent of the time.

All that was simply to say that Ghana's 1.7 Expected Goals are still representative of a team that was in control---too much control for my comfort level. Even if we assume it was really about 1.5 Expected Goals against a defensive-minded American side, that still triples the USA's shot potential. Either the US strategy was overly conservative, or Ghana is really that much better. I'd like to believe the former, but that's just picking the lesser of two evils.

It just doesn't make sense to me to play conservatively to maintain the status quo. It invariably leads to massive discrepancies in Expected Goals, and too often allows the opposition an easier way to come back.

Sporting KC still has edge in the capital

If you come in from a certain angle, you can hype this evening's DC United-Sporting KC game as the Eastern Conference's clash of the week. The two teams enter this game tied for the second seed with two of the best goal differentials in the conference. With DCU playing at home, and Sporting missing half its team, the edge would appear to go to United. But not so fast. Despite being inseparable by points, DCU and Sporting are about as far apart as two teams can be by Expected Goal Differential. Sporting sits atop the league at +0.62 per game,* while DCU is ahead of only San Jose with -0.33. If we look to even gamestates---during only those times when the score was tied and the teams were playing 11-on-11---the chasm between them grows even wider. Sporting's advantage over DCU in Even xGD is more than 1.5 goals per game.*

To this point, as early as it is in the season, I have found that winners are best predicted by Even xGD, rather than overall goal differential. Though the sample size of shots is smaller for each team in these scenarios, the information is less clouded by the various tactics that are employed when one team goes ahead, or when one team loses a player.

Of course, Sporting will be missing the likes of Graham Zusi, Matt Besler, and Lawrence Olum, as it has for the past three games. The loss of those key players has mostly coincided with the current four-game winless stretch, and it would be tempting to argue that Sporting is out of form. However, over those last three games, Sporting's overall xGD is +0.27 per game,* and its Even xGD is +0.68.*

Making predictions in sports is generally just setting oneself up for failure---especially in a sport where there are three outcomes---but I will say this. Sporting is likely better than the +180 betting line I'm seeing this morning.

*I use the phrase "per game" for simplicity, but xGD is actually calculated on a per-minute basis in our season charts. Per game implies per 96 minutes, which is the average length of an MLS game.

Calculating Expected Goals 2.0

I wrote a post similar to this a while back, outlining the process for calculating our first version of Expected Goals. This is going to be harder. Get out your TI-89 calculators, please. (Or you can just use my Expected Goals Cheatsheet.) Expected Goals is founded on the idea that each shot had a certain probability of going in based on some important details about that shot. If we add up the probabilities of all a team's shots, that gives us its Expected Goals. The goal is for this metric to convey the quality of the opportunities a team earns for itself. For shooters and goalkeepers, the details about the shot change a little bit, so pay attention.

The formulas are all based on a logistic regression, which allows us to sort out the influence of each shot's many details all at once. The formula changes slightly each week because we base the regression on all the data we have, including each week's new data, but it won't change by much.
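For the mechanically inclined, the coefficients below drop out of a model fit roughly like this--a sketch, not our production code, and the column names are hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per shot; hypothetical columns:
# 'goal' (0/1), 'headed' (0/1), 'corner' (0/1), 'zone' (1-6 shot location).
shots = pd.read_csv("shots.csv")

# Logistic regression on the log odds of a shot being scored.
# C(zone) creates an indicator for each location, with zone 1 as the baseline.
model = smf.logit("goal ~ headed + corner + C(zone)", data=shots).fit()

# The location terms come out negative relative to zone 1, which is why
# the steps below say "subtract."
print(model.params)
```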

Expected Goals for a Team

  • Start with -0.19
  • Subtract 0.95 if the shot was headed (0.0 if it was kicked or othered).
  • Subtract 0.74 if the shot was taken from a corner kick (by Opta definition).
  • Subtract one of the following amounts for the shot's location:
  1. 0.0
  2. 0.93
  3. 2.37
  4. 2.68
  5. 3.55
  6. 3.06

Now you have what are called the log odds of that shot going in. To find the odds, raise the number e to the power of the log odds.

Finally, to find the estimated probability of the shot going in, divide the odds by 1 + odds.

Example: Shot from zone 3, header, taken off a corner kick:

-0.19 - 0.95 - 0.74 - 2.37 = -4.25

e^(-4.25) = .0143

.0143 / (1 + .0143) = 0.014 or a 1.4% chance of going in.

A team that took one of these shots would earn 0.014 expected goals.
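If a cheatsheet in code form is more your speed, here's the team formula as a small function. The coefficients are the ones listed above, so they will drift slightly as the regression updates:

```python
import math

# Team-model coefficients from the steps above (they shift a bit each week).
INTERCEPT = -0.19
HEADED = 0.95
CORNER = 0.74
ZONE = {1: 0.0, 2: 0.93, 3: 2.37, 4: 2.68, 5: 3.55, 6: 3.06}

def team_shot_xg(zone, headed=False, corner=False):
    """Estimated probability that a shot goes in, per the team model."""
    log_odds = (INTERCEPT
                - (HEADED if headed else 0.0)
                - (CORNER if corner else 0.0)
                - ZONE[zone])
    odds = math.exp(log_odds)  # odds = e^(log odds)
    return odds / (1 + odds)   # probability = odds / (1 + odds)

# The headed shot from zone 3 off a corner kick, as worked above:
print(round(team_shot_xg(zone=3, headed=True, corner=True), 3))  # 0.014
```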

Expected Goals for Shooter

  • Start with -0.28
  • Subtract 0.83 if the shot was headed (0.0 if it was kicked or othered).
  • Subtract 0.65 if the shot was taken from a corner kick (by Opta definition).
  • Add 2.54 if the shot was a penalty kick.
  • Add 0.71 if the shot was taken on a fastbreak (by Opta definition).
  • Add 0.16 if the shot was taken from a set piece (by Opta definition).
  • Subtract one of the following amounts for the shot's location:
  1. 0.0
  2. 1.06
  3. 2.32
  4. 2.61
  5. 3.48
  6. 2.99

Now you have what are called the log odds of that shot going in. To find the odds, raise the number e to the power of the log odds.

Finally, to find the estimated probability of the shot going in, divide the odds by 1 + odds.

Example: A penalty kick (taken from zone 2, hence the 1.06 subtraction):

-0.28 + 2.54 - 1.06 = 1.2
e^(1.2) = 3.320
3.320/ (1 + 3.320) = 0.769 or a 76.9% chance of going in.
A player that took a penalty would gain an additional 0.769 Expected Goals. If he missed, he would be underperforming his Expected Goals by 0.769.
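The shooter model in the same shape--note that the penalty, fastbreak, and set-piece terms are additions rather than subtractions:

```python
import math

def shooter_xg(zone, headed=False, corner=False, penalty=False,
               fastbreak=False, set_piece=False):
    """Estimated probability that a shot goes in, per the shooter model."""
    zone_sub = {1: 0.0, 2: 1.06, 3: 2.32, 4: 2.61, 5: 3.48, 6: 2.99}
    log_odds = (-0.28 - (0.83 if headed else 0) - (0.65 if corner else 0)
                + (2.54 if penalty else 0) + (0.71 if fastbreak else 0)
                + (0.16 if set_piece else 0) - zone_sub[zone])
    odds = math.exp(log_odds)
    return odds / (1 + odds)

# The penalty kick worked above, taken from zone 2:
print(round(shooter_xg(zone=2, penalty=True), 3))  # 0.769
```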

Expected Goals for Goalkeeper

*These are calculated only from shots on target.

  • Start with 1.61
  • Subtract 0.72 if the shot was headed (0.0 if it was kicked or othered).
  • Add 1.58 if the shot was a penalty kick.
  • Add 0.42 if the shot was taken from a set piece (by Opta definition).
  • Subtract one of the following amounts for the shot's location:
  1. 0.0
  2. 1.10
  3. 2.57
  4. 2.58
  5. 3.33
  6. 3.21
  • Subtract 1.37 if the shot was taken toward the middle third of the goal (horizontally).
  • Subtract 0.29 if the shot was taken at the lower half of the goal (vertically).
  • Add 0.35 if the shot was taken outside the width of the six-yard box and was directed toward the far post.

Now you have what are called the log odds of that shot going in. To find the odds, raise the number e to the power of the log odds.

Finally, to find the estimated probability of the shot going in, divide the odds by 1 + odds.

Example: Shot from zone 2, kicked toward the lower corner, from the run of play:

1.61 - 1.10 - 0.29 = 0.22

e^(0.22) = 1.246

1.246 / (1 + 1.246) = 0.555 or a 55.5% chance of going in.

A keeper that faced one of these shots would be charged an additional 0.555 Expected Goals against. If he saved it, then he would be outperforming his Expected Goals by 0.555.
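And the keeper model, remembering that it applies only to shots on target:

```python
import math

def keeper_xg(zone, headed=False, penalty=False, set_piece=False,
              middle_third=False, lower_half=False, far_post_wide=False):
    """Estimated probability that an on-target shot beats the keeper."""
    zone_sub = {1: 0.0, 2: 1.10, 3: 2.57, 4: 2.58, 5: 3.33, 6: 3.21}
    log_odds = (1.61 - (0.72 if headed else 0) + (1.58 if penalty else 0)
                + (0.42 if set_piece else 0) - zone_sub[zone]
                - (1.37 if middle_third else 0) - (0.29 if lower_half else 0)
                + (0.35 if far_post_wide else 0))
    odds = math.exp(log_odds)
    return odds / (1 + odds)

# The zone 2 shot worked above: kicked, lower corner, run of play.
print(round(keeper_xg(zone=2, lower_half=True), 3))  # 0.555
```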

Frequently Asked Questions

1. Why a regression model? Why not just subset each shot in a pivot table by its type across all variables?
I think a lot of information--degrees of freedom we call it--would be lost if I were to partition each shot into a specific type by location, pattern of play, body part, and for keepers, placement. The regression gets more information about, say, headed shots in general, rather than "headed shots from zone 2 off corner kicks," of which there are far fewer data points.
2. Why don't you include info about penalty kicks in the team model?
Penalty kicks are not earned in a stable manner. Teams that get lots of PK's early in the season are no more likely to get additional PK's later in the season. Since we want this metric to be predictive at the team level, including penalty kicks would cloud that prediction for teams that have received an extreme number of PK's thus far.
3. The formula looks quite a bit different for shooters versus for keepers. How is that possible since one is just taking a shot on the other?
There are a few reasons for this. The first is that the regression model for keepers is based only on shots on target. It is meant only to assess their ability to produce quality saves. A different data set leads to different regression results. Also, we are now accounting for the shooter's placement. It is very possible that corner kicks are finished less often than shots from other patterns of play because they are harder to place. By including shot placement information in the keeper model, the information about whether the shot came off a corner is now no longer needed for assessing the keeper's ability.
4. Why don't you include placement for shooters, then?
We wish to assess a shooter's ability to create goals beyond what's expected. Part of that skill is placement. When a shooter has recorded more goals than his expected goals, it indicates a player that is outperforming his expectation. It could be because he places well, or that he is deceptive, or he is good at getting opportunities that are better than what the model thinks. In any case, we want the expected goals to reflect the opportunities earned, and thus the actual goals should help us to measure finishing ability to some extent.


Looking for the model-busting formula

Well that title is a little contradictory, no? If there's a formula to beat the model then it should be part of the model and thus no longer a model buster. But I digress. That article about RSL last week sparked some good conversation about figuring out what makes one team's shots potentially worth more than those of another team. RSL scored 56 goals (by their own bodies) last season, but were only expected to score 44, a 12-goal discrepancy. Before getting into where that came from, here's how our Expected Goals data values each shot:

  1. Shot Location: Where the shot was taken
  2. Body part: Headed or kicked
  3. Gamestate: xGD is calculated in total, and also specifically during even gamestates when teams are most likely playing more, shall we say, competitively.
  4. Pattern of Play: What the situation on the field was like. For instance, shots taken off corner kicks have a lower chance of going in, likely due to a packed 18-yard box. These things are considered, based on the Opta definitions for pattern of play.

But these exclude some potentially important information, as Steve Fenn and Jared Young pointed out. I would say, based on their comments, that the two primary hindrances to our model are:

  1. How to differentiate between the "sub-zones" of each zone. As Steve put it, was the shot from the far corner of zone 2, more than 18 yards from goal? Or was it from right up next to zone 1, about 6.5 yards from goal?
  2. How clean a look the shooter got. A proportion of blocked shots could help to explain some of that, but we're still missing the time component and the goalkeeper's positioning. How much time did the shooter have to place his shot and how open was the net?

Unfortunately, I can't go get a better data set right now so hindrance number 1 will have to wait. But I can use the data set that I already have to explore some other trends that may help to identify potential sources of RSL's ability to finish. My focus here will be on their offense, using some of the ideas from the second point about getting a clean look at goal.

Since we have information about shot placement, let's look at that first. I broke down each shot on target by which sixth of the goal it targeted to assess RSL's accuracy and placement. Since the 2013 season, RSL has been second in the league at getting its shots on goal (37.25%), and among those shots on goal, RSL places the ball better than any other team. Below is a graphic of the league's placement rates versus those of RSL over that same time period. (The corner shots were consolidated for this analysis because it didn't matter to which corner the shot was placed.)

Placement Distribution - RSL vs. League


RSL obviously placed shots where the keeper was least likely to be: in the corners. That's a good strategy, I hear. If I include shot placement in the model, RSL's 12-goal difference in 2013 completely evaporates. This new model expected them to score 55.87 goals in 2013, almost exactly the 56 they scored.

Admittedly, it isn't earth-shattering news that teams score by shooting at the corners, but I still think it's important. In baseball, we sometimes assess hitters and pitchers by their batting average on balls in play (BABIP), a success rate measured only in those instances when the ball is put in play. It's obvious that batters with higher BABIPs will also have higher overall batting averages, just like teams that shoot toward the corners will score more goals.

But just because it is obvious doesn't mean that this information is worthless. On the contrary, baseball's sabermetricians have figured out that BABIP takes a long time to stabilize, and that a player who is outperforming or underperforming his BABIP is likely to regress. Now that we know that RSL is beating the model due to its shot placement, this raises the question: do accuracy and placement stabilize at the team level?

To some degree, yes! First, there is a relationship between a team's shots on target totals from the first half of the season and the second half of the season. Between 2011 and 2013, the correlation coefficient for 56 team-seasons was 0.29. Not huge, but it does exist. Looking further, I calculated the differences between teams' expected goals in our current model and teams' expected goals in this new shot placement model. The correlation from first half to second half on that one was 0.54.
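Mechanically, those split-half checks are just correlations over team-seasons. A minimal sketch, with hypothetical file and column names:

```python
import pandas as pd

# One row per team-season; hypothetical columns hold a metric totaled
# over games 1-17 ('first_half') and games 18-34 ('second_half').
halves = pd.read_csv("team_season_halves.csv")
print(halves["first_half"].corr(halves["second_half"]))
# ~0.29 for shots-on-target totals; ~0.54 for the placement-model xG gap
```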

To summarize, getting shots on goal is repeatable only to a small degree, but where those shots are placed within the goal is repeatable to a much greater degree. There is some stabilization going on. This gives RSL fans hope that at least some of this model-busting is due to a skill that will stick around.

Of course, that still doesn't tell us why RSL is placing shots well as a team. Are their players more skilled? Or is it the system that creates a greater proportion of wide-open looks?

Seeking details that may indicate a better shot opportunity, I will start with assisted shots. A large proportion of assisted shots may indicate that a team will find open players in front of net more often, thus creating more time and space for shots. However, an assisted shot is no more likely to go in than an unassisted one, and RSL's 74.9-percent assist rate is only marginally better than the league's 73.1 percent, anyway. RSL actually scored about six fewer goals than expected on assisted shots, and six more goals than expected on unassisted shots. It becomes apparent that we're barking up the wrong tree here.*

Are some teams more capable of not getting their shots blocked? If so, then those teams would likely finish better than the league average. One little problem with this theory is that RSL gets its shots blocked more often than the league average. Plus, in 2013, blocked-shot percentages from the first half of the season had a (statistically insignificant) negative correlation to blocked shots in the second half of the season, suggesting that blocked shots are influenced more by randomness and the opposing defense than by the offense taking the shots.

Maybe some teams get easier looks by forcing rebounds and following them up efficiently. Indeed, in 2013 RSL led the league in "rebound goals scored" with nine, where a rebounded shot is one that occurs within five seconds of the previous shot. That beat their expected goals on those particular shots by 5.6 goals. However, earning rebounds does not appear to be much of a skill, and neither does finishing them. The correlation between first-half and second-half rebound chances was a meager--and statistically insignificant--0.13, while the added value of a "rebound variable" to the expected goals model was virtually unnoticeable. RSL could be the best team at tucking away rebounds, but that's not a repeatable league-wide skill. And much of that 5.6-goal advantage is explained by the fact that RSL places the ball well, regardless of whether or not the shot came off a rebound.
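That five-second definition is easy to operationalize, by the way. Here's a sketch, assuming a shot table with a match-clock column (the names are hypothetical):

```python
import pandas as pd

# One row per shot; hypothetical columns:
# 'game_id', 'second' (match clock), 'team', 'goal' (0/1).
shots = pd.read_csv("shots.csv").sort_values(["game_id", "second"])

# A rebound is any shot within five seconds of the previous shot in that game.
prev_second = shots.groupby("game_id")["second"].shift(1)
shots["rebound"] = (shots["second"] - prev_second) <= 5

shots["rebound_goal"] = shots["rebound"] & (shots["goal"] == 1)
print(shots.groupby("team")["rebound_goal"].sum().nlargest(5))  # RSL: 9 in 2013
```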

Jared did some research for us showing that teams that get an extremely high number of shots within a game are less likely to score on each shot. It probably has something to do with going for quantity rather than quality, and possibly playing from behind and having to fire away against a packed box. While that applies within a game, it does not seem to apply over the course of a season. Between 2011 and 2013, the correlation between a team's attempts per game and its finishing rate per attempt was virtually zero.

If RSL spends a lot of time in the lead and very little time playing from behind--true for many winning teams--then its chances may come more often against stretched defenses. RSL spent the fourth most minutes in 2013 with the lead, and the fifth fewest minutes playing from behind. In 2013, there was a 0.47 correlation between teams' abilities to outperform Expected Goals and the ratio of time they spent in positive versus negative gamestates.

If RSL's boost in scoring comes mostly from those times when they are in the lead, that would be bad news, since their Expected Goals data in even gamestates was not impressive then and is not impressive now. But if the difference comes more from shot placement, then the team could retain some of its goal-scoring prowess. Of the 12-goal discrepancy I'm trying to explain in 2013, 8.3 goals came during even gamestates, when perhaps their ability to place shots helped them to beat the expectations. The other 4-ish additional goals likely came from spending increased time in positive gamestates. It is my guess that RSL won't be able to outperform their even-gamestate expectation by nearly as much this season, but at this point, I wouldn't put it past them either.

We come to the unsatisfying conclusion that we still don't know exactly why RSL is beating the model. Maybe the players are more skilled, maybe the attack leaves defenses out of position, maybe it spent more time in positive gamestates than it "should have." And maybe RSL just gets a bunch of shots from the closest edge of each zone. Better data sets will hopefully sort this out someday.

*This doesn't necessarily suggest that assisted shots have no advantage. It could be that assisted shots are more commonly taken by less-skilled finishers, and that unassisted shots are taken by the most-skilled finishers. However, even if that is true, it wouldn't explain why RSL is finishing better than expected, which is the point of this article.

ASA Podcast XLIV: The One Where We Talk About What We Write About

Harrison and Matty discuss their two most recent articles--Harrison's Shots Created per 90 statistic and Matty's obsessive need to put RSL down because its players are more gooder at soccer than he is. It's a short one, perfect for your commute!

[mixcloud http://www.mixcloud.com/hkcrow/asa-podcast-xliv-the-one-where-we-talk-about-what-we-write/ width=660 height=180 /]

Real Salt Lake: Perennial Model Buster?

If you take a look back at 2013's expected goal differentials, probably the biggest outlier was MLS Cup runner-up Real Salt Lake. Expected to score 0.08 fewer goals per game than its opponents, RSL actually scored 0.47 more goals per game. That translates to a discrepancy of about 19 unexplained goals for the whole season. This year, RSL finds itself second in the Western Conference with a massive goal differential of 0.80. However, like last year, the expected goal differential is lagging irritatingly behind at -0.77. There are two extreme explanations for RSL's discrepancy in observed versus expected performance, and while the truth probably lies in the middle, I think it's valuable to start the discussion at the extremes and move in from there.

It could be that RSL plays a style and has the personnel to fool my expected goal differential statistic. Or, it could be that RSL is one lucky son of a bitch. Or XI lucky sons of bitches. Whatever.

Here are some ways that a team could fool expected goal differential:

  1. It could have the best fucking goalkeeper in the league.
  2. It could have players that simply finish better than the league average clip in each defined shot type.
  3. It could have defenders that make shots harder than they appear to be in each defined shot type--perhaps by forcing attackers onto their weak feet, or punching attackers in the balls whilst winding up.
  4. That's about it.

We know--okay, we're pretty sure--that RSL does indeed have the best goalkeeper in the league, and Will and I estimated Nick Rimando's value at anywhere between about six and eight goals above average* during the 2013 season. That makes up a sizable chunk of the discrepancy, but still leaves at least half unaccounted for.

The finishing ability conversation is still a controversial one, but that's where we're likely to see the rest of the difference. RSL scored 56 goals (off their own bodies rather than those of their opponents), but were only expected to score about 44. That 12-goal difference can be conveniently explained by their five top scorers--Alvaro Saborio, Javier Morales, Ned Grabavoy, Olmes Garcia, and Robbie Findley--who scored 36 goals between them while taking shots valued at 25.8 goals. (See: Individual Expected Goals, and yes, it's biased to look at just the top five goal scorers, but read on.)

Here's the catch, though. Using the sample of 28 players that recorded at least 50 shots last season and at least 5 shots this season, the correlation coefficient for the goals-above-expectation statistic is -0.43. It's negative. Basically, players that were good last year have been bad this year, and players that were bad last year have been good this year. That comes with some caveats--and if the correlation stays negative then that is a topic fit for another whole series of posts--but for our purposes here it suggests that finishing isn't stable, and thus finishing isn't really a reliable skill. The fact that RSL players have finished well for the last 14 months means very little for how they will finish in the future.

Since I said there was a third way to fool expected goal differential--defense--I should point out that once we account for Rimando, RSL's defense allowed about as many goals as expected. Thus the primary culprits in RSL's ability to outperform expected goal differential have been Nick Rimando and its top five scorers. So now we can move on to the explanation at the other extreme: luck.

RSL has been largely lucky, using the following definition of lucky: scoring goals they can't hope to score again. A common argument I might expect is that no team could be this "lucky" for this long. If you're a baseball fan, I urge you to read my piece on Matt Cain, but if not, here's the point: 19 teams have played soccer in MLS the past two seasons. The probability that at least one of them gets lucky for 1.2 seasons' worth of games is actually quite high. RSL very well may be that team--on offense, anyway.
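To put rough numbers on that argument: if each team independently had, say, a 5-percent chance of running this hot by chance alone--an illustrative figure, not an estimate from our data--then the chance that at least one of 19 teams does it looks like this:

```python
# Illustration only: assume each of 19 teams independently has a 5% chance
# of outperforming its expected goals this much over ~1.2 seasons by luck.
p_single = 0.05
p_at_least_one = 1 - (1 - p_single) ** 19
print(round(p_at_least_one, 2))  # ~0.62
```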

Unless RSL's top scorers are all outliers--which is not impossible, but unlikely--RSL is likely in for a rude awakening, and a dogfight for a playoff spot.


*Will's GSAR statistic is actually Goals Saved Above Replacement, so I had to calibrate.

ASA Podcast XLIII: The one where Matty Makes the Call

Hey everyone, here is our latest terrible--er, exhilarating--podcast for your listening pleasure. The delay in posting this week was largely due to our switch to Mixcloud, which will host us for the foreseeable future as we move away from our current site and into a domain of our own. Admittedly, we ate up a good 15 minutes at the start of the podcast talking about the Seattle-Portland match, but you saw that coming...right? The rest of the podcast is also solid, and perhaps more importantly, less Cascadia-specific, so don't give up on it just because of that segment!

[mixcloud http://www.mixcloud.com/hkcrow/asa-podcast-xliii-the-one-where-matty-gives-the-call/ width=660 height=180 /]