How it Happened: Week Three

In the three games I watched this week, five goals were scored. Two were from penalty kicks, and two were off corner kicks. Needless to say, offenses around the league are in early-season form, i.e. not exactly clicking in front of the net. On the bright side, there was a decent amount of combination play leading to chances... it's just that whole putting-them-away thing that MLS teams are still working on. Onto the main attraction:

Chicago Fire 1 - 1 New York Red Bulls

Stat that told the story for New York: 350 completed passes, 68% of which were on the left side of the field*

[Image: nyrb3, New York's passing chart]

It's hardly inspiring for the Supporters' Shield holders to sneak away from Chicago with a draw, but I actually thought they played pretty well on Sunday. Like I said above about the league as a whole, quality was missing on the final ball/shot, but New York fans shouldn't be too worried about the team's winless start. In this one there was quite a bit of good linking-up, particularly on the left flank. Given that midfielder Matt Watson was starting in a pinch as a nominal right back for the Fire, it seemed like a concerted effort from RBNY to expose a weakness on that side of the field. Between Roy Miller, Jonny Steele and Thierry Henry, there were some encouraging sequences down that side in particular; unfortunately for New York, they didn't lead to any actual goals.

*This stat/image is blatantly stolen from the Twitter account of MLS Fantasy Insider Ben Jata, @Ben_Jata. After seeing it this weekend, I was unable to think of anything better to include, so thanks, Ben!

Stat that told the story for Chicago: 24 total shots + key passes, only 2 of which were from Mike Magee

I'm not sure if this one is a good stat for Chicago fans or a bad one, but Mike Magee was conspicuously absent from a lot of the action this weekend (unless you count yelling incessantly and childishly at the ref as your definition of 'action'). But seriously: last year Chicago had 377 shots the entire season, and Magee either took or assisted on 116 of them (31%)*. Oh, and he only played 22 of their 34 games. The fact that he was involved in only 2 of the team's 24 shots (both of his shots were blocked, for what it's worth) could certainly be viewed as concerning for Chicago fans expecting another MVP-caliber season out of Magee. But on the other hand, it's easy to chalk up the struggles to the fact that this was his first game of the season after a maybe-contract-holdout-related hiatus. Also, the fact that Chicago managed to create 22 shots without Magee's direct influence (or Patrick Nyarko and Dilly Duka, both also out this weekend) has to be a good sign for a team that was often a one-man show last season: youngsters Harrison Shipp and Benji Joya in particular both seem capable of lightening the load.

*Numbers from Squawka.

 

Toronto FC 1 - 0 DC United

Stat that told the story for Toronto: 38% possession, 3 points won

[Image: tfc3, Opta chalkboard of the sequence leading to Defoe's goal]

TFC captain Michael Bradley made headlines this week saying something along the lines of possession being an overrated stat, and his team certainly appears to be trying to prove his point so far this season. The Reds didn't see a ton of the ball in their home opener, instead preferring to let DC knock the ball around with minimal penetration in the final third. And then when Toronto did win the ball, well, check out the Opta image of the sequence that led to the game's lone goal for Jermain Defoe (or watch the video). It started with a hopeful ball from keeper Julio Cesar. The second ball was recovered by Steven Caldwell, who fed Jonathan Osorio. Osorio found his midfield partner Bradley, who lofted a brilliant 7-iron to fellow DP Gilberto. The Brazilian's shot was saved but stabbed home by the sequence's final Designated Player, Defoe. Balls like that one were played multiple times throughout the game by both Bradley and Osorio, as TFC has shown no aversion to going vertical quickly upon winning the ball. And with passes like that, speedy wingers, and quality strikers, it's certainly a strategy that may continue to pay off.

Stat that told the story for DC: 1/21 completed crosses

This stat goes along a bit with what I wrote about Toronto above: they made themselves hard to penetrate in the final third, leading to plenty of incomplete crosses. Some of this high number of aimless crosses also comes from the fact that DC was chasing an equalizer and just lumping balls into the box late in the match. Still, less than 5% on completed crosses is a bit of a red flag when you look at the stat sheet, particularly when your biggest attacking threat is Eddie Johnson, who tends to be at his best when attacking balls in the air. You'd think Ben Olsen would expect a better crossing percentage. To be fair to United though, I thought they were much better in this game than they were on opening day against Columbus. They looked about 4 times more organized than two weeks ago, and about 786 times more organized than last season, and their possession and link-up play showed signs of improvement too. Still a ways to go, but at least things are trending upward for the Black and Red.

 

Colorado Rapids 2 - 0 Portland Timbers

Stat that told the story for Portland: 1 Donovan Ricketts karate kick

[Image: por3, TV screenshot of Donovan Ricketts' flying challenge]

I admit that I'm cheating here and not using a stat or an Opta Chalkboard image. But the above grainy screenshot of my TV that I took is too hilarious and impactful not to include. Colorado and Portland played a game on Saturday that some might call turgid, or testy, or any number of adjectives that are really stand-ins for the word boring. For most of the game, the most interesting moments were Ricketts' adventures in goal, which ranged from dropping floated long balls to tipping shots straight in the air to himself. In the 71st minute it appeared Ricketts had had enough and essentially dropped the mic. Flying out of his net, he leapt into the air with both feet, apparently hoping that if he looked crazy enough the ref would look away in horror instead of red carding him for the obvious kick to Deshorn Brown's chest. The Rapids converted the penalty and then added another one a few minutes later, and that was all she wrote.

Stat that told the story for Colorado: 59 total interceptions/recoveries/tackles won; 27 in the game's first 30 minutes

Alright, I was silly with the Portland section so I feel like I need to do a little serious analysis for this paragraph. The truth is that this game was fairly sloppy on both sides, which is particularly surprising considering how technically proficient Portland was for most of last season. But cold weather combined with early-season chemistry issues makes teams play sloppily sometimes, and it didn't help that Colorado came out and looked very good to start this game. Their defensive shape was very compact when the Timbers had the ball, and the Rapids were very proficient in closing down passing lanes and taking possession back. The momentum swung back and forth between the two sides a couple of times throughout the match, but Colorado's strong start set the tone that Donovan Ricketts helped carry to the final whistle.

 

Agree with my assessments? Think I'm an idiot? I always enjoy feedback. Contact me on twitter @MLSAtheist or by email at MLSAtheist@gmail.com

MLS Week 3: Expected Goals and Attacking Passes

In the coming days, Matthias will be releasing our Expected Goals 2.0 statistics for 2014. You can find the 2013 version already uploaded here. I imagine that basically everything I've been tweeting out about expected goals from our @AnalysisEvolved Twitter handle up to this point will certainly be less cool by comparison, but he informs me it won't be entirely obsolete. He'll explain when he presents it, but the concept behind the new metrics is familiar, and there is a reason why I use xGF to describe how teams performed in their attempt to win a game. It's important to understand that there is a difference between actual results and expected goals, as one yields the points in the standings and the other indicates possible future performances. However, this post isn't about expected goal differential anyway--it's about expected goals for. Offense. This obviously omits what the team did defensively (which is why xGD is so ideal for quantifying a team's overall performance), but I'm not all about the team right now. These posts are about clubs' ability to create goals through the quality of their shots. It's a different method of measurement than PWP, and really it's measuring something completely different.

Take, for instance, the game in which Columbus beat Philadelphia on a couple of goals from Bernardo Anor, who aside from those goals turned in a great game overall and was named Chris Gluck's attacking player of the week. That said, the goals Anor scored are not goals that can be consistently counted upon in the future. That's not to diminish their quality or the fact that they happened. It took talent to make both happen. They're events---a wide open header off a corner and a screamer from over 25 yards out---that I wouldn't expect him to replicate week in and week out.

Obviously Columbus got some shots in good locations and capitalized on them, but the xGF metric tells us that while they scored two goals and won the match, an average shot taker would have produced just a little more than one expected goal from those attempts. Their opponents took a cumulative eleven shots inside the 18-yard box, which we consider to be a dangerous location. Those shots, plus the six from long range, add up to nearly two goals' worth of xGF. This tells us two pretty basic things: 1) Columbus scored a lucky goal somewhere (maybe the 25-yard screamer?), and 2) they allowed a lot of shots from dangerous locations and were probably lucky to come out with the full 3 points.

Again, if you are a Columbus Crew fan and you think I'm criticizing your team's play, I'm not doing that. I'm merely looking at how many shots they produced versus how many goals they scored and telling you what would probably happen the majority of the time with those specific rates.

 

(Shots are broken out by the six shot-location zones used in our expected goals model, with zone one the most dangerous.)

 Team Zone1 Zone2 Zone3 Zone4 Zone5 Zone6 Total xGF
Chicago 1 3 3 3 3 0 13 1.283
Chivas 0 3 2 2 3 0 10 0.848
Colorado 1 4 4 2 1 1 13 1.467
Columbus 0 5 1 2 1 0 9 1.085
DC 0 0 1 1 4 0 6 0.216
FC Dallas 0 6 2 0 1 1 10 1.368
LAG 0 0 4 2 3 0 9 0.459
Montreal 2 4 5 8 7 0 26 2.27
New England 1 2 1 8 5 0 17 1.275
New York 2 4 2 0 2 0 10 1.518
Philadelphia 2 5 6 2 4 0 19 2.131
Portland 0 0 2 2 2 1 7 0.329
RSL 0 4 3 0 3 0 10 0.99
San Jose 0 2 0 0 3 0 5 0.423
Seattle 1 4 0 2 2 0 9 1.171
Sporting 2 6 2 2 3 2 17 2.071
Toronto 0 6 4 2 2 0 14 1.498
Vancouver 0 1 1 3 3 0 8 0.476

Now we've talked about this before, and one thing that xGF, or xGD for that matter, doesn't take into account is Game States---when the shot was taken and what the score was. This is something that we want to adjust for in future versions, as that sort of thing has a huge impact on team strategy and the value of each shot taken and allowed. Looking around at other games like Columbus's: Seattle scored an early goal in their match against Montreal, and as mentioned, it changed their tactics. Yet despite that, and the fact that the Sounders only had 52 total touches in the attacking third, they were still able to average a shot for every 5.8 touches in the attacking third over the course of the match.

It could imply a few different things. It tells me that Seattle took advantage of their opportunities, turning the attacking touches they did get into shots, even while allowing so many shots themselves. They probably weren't as overmatched as it might seem just from the advantage Montreal had in shots (26) and final-third touches (114). Going back to Columbus, Philadelphia was similar to Montreal in that both clubs had a good amount of touches, but the real difference between the matches is that Seattle produced a good ratio of attacking-third touches to shots (5.77) and Columbus did not (9.33).
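
For what it's worth, those ratios are just simple division on the numbers from the two tables in this post. A quick sketch (the team totals below are pulled straight from those tables):

```python
# Attacking-third touches (passes) and total shots, from the Week 3 tables in this post.
week3 = {
    "Seattle":  {"att_third_touches": 52,  "shots": 9},
    "Columbus": {"att_third_touches": 84,  "shots": 9},
    "Montreal": {"att_third_touches": 114, "shots": 26},
}

for team, row in week3.items():
    touches_per_shot = row["att_third_touches"] / row["shots"]
    print(f"{team}: {touches_per_shot:.2f} attacking-third touches per shot")
```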

These numbers don't contradict PWP. Columbus did a lot of things right, looked extremely good, and dare I say they make me look rather brilliant for picking them at the start of the season as a possible playoff contender. That said, their shot numbers are underwhelming, and if they want to score more goals they are going to need to grow a set and take some shots.

(Columns: attacking-third passes completed, incomplete, and total; attacking-third touches per shot; attacking-third passing accuracy; and key passes.)

 Team PassC PassI PassTot Touches/Shot Acc% KP
Chicago 26 17 43 3.308 60.47% 7
Chivas 32 29 61 6.100 52.46% 2
Colorado 58 27 85 6.538 68.24% 7
Columbus 53 31 84 9.333 63.10% 5
DC 61 45 106 17.667 57.55% 3
FC Dallas 34 26 60 6.000 56.67% 2
LAG 43 23 66 7.333 65.15% 6
Montreal 63 51 114 4.385 55.26% 11
New England 41 29 70 4.118 58.57% 7
New York 57 41 98 9.800 58.16% 6
Philadelphia 56 29 85 4.474 65.88% 10
Portland 10 9 19 2.714 52.63% 3
RSL 54 32 86 8.600 62.79% 3
San Jose 37 20 57 11.400 64.91% 3
Seattle 33 19 52 5.778 63.46% 5
Sporting 47 29 76 4.471 61.84% 7
Toronto 30 24 54 3.857 55.56% 6
Vancouver 21 20 41 5.125 51.22% 2

There is a lot more to comment on than just Columbus/Philadelphia and Montreal/Seattle (hi, Portland, and your 19 touches in the final third!). But these are the games that stood out to me as analytically awkward when it comes to the numbers we produce with xGF, and I thought they were good examples of how we're trying to better quantify the game. It's not that we do it perfectly---the metric is far from perfect---it's about trying to get better and move forward with this type of analysis, as opposed to just using some dried up cliché to describe a defense, like "that defense is made of warriors with steel plated testicles" or some other garbage.

This is NUUUUUuuuuummmmmbbbbbbeeerrrs. Numbers!

MLS Possession with Purpose Week 3: The best (and worst) performances

Here's my weekly analysis for your consideration as Week 3 ended Sunday evening with a 2-nil Seattle victory over Montreal. To begin, for those new to this weekly analysis, here's a link to PWP. It includes an introduction and some explanations; if you are familiar with my offerings then let's get stuck in.

First up is how all the teams compare to each other for Week 3:

Observations:

Note that Columbus remains atop the League while those who performed really well last year (like Portland) are hovering near the twilight zone. A couple of PKs awarded to the opponent and some pretty shoddy positional play defensively have a way of impacting team performance.

Note also that Toronto are mid-table here but not mid-table in the Eastern Conference standings; I'll talk more about that in my Possession with Purpose Cumulative Blog later this week.

Also note that Sporting Kansas City are second in the queue for this week; you'll see why a bit later.

A caution however - this is just a snapshot of Week 3; so Houston didn't make the list this week but will surface again in my Cumulative Index later.

The bottom dweller was not DC United this week; that honor goes to Philadelphia. Why? Well, because like the previous week, their opponent (Columbus) is top of the heap.

So who was top of the table in my PWP Strategic Attacking Index? Here's the answer for Week 3:

As noted, Columbus was top of the Week 3 table again this week, with FC Dallas and their 3-1 win against Chivas coming second, and Keane and company for LA coming third.

With Columbus taking high honors, and all the press covering Bernardo Anor, it is no surprise he took top honors in the PWP Attacking Player of the Week. But he didn't take top honors just for his two wicked goals, and the diagram below picks out many of his superb team efforts as Columbus defeated Philadelphia 2-1.

One thing to remember about Bernardo: he's a midfielder, and his game isn't all about scoring goals. Recoveries and overall passing accuracy play a huge role in his value to Columbus, and with 77 touches he was leveraged quite frequently in both the team's attack and defense this past weekend.

Anyhoo... the Top PWP Defending Team of the Week was Sporting Kansas City. This is a role very familiar to Sporting KC, as they were the top team in defending for all of MLS in 2013. You may remember that they also won the MLS Championship, showing that a strong defense is one possible route to a trophy.

Here's the overall PWP Strategic Defending Index for your consideration:

While not surprising for some, both New England and Vancouver finished 2nd and 3rd respectively; a nil-nil draw usually means both defenses performed pretty well.

So who garnered the PWP Defending Player of the Week? Most would consider Aurelien Collin a likely candidate, but instead I went with Ike Opara, as he got the nod to start in place of Matt Besler. Here's why:

Although he recorded just two defensive actions inside the 18-yard box compared to five for Collin, Opara was instrumental on both sides of the pitch in place of Besler. All told, as a center back, his defensive activities in marshaling the left side were superb, as noted in the linked MLS chalkboard diagram here. A big difference came in attack, where Opara had five shot attempts with three on target.

In closing...

My thanks again to OPTA and MLS for their MLS Chalkboard, without which this analysis could not be offered.

You can follow me on twitter @chrisgluckpwp, and, once they're published, you can read my focus articles on New York Red Bulls PWP this year at the New York Sports Hub. My first one should be published later this week.

All the best, Chris

Calculating Expected Goal Differential 1.0

The basic premise of expected goal differential is to assess how dangerous a team's shots are, and how dangerous its opponent's shots are. A team that gets a lot of dangerous shots inside the box, but doesn't give up such shots on defense, is probably doing something right, tactically or skill-wise, and is likely to be able to reproduce those results.

The challenge to creating expected goal differential (xGD), then, is to obtain data that measures the difficulty of each shot all season long. Our xGD 1.0 utilized six zones on the field to parse out the dangerous shots from those less so. Soon, we will create xGD 2.0 in which shots are not only sorted by location, but also by body part (head vs. foot) and by run of play (typical vs. free kick or penalty). Obviously kicked shots are more dangerous than headed shots, and penalty kicks are more dangerous than other shots from zone two, the location just behind the six-yard box.

So now, for the calculations.

Across the entire league, for all 8,291 shots taken in 2013, we calculate the proportion of shots from each zone that were finished (scored):

Location Goals Shots Finish%
One 129 415 31.1%
Two 451 2547 17.7%
Three 100 1401 7.1%
Four 85 1596 5.3%
Five 51 2190 2.3%
Six 5 142 3.5%

We see that shots from zones one and two are the most dangerous, while shots from farther out or from wider angles are less dangerous. To calculate a team's offensive "dangerousness," we count the number of shots each team attempted from each zone, and then multiply each total by the league's finishing rate. As an example, here we have Sporting Kansas City's offensive totals:

Locations Goals Attempts Finish% ExpGoals
One 5 18 31.1% 5.6
Two 29 160 17.7% 28.3
Three 5 78 7.1% 5.6
Four 3 97 5.3% 5.2
Five 2 120 2.3% 2.8
Six 1 17 3.5% 0.6
Total 45 490 9.2% 48.1

Offensively, if SKC had finished at the league average rate from each respective zone, then it would have scored about 48 goals. Now let's focus on SKC's defensive shot totals:

Locations Goals Attempts Finish% ExpGoals
One 4 13 31.1% 4.0
Two 17 95 17.7% 16.8
Three 4 54 7.1% 3.9
Four 4 56 5.3% 3.0
Five 1 84 2.3% 2.0
Six 0 4 3.5% 0.1
Total 30 306 9.8% 29.8

Defensively, had SKC allowed the league average finishing rate from each zone, it would have allowed about 30 goals (incidentally, that's exactly what it did allow, ignoring own goals).

Subtracting expected goals against from expected goals for, we get a team's expected goal differential. Expected goal differential works so well as a predictor because teams are more capable of repeating their ability to get good (or bad) shots for themselves, and allow good (or bad) shots to their opponents. An extreme game in which a team finishes a high percentage of shots won't sway that team's xGD, nor that of its opponents, making xGD a better indicator of "true talent" at the team level.
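
For anyone who wants to see the arithmetic laid out, here's a minimal sketch of that calculation in Python, using the league finishing rates and the SKC shot counts from the tables above (variable names are mine, not from our actual code):

```python
# League-average finishing rates by zone (2013), from the first table above.
finish_rate = {1: 0.311, 2: 0.177, 3: 0.071, 4: 0.053, 5: 0.023, 6: 0.035}

# Sporting Kansas City's 2013 shot attempts by zone, for and against.
skc_attempts_for     = {1: 18, 2: 160, 3: 78, 4: 97, 5: 120, 6: 17}
skc_attempts_against = {1: 13, 2: 95,  3: 54, 4: 56, 5: 84,  6: 4}

def expected_goals(attempts_by_zone, rates):
    """Weight each zone's attempts by the league finishing rate for that zone."""
    return sum(attempts_by_zone[zone] * rates[zone] for zone in rates)

xgf = expected_goals(skc_attempts_for, finish_rate)      # ~48 expected goals for
xga = expected_goals(skc_attempts_against, finish_rate)  # ~30 expected goals against
xgd = xgf - xga                                          # expected goal differential

print(f"xGF: {xgf:.1f}, xGA: {xga:.1f}, xGD: {xgd:+.1f}")
```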

As for xGD 2.0, coming soon to a laptop near you, the main difference is that there will be additional shot types to consider. Instead of just six zones, now there will be six zones broken down by headed and kicked shots (12 total zones) in addition to free kick---and possibly even penalty kick---opportunities (adding, at most, four more shot types). As with xGD 1.0, a team's attempts for each type of shot will be multiplied by the league's average finishing rates, and then those totals will be summed to find expected goals for and expected goals against.

Does last season matter? - Follow Up

I wrote a few weeks ago about the weak predictive information contained in a team's previous season of data. When trying to predict a team's goal differential in the second 17 games of a season, it was the first 17 games of that same season that did the job. The previous season's data was largely unhelpful. @sea_soc tweeted me the following:

https://twitter.com/sea_soc/status/406507942179905537

Ask, and you shall receive. Here's the weird shit I found when trying to project a season's second-half goal differential:

Stat Coef. P-Value
Intercept -33.6 0.86%
AttemptDiff (first 17) 0.1 0.00%
Finish Diff (first 17) 90.6 0.12%
Attempt Diff (first 17 last season) 0.1 2.88%
Attempt Diff (second 17 last season) 0.0 20.00%
Finish Diff (first 17 last season) 115.0 7.08%
Finish Diff (second 17 last season) -23.5 28.81%
Home Games Left 4.0 0.81%

Translation: Strangely, it's the first part of the previous season that is the better predictor of future performance. Not the second part of last season, which actually happened more recently. In fact, information from the second part of each team's previous season produced negative coefficients (negative relationships). Weird.

Now let's change the response variable slightly to be a team's goal differential from its first 17 games. Which does better at predicting, last season's first half or last season's second half?

Neither. In fact, there was nothing that came close to predicting the first halves of 2012 and 2013.

Stat Coef. P-value
Intercept 18.9 20.3%
Finish Diff (first 17 last season) -5.5 94.5%
Finish Diff (second 17 last season) 5.9 60.9%
Attempt Diff (first 17 last season) 0.01 26.6%
Attempt Diff (second 17 last season) 0.04 32.5%
Home Games (first 17 this season) -2.2 20.3%

With such small sample sizes, it could be there is just something really weird about the first halves, especially 2013. I say "especially 2013" because 2011 and 2012's first halves seemed to do a fair job of projecting the next season's second halves, so it's 2013 that seems screwy. Portland and Seattle performed opposite of what would have been expected for each, for example, while D.C. United and Montreal did the same confusing switcheroo in the Eastern Conference to kick off the 2013 campaign. So it could have just been weird randomness.

In the end, I'm quite certain of one thing, and that's that I'm still confused.

World Cup Draws: United States, Mexico, and the Netherlands

Of those three teams, it's the United States's draw that incites the least of my frustration.

Search for "world cup draw" on Google, and you'll find mostly opinions that the U.S. Men's National Team found itself in the group of death, as if there can only be one. But as many pointed out before the draw, the USMNT was not likely to get into an easier group. Coming from Pot 3, the USMNT was at a disadvantage already due to being in the weakest pot. Using ratings from Nate Silver's Soccer Power Index (SPI), here are the average ratings for each of the four pots:

Pot Rating Standard Dev.
1 85.9 5.0
4 79.7 3.3
2 76.2 8.0
3 73.7 3.5

Since teams from the same pot could not meet in the group stage, the USMNT couldn't draw any teams from its own pot. Thus it automatically got zero chance at playing some of the weaker teams in the opening round, leaving us praying for one of Switzerland or Belgium from the ranked Pot 1 to ease our path to glory (no such luck).  Additionally, all Pot 3 teams got a slightly higher chance of meeting two European teams in the group stages due to that additional UEFA team moving from Pot 4 to Pot 2. Pot 3 teams eluding a European team from Pot 1 may still have gotten Italy or England (I can't tell which one) from Pot 2. Costa Rica drew the short straw on that one.

If you look at Nate Silver's  ratings, you'll notice that most Pot 3 teams got pretty raw deals. Below are the chances that each team advances to the knockout round, as well as the average ratings for the other teams in their respective groups. Pot 3 teams are bold and italicized, and data came from Silver's own model.

Team Difficulty Knockout   Team Difficulty Knockout
Australia 86.6 2.0%   Italy 81.0 44.2%
Algeria 77.1 11.4%   Mexico 78.9 45.3%
Iran 81.8 18.9%   Ivory Coast 78.7 49.8%
Honduras 81.2 20.4%   Bosnia 79.3 52.6%
Cameroon 80.6 22.3%   England 80.3 57.5%
Japan 80.4 24.2%   Ecuador 78.2 64.7%
Costa Rica 82.2 28.8%   Uruguay 79.6 69.5%
Ghana 81.9 28.8%   Russia 71.6 72.6%
Nigeria 80.6 31.2%   Chile 79.9 74.3%
Croatia 79.7 32.9%   France 77.3 78.4%
Switzerland 79.7 36.5%   Belgium 71.1 79.1%
South Korea 73.8 36.9%   Spain 79.4 82.8%
United States 81.2 39.3%   Colombia 76.2 86.5%
Portugal 81.1 39.3%   Germany 78.0 91.8%
Greece 79.3 39.5%   Argentina 75.6 97.3%
Netherlands 81.3 41.0%   Brazil 73.9 99.6%

Relative to its stature in the world---17th best according to the SPI---the United States drew arguably the second-hardest group of opponents, second only to the Netherlands*. Though the USMNT may be in a group of death, the Netherlands are definitely in the group of death---and on the outside looking in. But it's our neighbor to the south that draws the most frustration. In terms of average group difficulty, the only North American side to get a relatively decent draw was Mexico. Mexico will just have to be better than Croatia and Cameroon in the group stage. Even after pissing all over themselves in CONCACAF qualifying, the Mexicans now have the easiest path of any Pot 3 team.

The Dutch side is the ninth-best in the tournament by the SPI, and yet it drew two of the best teams in the Cup, Chile and Spain. The Oranje, the team of my birth country, have been left sadly with just a 41-percent chance at making the knockout stage. The Mexican side is ranked 26th in the world, finished fourth in qualifying, and has a better chance to advance than the Netherlands.

Oh, FIFA.

*While Australia, Iran and Costa Rica all drew harder opponents on average than the USMNT, they were not as highly ranked themselves as the USMNT. In other words, it was expected that worse teams would get tougher opponents because they don't get to play themselves.

Sporting exceptional at home; RSL lame on the road

It is true that Sporting has had trouble getting points at home. SKC earned 30 points at Sporting Park this year, good for 13th in a league of 19 teams. Based on that information alone, some will argue that Sporting is not a good home team. One of those people is Simon Borg, who justifies his viewpoint by pointing out that SKC lost five times at home, as though that matters. It doesn't.

I've shown that past points simply don't correlate well to future points. With information like shot ratios and expected goal differentials (xGD) available, points are essentially a meaningless indicator of team ability---or at the very least, a meaningless predictor. I see "predictor" and "indicator" as near-synonyms in this instance, but you may not. Regardless, Sporting's home points total should not even be considered in the discussion of who will win on Saturday. Why not? In addition to out-shooting its opponents in every single home game this season, here is how SKC did relative to the league in xGD at home this season:

Team GF GA GD xGF xGA xGD "Luck"
LA 32 8 24 32.1 11.0 21.1 2.9
SKC 29 15 14 28.7 11.6 17.1 -3.1
PHI 23 17 6 28.8 16.0 12.8 -6.8
NYRB 32 15 17 26.8 15.4 11.4 5.6
SEA 28 15 13 26.5 15.6 10.9 2.1
COL 28 16 12 25.3 14.5 10.7 1.3
HOU 23 16 7 28.0 18.9 9.1 -2.1
RSL 31 16 15 24.6 16.4 8.2 6.8
CHI 28 19 9 27.0 19.4 7.6 1.4
SJ 23 13 10 28.8 21.2 7.6 2.4
POR 28 11 17 23.9 16.6 7.2 9.8
CLB 19 13 6 25.6 18.8 6.8 -0.8
NE 29 15 14 22.2 17.4 4.8 9.2
MTL 31 19 12 25.1 20.5 4.6 7.4
FCD 28 21 7 24.2 19.8 4.4 2.6
VAN 32 18 14 23.7 19.5 4.2 9.8
DCU 16 27 -11 23.5 21.3 2.2 -13.2
TOR 22 21 1 18.5 19.1 -0.6 1.6
CHV 16 28 -12 18.9 26.0 -7.1 -4.9

SKC has a decent goal differential at home, but more importantly, it has the second-best expected goal differential at home. xGD is an excellent predictor of future success, and a better indication in my mind of true team skill.

Borg goes on to talk about the "road warriors" from Salt Lake City:

"They love playing on the road. Playing at home is too much pressure; they do it better when they're away from home."

No team is better on the road than at home, but whatever. RSL did tie for third in MLS this season with 22 away points earned, but again, we don't care. RSL out-shot its opponents in just five of 17 road games (29.4%), and, well, this:

Team GF GA GD xGF xGA xGD "Luck"
SKC 16 15 1 19.3 18.2 1.1 -0.1
SJ 11 29 -18 20.5 21.3 -0.8 -17.2
LA 20 30 -10 18.5 20.2 -1.7 -8.3
FCD 18 28 -10 19.9 23.6 -3.7 -6.3
HOU 17 23 -6 21.7 25.4 -3.8 -2.2
POR 25 22 3 18.9 23.5 -4.6 7.6
COL 15 22 -7 19.9 24.9 -5.1 -1.9
NYRB 24 24 0 19.5 25.6 -6.1 6.1
PHI 19 26 -7 19.4 26.7 -7.3 0.3
NE 19 21 -2 16.1 23.7 -7.6 5.6
CLB 22 33 -11 17.1 26.0 -8.9 -2.1
SEA 11 27 -16 17.9 27.2 -9.4 -6.6
CHI 18 30 -12 20.5 30.0 -9.5 -2.5
MTL 19 29 -10 16.1 26.0 -9.9 -0.1
VAN 21 23 -2 16.6 27.7 -11.0 9.0
TOR 6 25 -19 15.9 27.3 -11.4 -7.6
RSL 25 25 0 17.4 29.7 -12.3 12.3
DCU 5 28 -23 11.9 26.2 -14.3 -8.7
CHV 12 38 -26 12.0 28.9 -16.9 -9.1

Real Salt Lake finished 17th in the league in expected goal differential on the road. Ouch. The fact that their actual goal differential was tied for third in MLS means very little, since xGD makes for a much better Nostradamus.

Unless expected goal differential completely falls apart in home-away splits---which is not likely---we can conclude that Sporting is a good home team, and RSL is a bad away team.

Our current model gives Sporting a 72 percent probability of a win. An xGD model---which we don't use yet because we only have one season of data---increases those chances to 88 percent. There is a lot of evidence that Sporting is the better team, and that home-field advantage still applies to them. Regardless of Saturday's outcome, those two statements are still well supported.

*Note that these goal statistics do not include own goals, which is why my figures may differ slightly from those found at other sites. 

Does last season matter?

We've shown time and time again how helpful a team's shot rates are in projecting how well that team is likely to do going forward. To this point, however, data has always been contained in-season, ignoring what teams did in past seasons. Since most teams keep large percentages of their personnel, it's worth looking into the predictive power of last season. We don't currently have shot locations for previous seasons, but we do have general shot data going back to 2011. This means that I can look at all the 2012 and 2013 teams, and how important their 2011 and 2012 seasons were, respectively. Here goes.

First, I split each of the 2012 and 2013 seasons into two halves, calculating stats from each half. Let's start by leaving out the previous season's data. Here is the predictive power of shot rates and finishing rates, where the response variable is second-half goal differential.

Stat Coefficient P-value
Intercept -28.36792 0.04%
Attempt Diff (first 17) 0.14244 0.00%
Finishing Diff (first 17) 77.06047 1.18%
Home Remaining 3.37472 0.03%

To summarize, I used total shot attempt differential and finishing rate differential from the first 17 games to predict the goal differential for each team in the final 17 games. Also, I controlled for how many home games each team had remaining. The sample size here is the 56 team-seasons from 2011 through 2013. All three variables are significant in the model, though the individual slopes should be interpreted carefully.*

The residual standard error for this model is high at 6.4 goals of differential. Soccer is random, and predicting exact goal differentials is impossible, but that doesn't mean this regression is worthless. The R-squared value is 0.574, though as James Grayson has pointed out to me, the square root of that figure (0.757) makes more intuitive sense. One might say that we are capable of explaining 57.4 percent of the variance in second-half goal differentials, or 75.7 percent of the standard deviation (sort of). Either way, we're explaining something, and that's cool.
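
For the curious, here's a minimal sketch of how a regression like that can be fit in Python with statsmodels; the data frame and column names are hypothetical placeholders, not our actual data set:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per team-season (2011-2013), with first-half
# attempt differential, first-half finishing-rate differential, home games
# remaining, and the second-half goal differential we want to predict.
df = pd.read_csv("team_halves.csv")  # placeholder file name

model = smf.ols(
    "gd_second_half ~ attempt_diff_first17 + finishing_diff_first17 + home_remaining",
    data=df,
).fit()

print(model.summary())  # coefficients, p-values, R-squared, residual std. error
```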

But we're here to talk about the effects of last season, so without further mumbo jumbo, the results of a more-involved linear regression:

Stat Coefficient P-value
Intercept -31.3994 1.59%
Attempt Diff (first 17) 0.12426 0.03%
Attempt Diff (last season) 0.02144 28.03%
Finishing Diff (first 17) 93.27359 1.14%
Finishing Diff (last season) 72.69412 12.09%
Home Remaining 3.71992 1.53%

Now we've added teams' shot and finishing differentials from the previous season. Obviously, I had to cut out the 2011 data (since 2010 is not available to me currently), as well as Montreal's 2012 season (since they made no Impact in 2011**). This left me with a sample size of 37 teams. Though the residual standard error was a little higher at 6.6 goals, the regression now explained 65.2 percent of the variance in second-half goal differential. Larger sample sizes would be nice, and I'll work on that, but for now it seems that---even halfway through a season---the previous season's data may improve the projection, especially when it comes to finishing rates.

But what about projecting outcomes for, say, a team's fourth game of the season? Using its rates from just three games of the current season would lead to shaky projections at best. I theorize that, as a season progresses, the current season's data get more and more important for the prediction, while the previous season's data become relatively less important.

My results were most assuredly inconclusive, but leaned in a rather strange direction. The previous season's shot data was seemingly more helpful in predicting outcomes during the second half of the season than it was in the first half---except, of course, the first few weeks of the season. Specifically, the previous season's shot data was more helpful for predicting games from weeks 21 to 35 than  it was from weeks 6 to 20. This was true for finishing rates, as well, and led me to recheck my data. The data was errorless, and now I'm left to explain why information from a team's previous season helps project game outcomes in the second half of the current season better than the first half.

Anybody want to take a look? Here are the results of some logistic regression models. Note that the coefficients represent the estimated change in (natural) log odds of a home victory.

Weeks 6 - 20 Coefficient P-value
Intercept 0.052 67.36%
Home Shot Diff 0.139 0.35%
H Shot Diff (previous) -0.073 29.30%
Away Shot Diff -0.079 7.61%
A Shot Diff (previous) -0.052 47.09%

Weeks 21 - 35 Coefficient P-value
Intercept 0.036 78.94%
Home Shot Diff 0.087 19.37%
H Shot Diff (previous) 0.181 6.01%
Away Shot Diff -0.096 15.78%
A Shot Diff (previous) -0.181 4.85%

Later on in the season, during weeks 21 to 35, the previous season's data actually appears to become more important to the prediction than the current season's data---both in statistical significance and actual significance. This despite the current season's shot data being based on an ample sample of at least 19 games (depending on the specific match in the data set). So I guess I'm comfortable saying that last season matters, but I'm still confused---a condition I face daily.
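
To make the log-odds interpretation a little more concrete, here's a small sketch that plugs the weeks 21-35 coefficients above into the inverse-logit formula; the shot differentials fed in are made up purely for illustration:

```python
import math

# Coefficients from the weeks 21-35 logistic model above.
coef = {
    "intercept": 0.036,
    "home_shot_diff": 0.087,
    "home_shot_diff_prev": 0.181,
    "away_shot_diff": -0.096,
    "away_shot_diff_prev": -0.181,
}

# Hypothetical inputs: shot differentials for the home and away sides,
# this season and last season (illustrative values only).
x = {
    "home_shot_diff": 3.0,
    "home_shot_diff_prev": 2.0,
    "away_shot_diff": -1.0,
    "away_shot_diff_prev": 0.5,
}

log_odds = coef["intercept"] + sum(coef[k] * x[k] for k in x)
p_home_win = 1 / (1 + math.exp(-log_odds))  # inverse logit
print(f"Estimated home-win probability: {p_home_win:.1%}")
```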

*The model suggests that each additional home game remaining projects a three-goal improvement in differential (3.37, actually). In a vacuum, that makes no sense. However, we are not vacuuming. Teams that have more home games remaining have also played a tougher schedule. Thus the +3.37 coefficient for each additional home game remaining is also adjusting the projection for teams whose shot rates are suffering due to playing on the road more frequently.

**Drew hates me right now.

What Piquionne's goal means to Portland

Though our game states data set doesn't yet include all of 2013, it still includes 137 games. In those 137 games, only five home teams ever went down three goals, and all five teams lost. There were 24 games in which the home team went down two goals, with only one winner (4.2%) and five ties (20.8%). The sample of two-goal games perhaps gives a little hope to the Timbers, but these small sample sizes lend themselves to large margins of error. It is also important to note that teams that go down two goals at home tend to be bad teams---like Chivas USA, which litters that particular data set. None of the five teams that ever went down three goals at home made the playoffs this year. Only seven of the 24 teams to go down two goals at home made it to the playoffs. Portland is a good team. Depending on your model of preference, the Timbers are somewhere in the top eight. So even if those probabilities up there hypothetically had small margins of error, they still wouldn't necessarily apply to the Timbers.

Oh, and while we're talking about extra variables, in those games the teams had less time to come back. To work around these confounding variables, I consulted a couple models, and I controlled for team ability using our expected goal differential. Here's what I found.

A logistic model suggests that, for each goal of deficit early in a match, the odds of winning are reduced by a factor of about two or three. A tie, though, would also allow Portland to play on. A home team's chances of winning or tying fall from about 75 percent in a typical game that begins zero-zero, to about 25 percent when down two goals. Down three goals, and that probability plummets to less than 10 percent. But using this particular logistic regression was dangerous, as I was forced to extrapolate for situations that never happen during the regular season---starting a game from behind.
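
As a rough illustration of that odds math (a sketch using the approximate numbers above, not the model's actual output):

```python
# Baseline: a home team wins or ties roughly 75% of the time from 0-0.
p_baseline = 0.75
odds_baseline = p_baseline / (1 - p_baseline)  # 3-to-1 odds

# The logistic model suggests each goal of deficit cuts those odds by a
# factor of roughly two to three; use three here for illustration.
odds_factor_per_goal = 3.0

for deficit in (1, 2, 3):
    odds = odds_baseline / (odds_factor_per_goal ** deficit)
    prob = odds / (1 + odds)
    print(f"Down {deficit}: roughly {prob:.0%} chance to win or tie")
```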

So I went to a linear model. The linear model expects Portland to win by about 0.4 goals. 15.5 percent of home teams in our model were able to perform at least 1.6 goals above expectation, what the Timbers would need to at least force a draw in regulation. Only 4.6 percent of teams performed 2.6 goals above expectation. If we just compromise between what the two models are telling us, then the Timbers probably have about a 20-percent chance to pull off a draw in regulation. That probability would have been closer to five percent had Piquionne not finished a beautiful header in stoppage time.

The Predictive Power of Shot Locations Data

Two articles in particular inspired me this past week---one by Steve Fenn at the Shin Guardian, and the other by Mark Taylor at The Power of Goals. Steve showed us that, during the 2013 season, the expected goal differentials (xGD) derived from the shot locations data were better than any other statistics available at predicting outcomes in the second half of the season. It can be argued that statistics that are predictive are also stable, indicating underlying skill rather than luck or randomness. Mark came along and showed that the individual zones themselves behave differently. For example, Mark's analysis suggested that conversion rates (goal scoring rates) are more skill-driven in zones one, two, and three, but more luck-driven or random in zones four, five, and six. Piecing these fine analyses together, there is reason to believe that a partially regressed version of xGD may be the most predictive. The xGD currently presented on the site regresses all teams fully back to league-average finishing rates. However, one might guess that finishing rates in certain zones may be more skill-driven, and thus predictive. Essentially, we may be losing important information by fully regressing finishing rates to league average within each zone.

I assessed the predictive power of finishing rates within each zone by splitting the season into two halves, and then looking at the correlation between finishing rates in each half for each team. The chart is below:

Zone Correlation P-value
1 0.11 65.6%
2 0.26 28.0%
3 -0.08 74.6%
4 -0.41 8.2%
5 -0.33 17.3%
6 -0.14 58.5%

Wow. This surprised me when I saw it. There are no statistically significant correlations---especially when the issue of multiple testing is considered---and some of the suggested correlations are actually negative. Without more seasons of data (they're coming, I promise), my best guess is that finishing rates within each zone are pretty much randomly driven in MLS over 17 games. Thus full regression might be the best way to go in the first half of the season. But just in case...

I grouped zones one, two, and three into the "close-to-the-goal" group, and zones four, five, and six into the "far-from-the-goal" group. The results:

Zone Correlation P-value
Close 0.23 34.5%
Far -0.47 4.1%

Okay, well this is interesting. Yes, the multiple testing problem still exists, but let's assume for a second there actually is a moderate negative correlation for finishing rates in the "far zone." Maybe the scouting report gets out by mid-season, and defenses close out faster on good shooters from distance? Or something else? Or this is all a type-I error---I'm still skeptical of that negative correlation.

Without doing that whole song and dance for finishing rates against, I will say that the results were similar. So full regression on finishing rates for now, more research with more data later!
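
For anyone who wants to replicate the split-half check described above, here's a minimal sketch in Python; the data frame and column names are placeholders for illustration, not our actual data set:

```python
import pandas as pd
from scipy import stats

# Hypothetical data: one row per team, with first-half and second-half
# finishing rates for each of the six shot-location zones.
df = pd.read_csv("finishing_rates_by_half.csv")  # placeholder file name

for zone in range(1, 7):
    first_half = df[f"zone{zone}_finish_first17"]
    second_half = df[f"zone{zone}_finish_second17"]
    r, p = stats.pearsonr(first_half, second_half)
    print(f"Zone {zone}: correlation {r:+.2f} (p = {p:.1%})")
```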

But now, piggybacking onto what Mark found, there do seem to be skill-based differences in how many total goals are scored by zone. In other words, some teams are designed to thrive off of a few chances from higher-scoring zones, while others perhaps are more willing to go for quantity over quality. The last thing I want to check is whether or not the expected goal differentials separated by zone contain more predictive information than when lumped together.

Like some of Mark's work implied, I found that our expected goal differentials inside the box are very predictive of a team's actual second-half goal differentials inside the box---the correlation coefficient was 0.672, better than simple goal differential which registered a correlation of 0.546. This means that perhaps the expected goal differentials from zones one, two, and three should get more weight in a prediction formula. Additionally, having a better goal differential outside the box, specifically in zones five and six, is probably not a good thing. That would just mean that a team is taking too many shots from poor scoring zones. In the end, I went with a model that used attempt difference from each zone, and here's the best model I found.*

Zone Coefficient P-value
(Intercept) -0.61 0.98
Zones 1, 3, 4 1.66 0.29
Zone 2 6.35 0.01
Zones 5, 6 -1.11 0.41

*Extremely similar results to using expected goal differential, since xGD within each zone is a linear function of attempts.

The R-squared for this model was 0.708, beating out the model that just used overall expected goal differential (0.650). The zone that stabilized fastest was zone two, which makes sense since about a third of all attempts come from zone two. Bigger sample sizes help with stabilization. For those curious, the inputs here were attempt differences per game over the first seventeen games, and the response output is predicted total goal differential in the second half of the season.
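
As a worked example, here's what a prediction from that model looks like using the coefficients in the table above; the attempt differences plugged in belong to a made-up team, purely for illustration:

```python
# Coefficients from the zone-grouped model above.
intercept = -0.61
coef_zones_134 = 1.66  # per-game attempt difference, zones 1, 3, and 4 combined
coef_zone_2 = 6.35     # per-game attempt difference, zone 2
coef_zones_56 = -1.11  # per-game attempt difference, zones 5 and 6 combined

# Hypothetical first-half attempt differences per game for an illustrative team.
diff_zones_134 = 1.2
diff_zone_2 = 0.8
diff_zones_56 = -0.5

predicted_second_half_gd = (
    intercept
    + coef_zones_134 * diff_zones_134
    + coef_zone_2 * diff_zone_2
    + coef_zones_56 * diff_zones_56
)
print(f"Predicted second-half goal differential: {predicted_second_half_gd:+.1f}")
```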

Not that this research closes the door on anything, but I would suggest that each zone contains unique information, and separating those zones out some could strengthen predictions by a measurable amount. I would also suggest that breaking shots down by angle and distance, and then kicked and headed, would be even better. We all have our fantasies.