Shots in the Dark: how data providers tell us different versions of what happened
Recently, this tweet created a small firestorm in the soccer analytics community. Whatever the source of the error, it was pretty clear that there weren’t 1,300 passes and 50 shots in an English League 2 match. This led to responses from prominent analysts such as StatsBomb’s Ted Knutson (including on his podcast [starts at 10:45]), Opta’s Tom Worville (an ASA alum) and Ryan Bahia, and Chris Anderson, author of The Numbers Game. All of them said pretty much the same thing: question the data you are using. If the data you use to analyze a problem is not valid, then your solutions won’t be either.
So what do we know about the data that is used for soccer analysis? Previous studies have shown that people are pretty good at agreeing about what type of event occurred in a soccer game (e.g. shots, tackles). But as far as I can tell, the accuracy and precision of event locations among the various data providers have not been studied. As Joe Mulberry pointed out when looking at the troubling inconsistencies between spatial tracking data and event data, small differences in locations can have big effects on downstream analysis, including expected goals (xG) models. In other words, small inconsistencies in how data is tracked can have big consequences for the models built off that data. So what are the differences between how soccer data providers collect and report their data?
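To make the point about location sensitivity concrete, here is a rough sketch of how a two-metre disagreement about where a shot was taken changes the geometric inputs that xG models commonly rely on. Everything in it is an assumption for illustration: the 105 x 68 m pitch, the coordinate system, the hypothetical `shot_features` helper, and the two made-up shot locations. It is not any provider's actual data or model.

```python
import math

# Assumed coordinate system: a 105 x 68 m pitch, attacking the goal at x = 105.
GOAL_X, GOAL_Y = 105.0, 34.0   # assumed centre of the goal line
GOAL_WIDTH = 7.32              # metres between the posts

def shot_features(x, y):
    """Return (distance to goal centre in m, angle between the posts in degrees)."""
    dx = GOAL_X - x
    distance = math.hypot(dx, y - GOAL_Y)
    # Angle subtended by the two goalposts as seen from the shot location.
    to_far_post = math.atan2(GOAL_Y + GOAL_WIDTH / 2 - y, dx)
    to_near_post = math.atan2(GOAL_Y - GOAL_WIDTH / 2 - y, dx)
    return distance, math.degrees(to_far_post - to_near_post)

# The same shot as two hypothetical providers might record it.
provider_a = shot_features(97.0, 34.0)
provider_b = shot_features(95.0, 36.0)

print("Provider A: %.1f m, %.1f degrees" % provider_a)
print("Provider B: %.1f m, %.1f degrees" % provider_b)
```

Distance and goal angle are only two of the features a real xG model might use, but the sketch shows how a small disagreement in recorded location moves both of them before any model is even fit.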