By Kevin Shank (@shankapotomus96)
In my Sports Analytics class at Saint Joseph's University, my professor would always stress the importance of having a valid data source: “Put garbage in, get garbage out,” he would tell the class. If the data is biased, isn’t random, or is miscalculated, then any resulting conclusion is not credible. For an analytic method to be sound, it is imperative that the data source is not “garbage.” For the course’s final project, I chose to analyze players’ cost efficiency and to use binary integer programming to build an optimal lineup. Ironically enough, I decided to have my data source be none other than the Audi Player Index.
I was well aware of the mystery behind the Audi Player Index (API), but the ambiguity made it all the more enticing to study. Getting my hands on the data was the tricky part, however, since no one at MLS responded to my requests to see the API results. That left me to do it the hard way: manually pulling the data out of the HTML of each of the 340 games on MLS’s Match Center.
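The extraction step can be sketched with Python's standard-library HTML parser. The real Match Center markup is not public, so the element structure, class name, and attribute below are invented for illustration; only the technique (walking the HTML and collecting per-player scores) reflects the approach described.

```python
# A minimal sketch of pulling API scores out of a match page, assuming a
# hypothetical markup like <span class="player-api" data-player="Name">Score</span>.
# The class and attribute names are invented; the real page structure differs.
from html.parser import HTMLParser

class APIScoreParser(HTMLParser):
    """Collects {player: score} pairs from span.player-api elements."""
    def __init__(self):
        super().__init__()
        self.scores = {}
        self._current = None  # player name of the span we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "player-api":
            self._current = attrs.get("data-player")

    def handle_data(self, data):
        if self._current is not None:
            self.scores[self._current] = int(data)
            self._current = None

# Fabricated fragment standing in for one game's downloaded HTML:
html = '<div><span class="player-api" data-player="S. Giovinco">1351</span></div>'
parser = APIScoreParser()
parser.feed(html)
print(parser.scores)  # {'S. Giovinco': 1351}
```

Repeating this over all 340 saved match pages and averaging per player would produce the season-long dataset the article describes.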
Collecting all this data left me with the average API for each player and team, and the results seemed to make sense at first glance. It is reasonable that Sebastian Giovinco, with 17 goals and 15 assists, finished as the top performer for the season with an API of 997. Under MLS’s constraints for the API awards of at least 24 games and 1,530 minutes played, the entire Top 10 pass the “eye test” for the best players in the league, with each contributing at least 18 combined goals and assists. Looking at teams’ API versus their actual points, Colorado and Columbus are the notable outliers, yet the rest of the data seems reasonable. Removing those two outliers yields a fairly strong correlation coefficient of .8 between API and points, meaning that as a team’s API increases, so do its points, with little variance (.49 with the Rapids and Crew in the dataset – a moderate correlation with more variance). Furthermore, the API of playoff teams bested their non-playoff counterparts by an average of 235. By these results, the API seems sensible.
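The effect of those two outliers on the correlation can be illustrated with a toy example. The numbers below are made up, not the real team data; they only show mechanically how two off-trend points drag a Pearson coefficient down the way Colorado and Columbus did.

```python
# Toy illustration (made-up numbers): two outliers weaken an otherwise
# strong API-vs-points correlation, as the Rapids and Crew did in the article.
import numpy as np

api    = np.array([500, 550, 600, 650, 700, 750, 800, 850])  # hypothetical team API
points = np.array([ 30,  34,  38,  41,  45,  48,  52,  55])  # hypothetical points

r_clean = np.corrcoef(api, points)[0, 1]

# Append two off-trend teams playing the role of Colorado and Columbus:
api_all    = np.append(api,    [850, 500])
points_all = np.append(points, [ 30,  55])
r_all = np.corrcoef(api_all, points_all)[0, 1]

print(round(r_clean, 2), round(r_all, 2))  # the outliers noticeably weaken r
```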
See below for a player API dashboard. Team API dashboard is available here.
However, the API’s looks are quite deceiving. Sometime last week, MLS released the average API scores on their site, and I cross-referenced their data with the data I had collected from the Match Center. In the MLS data, Vancouver’s Paolo Tornaghi was ranked 3rd with a score of 967, but in my dataset he was 90th with a score of 371. If both datasets were based on the same API results, how could there be such a discrepancy? I looked into it: while Tornaghi played only one game, he showed up twice in my game-by-game data. On April 9th, in the Whitecaps’ 4-0 loss to DC United, Tornaghi got a score of -225 for conceding three goals inside the box – but there is a problem. He did not even play in that game; in fact, there were eight different times this season when the API awarded scores to unused subs, some of whom did not log a single minute all season. These phantom scores also affected the team totals; for example, the Whitecaps’ API score against DC was 2425, which included Tornaghi’s erroneous score. There were also three players whose averages in MLS’s release omitted games where they played and scored 0, making their average API higher than the corresponding figures in my dataset. With MLS officially releasing the regular season API scores, there should not be these discrepancies between MLS’s API page and the Match Center. And if there were eight known cases where unused subs got an API score, who knows how many players who did play were also affected by this random allocation?
Even setting aside these flaws in the results, I have not yet mentioned the reason so many are skeptical of the API: no one knows how it works. With no knowledge of the algorithm, the worth of a goal, pass, save, or any other stat is unknown. Sure, MLS’s site gives some insight, stating that an aerial won by a forward is worth 10 points, but it also says that the API is calculated with an “appreciation/depreciation” based on technique, dynamics, and skill. That language suggests Forward A can get 10 points for an aerial won while Forward B gets even more points just because he did it better. The API starts to look more like a subjective measure than an objective one – and after collecting player actions alongside their API scores, I have reason to believe it is. The MLS Match Center lets users look at the player actions that contribute to the API, such as the number of shots inside the box, successful passes in the final third, or tackles won.
Using the same technique for retrieving the players’ and teams’ API scores, I was able to use the page’s CSS code to get a breakdown of a player’s score. Since there are 86 different components of the API, an equation can be formed to represent a player’s performance; for Giovinco’s game, it would be the equation below, where each variable is a given player action, each coefficient is the number of times that action occurred, and 1351 is his game’s API score.
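In code, one player-game becomes a single linear equation: the action counts are known coefficients, the per-action point values are the unknowns, and the game's API score is the right-hand side. The action names and counts below are invented (the real API has 86 components); the 1351 is Giovinco's game score from the article.

```python
# Sketch: one player-game as a linear equation. Action names and counts are
# hypothetical; each game gives sum(count_i * weight_i) = game_api_score,
# where the weights (point values per action) are the unknowns.
actions = {"shots_in_box": 4, "key_passes": 3, "aerials_won": 2}  # invented counts
game_score = 1351  # Giovinco's game API score

# Fix a variable ordering so every game produces a comparable coefficient row:
order = sorted(actions)
row = [actions[a] for a in order]
print(order, row, "-> unknown weights w satisfy row @ w =", game_score)
```

Stacking one such row per game yields the matrix system described in the next paragraph.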
Since there are different scoring metrics for various positions, I chose to analyze goalkeepers because their range of variables is limited mainly to saves rather than a mix of defensive and offensive stats. I collected 45 different goalkeeper equations with 43 variables in total. Using linear algebra, I put these equations into matrix form so I could solve for every variable simultaneously. If each action were always rated the same (i.e., an aerial won is always 10 points), then this system of equations should come out consistently, but my findings were anything but logical. The chart shows some of the player-action values I solved for, and it clearly shows that when I put garbage in, I got garbage out. Positive actions like successful passes received large negative values and negative actions like conceding goals received positive values, making it apparent that the data source is flawed.
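The matrix approach can be sketched on fabricated data. Here `A` holds action counts (games × actions) and `b` the game scores; the true weights are invented stand-ins (10 for an aerial won, 45 for a key pass, -75 for a goal conceded). When the scoring is internally consistent, least squares recovers the weights exactly; the article's point is that the real goalkeeper data did not behave this way.

```python
# Sketch of solving the stacked game equations with least squares.
# All numbers are hypothetical; a consistent scoring system is simulated
# by generating b directly from known weights.
import numpy as np

true_weights = np.array([10.0, 45.0, -75.0])  # aerial won, key pass, goal conceded (invented)
A = np.array([
    [2, 1, 0],
    [0, 3, 1],
    [1, 0, 2],
    [3, 2, 1],
], dtype=float)            # counts of each action in four fabricated games
b = A @ true_weights       # the scores a consistent system would produce

weights, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(np.round(weights, 1))  # recovers [10., 45., -75.] for this consistent system
```

With the real data, the recovered values were nonsensical (negative values for passes, positive for goals conceded), which is what signals the inconsistency.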
Granted, the Match Center does state that the “computation of the Audi Player Index score is proprietary information and as such, some scoring metrics are not listed.” Since MLS and Audi are hiding data from us, I decided to perform a series of regressions, removing statistically insignificant data in each iteration. The yellow player actions mark statistically significant events (p-value < .05). These results are imperfect because there is influential data that MLS and Audi are keeping to themselves, but the coefficients can provide a rough estimate of what the modifiers really are. The significant player actions make sense in that positive actions are awarded points while negative actions deduct them, with the exception of a ball-collected save being worth -55 API points. The regression puts a goal conceded from inside the box at -73, which is reasonable considering that Tornaghi scored -225 for “conceding” three goals inside the box (-75 per goal). The regressions may have closed in on some of the API’s rating system, but without data for all 86 components, it cannot be solved.
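The iterative procedure is backward elimination: fit, drop the least significant action, refit, until everything left clears p < .05. A sketch on synthetic data (the real 86-component matrix is not public, so the action names, counts, and weights below are fabricated):

```python
# Backward elimination sketch: repeatedly drop the highest-p-value action
# until all remaining actions are significant at alpha. Data is synthetic.
import numpy as np
from scipy import stats

def backward_eliminate(X, y, names, alpha=0.05):
    X, names = X.copy(), list(names)
    while X.shape[1] > 0:
        n, k = X.shape
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        dof = n - k
        sigma2 = resid @ resid / dof
        se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
        p = 2 * stats.t.sf(np.abs(coef / se), dof)
        worst = int(np.argmax(p))
        if p[worst] < alpha:                 # everything remaining is significant
            return dict(zip(names, coef)), dict(zip(names, p))
        X = np.delete(X, worst, axis=1)      # drop least significant action
        names.pop(worst)
    return {}, {}

# Fabricated example: two actions that matter, one that is pure noise.
rng = np.random.default_rng(0)
X = rng.poisson(3, size=(60, 3)).astype(float)
y = 35 * X[:, 0] - 73 * X[:, 1] + rng.normal(0, 5, 60)  # third column irrelevant
coefs, pvals = backward_eliminate(X, y, ["key_pass", "goal_conceded", "noise"])
print({k: round(v, 1) for k, v in coefs.items()})
```

On this simulated data the fitted coefficients land near the true 35 and -73; on the real data, the surviving coefficients are only a rough estimate because of the hidden components.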
[Table: significant player actions with their coefficient, standard error, t-statistic, and p-value.]
Another area where the API fails is the difference in scores between positions. To determine whether player averages differ by position, I performed an ANOVA. The null hypothesis is that the average API scores for each position are equal; the alternative hypothesis is that they are not. The test yielded a p-value of 7.25E-08, which is less than .05; therefore, the null hypothesis is rejected, showing statistically significant evidence that the average API scores differ by position. This highlights another flaw in the API, likely caused by the different scoring metrics for each position. MLS and Audi reveal that a key pass for a forward is worth 35 points, yet a key pass for a midfielder is worth 45 points. While it is understandable to vary the scoring system between positions, the API’s scoring variances cause a noticeable imbalance in which the average goalkeeper outperforms the average defender by 74.5%. The vast differences between positions make comparing players across positions equivalent to comparing apples and oranges. For the API to be credible, there should be no statistically significant differences between the groups, so that comparing a defender to a forward is feasible.
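The same one-way ANOVA can be run with scipy. The per-position averages below are fabricated for illustration (they only echo the pattern the article reports, with goalkeepers well above defenders), not the real dataset.

```python
# One-way ANOVA sketch on made-up average API scores per position.
# A small p-value rejects the hypothesis that all position means are equal.
from scipy import stats

goalkeepers = [610, 640, 655, 620, 600]   # hypothetical average API scores
defenders   = [350, 365, 340, 380, 355]
midfielders = [480, 500, 470, 510, 495]
forwards    = [530, 555, 540, 560, 525]

f_stat, p_value = stats.f_oneway(goalkeepers, defenders, midfielders, forwards)
print(f"F = {f_stat:.1f}, p = {p_value:.2e}")  # p well below .05: means differ
```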
Audi and MLS can fix all of this confusion by shedding some light on what exactly the API is. Without knowledge of what the API actually scores, it will be just a number to fans and analysts rather than an accurate way to gauge player performance. It is a shame that the API is not a proper stat, because building my binary integer programming model to construct lineups was pretty fun (free pdf here). I wanted to expand the model to account for the complexity of salary cap rules, like using DPs, Targeted Allocation Money, or Generation Adidas players; however, concerns over the reliability of the API discourage its use in sports analytics, which is a detriment to the field. Until MLS or Audi releases what the API is made of, any subsequent research will have little credibility, because putting garbage in will get garbage out.
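The lineup model's core idea can be shown in miniature. The article's actual model was binary integer programming; here a brute-force search over a tiny invented player pool illustrates the same objective (maximize total API) and constraint (stay under a salary cap). All names, salaries, API averages, the cap, and the squad size are made up.

```python
# Toy version of the lineup-selection problem: pick SQUAD players maximizing
# total API subject to a salary cap. Brute force stands in for the binary
# integer program; all data is hypothetical.
from itertools import combinations

players = [  # (name, avg_api, salary in $k) - all values invented
    ("A", 900, 700), ("B", 820, 500), ("C", 760, 300),
    ("D", 700, 250), ("E", 650, 200), ("F", 500, 100),
]
CAP, SQUAD = 1200, 3  # hypothetical cap and lineup size

best = max(
    (c for c in combinations(players, SQUAD) if sum(p[2] for p in c) <= CAP),
    key=lambda c: sum(p[1] for p in c),
)
print([p[0] for p in best], "API:", sum(p[1] for p in best))  # ['A', 'C', 'E'] API: 2310
```

A real solver (the binary integer program) scales this to full rosters and can encode extra constraints like DP slots or allocation money as additional linear restrictions on the selection variables.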