The DePo Models: Bringing Moneyball to Professional Soccer

By Sam Goldberg & Mike Imburgio

With a worldwide pandemic affecting soccer operations for nearly every club in the world, it is more critical than ever that teams are financially responsible when signing new players and negotiating new contracts. Despite the current world climate, players still need to be purchased and sold, contracts still need to be worked out, and decisions still need to be made to ensure the survival of the team and the business. With money being tight, it is imperative that clubs allocate their budgets carefully and spend wisely so as to not waste funds during this crucial period. Clubs need to ensure that they are paying players what they are worth, in addition to agreeing to transfer fees where they are not overpaying for the on-field performances they are going to receive. The DePo models, which will be introduced below, could help ensure that clubs are getting the most value out of their potential new signings.

In some cases, however, overpaying for actual performance makes sense. Consider, for example, a Designated Player coming to MLS from Europe at the end of his career. The salary or fee paid to or for the player is often greater than the on-field performance received by the team, but the cost could potentially be returned in leadership, merchandising, ticket sales, and brand awareness. Bastian Schweinsteiger signed for the Chicago Fire in 2017 and performed slightly above the average player in his position group. Despite these slightly above-average performances, Schweinsteiger received a salary of $5.6 million, making him a top-five earner in MLS. This, however, was never viewed as problematic by the Fire or their fans, as he raised the franchise’s profile for his three years in MLS and ticket sales increased immediately after his signing. In this case, the Chicago Fire knowingly overpaid for his actual on-field performance, but they more than made up for it in sales and locker-room leadership. This is a perfect example of a knowledgeable overspend on a player.

Just as there are knowledgeable overspends, there are also knowledgeable underspends. This is when a club knowingly pays less than the player’s market value but receives a “diamond in the rough” player in return. A recent success story is Tottenham’s signing of Dele Alli from MK Dons. Signed for only €5,000,000, Dele is now worth well above €50,000,000 and is widely regarded as one of the top midfielders in the Premier League.

Underspending on top value is the goal of every player recruitment team and front office around the world. Originally made famous by Michael Lewis’ book describing Billy Beane and Paul DePodesta’s roster-building strategy with the Oakland Athletics of Major League Baseball, the process of identifying and signing these “diamonds in the rough” is often referred to as Moneyball.

The Background to the Models

Let’s start off by making sure all of our bases are covered (pun intended). Salaries and transfer fees are different things. Salary is how much a player gets paid for playing week in and week out. It’s their paycheck. A transfer fee (which really only exists in professional soccer) is how much a team pays another team in order to gain the services of that player. Transfer fees can sometimes be exorbitantly high and other times be ridiculously low. 

In 2017, Neymar transferred from Barcelona to PSG for over $200 Million. Transfer fees can also be $0, or “a free transfer”, for players who are at the end of their contract. In 2011, AC Milan let Andrea Pirlo, one of the best midfielders of all time, join Juventus for free. Poor planning by teams can allow high-value players to leave for free at the end of their contract, when, if sold a year earlier, they could have garnered fees into the millions. 

Teams will also inflate the value of players who are fan favorites, players with high potential, or players that teams simply do not want to get rid of. These factors, in addition to expected new sales, experience, and more, lead to incredibly noisy and clouded data where a player’s true on-field performance does not necessarily reflect their total market value. 

Additionally, the only public source of market values for each player is TransferMarkt, a publicly crowdsourced valuation. While this can help give fans a reasonable estimate of a potential valuation, the value does not come from the teams themselves or from people with knowledge of the true valuation, and as such it is rarely the amount for which a player is actually sold. All of these factors combined make predicting a player’s true market value very difficult.

Salary is also difficult to predict. Expected Compensation models, which predict how much a player should be earning based on their performance, can become noisy due to factors such as the lack of publicly available data, players signing multi-year contracts which pay out the same value year after year, or injuries which cause them to play fewer minutes. While a player who signs a four-year contract is theoretically expected to perform at the same level for all four years, the reality is that their performance will fluctuate far more than their salary.

Additionally, league levels and rules can change how a player is compensated for their services. Teams in Europe’s top five leagues face very little regulation, while Major League Soccer is beholden to roster rules defining the level at which players have to be paid. Luckily, we can use some different tricks to overcome these problems.

How Do the DePo xCompensation and xTransferValue Models Work?

In order to try and match the perceived thought processes of ownership groups and front offices in estimating monetary value from player performance, we split players into one of five salary groups dependent on the league they play in and the salary they make. In MLS, players are split into the following groups: Bench Player, GAM Player, TAM Player, Designated Player, and Elite Designated Player. For Europe’s Top 5 Leagues, DePo splits the players into Bench Player, Minutes Getter, Starter, Captain, and Elite Player. We’ve established a minimum and maximum salary for each group of players.

Once this split was completed, we used a classification model to determine which Salary Group each player belonged to using performance-related measures in addition to each player’s age, the quality of their team, and the league they play in. For each player, the model assigns a proportion of that player to each Salary Group, acknowledging the disparities in pay and talent that you see all the time. From those proportions, we were able to estimate the xComp for each player based on an average of the typical compensations in each Salary Group, weighted by each player’s model proportions. 
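The weighted average described above can be sketched in a few lines of Python. This is only an illustration of the idea: the group averages are invented placeholder numbers, and `expected_compensation` is a hypothetical helper name rather than anything taken from the DePo code.

```python
# Invented "typical" salaries per MLS Salary Group (placeholders, not DePo's values).
GROUP_AVG_SALARY = {
    "Bench Player": 100_000,
    "GAM Player": 400_000,
    "TAM Player": 1_100_000,
    "Designated Player": 2_500_000,
    "Elite Designated Player": 5_000_000,
}


def expected_compensation(proportions, group_avg=GROUP_AVG_SALARY):
    """Weight each group's typical salary by the player's model proportion."""
    total = sum(proportions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    return sum(share * group_avg[group] for group, share in proportions.items())


# A hypothetical player the classifier sees as mostly GAM-level with some TAM upside.
player = {"GAM Player": 0.7, "TAM Player": 0.3}
xcomp = expected_compensation(player)
print(f"xComp = ${xcomp:,.0f}")
```

A player split 70/30 between two groups lands between the two group averages, which is how the approach avoids forcing every player into a single pay bracket.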

In order to estimate a player’s market value, we fit a separate model (a random forest, detailed at the end of the article) using metrics similar to those used to estimate salary, with “days until contract termination” added as a factor.

By comparing xComp to actual Salary and xMarketValue to actual Market Value, DePo could save front offices money in the long run by identifying undervalued players that perform at the same level as their overpaid counterparts. This process matches the system the Oakland A’s developed, which inspired an analytics revolution in baseball and made the concept of Moneyball famous worldwide. 

For those of you who’d like to know more about the inner workings of the model, we’ve included a more detailed explanation at the end of the article.

Identifying Undervalued Players In MLS - Introducing Dollar Per Goal Added

In order to demonstrate that this process works, the next section will identify players that play over their expected market value.

The first example is the namesake of our original research: Alphonso Davies (You can read that article here). In 2018, prior to his move to Bayern Munich, Davies had a breakout year for the Vancouver Whitecaps on a salary of $72,500. Based on his performance that year for Vancouver, his xComp was $678,445. That season, Vancouver only had to spend $7,000 per Goal Added on Alphonso Davies (a measure we call Dollars per Goal Added, or DPGA), whereas the rest of MLS paid an average of $100,000 DPGA for players in the same position group.

In 2019, Toronto and Seattle squared off in the MLS Cup Final, due in no small part to two of the top data scientists in the league: Ravi Ramineni and Devin Pleuler. Both of their teams contained players that outperformed expectations, but one player stood out: Toronto’s Richie Laryea. Throughout the season he put up nearly 5.0 estimated Goals Added based on the aforementioned DAVIES model, similar to some of MLS’s biggest names, but did so on a salary of only $56,000. This overperformance was reflected in the DePo model, which estimated based on his performance that Richie should have been compensated upwards of $200,000.

Another great example of an undervalued player was the Houston Dynamo’s Alberth Elis. In both 2018 and 2019, Elis put up over 10 Predicted Goals Added on a salary of $650,000. DePo predicts that based on Elis’ performance alone, his xComp should have been over $825,000 in 2018, and over $1,000,000 in 2019. While most teams in MLS were paying close to $200,000 DPGA for players with a playstyle similar to Elis’, Houston only had to pay ~$65,000 DPGA.

Using this framework, we can also measure which teams spend most efficiently when it comes to finding undervalued players who outperform their salary expectations. 

In MLS over the past two years, the number one team at outperforming salary expectations has been the Houston Dynamo. This overperformance is due in part to Elis’ consistent and stellar performances, but also in part to their data scientist, ASA’s own Sean Steffen, who helps identify other players similar to Elis to sign. Over the past two seasons, Houston had an average salary of ~$300,000 per player but an average expected salary of $430,000, saving them an average of $130,000 in salary costs per player and ~$5,000,000 in total expected compensation.
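To make the team-level arithmetic concrete, here is a minimal Python sketch. The roster and its numbers are invented for illustration (they are not Houston’s real figures), and `team_savings` is a hypothetical helper name, not part of DePo.

```python
def team_savings(roster):
    """Return (average, total) of xComp minus salary across a roster.

    roster: list of (salary, xcomp) pairs in dollars.
    Positive values mean the team paid less than the model's expectation.
    """
    gaps = [xcomp - salary for salary, xcomp in roster]
    return sum(gaps) / len(gaps), sum(gaps)


# Invented three-player roster, chosen so the average gap is $130,000.
roster = [(300_000, 430_000), (100_000, 250_000), (500_000, 610_000)]
avg_gap, total_gap = team_savings(roster)
print(f"average saving per player: ${avg_gap:,.0f}")
print(f"total saving: ${total_gap:,.0f}")
```

The total scales with the number of player-seasons included, which is why a modest per-player gap can add up to millions over two full rosters.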

In terms of true performance of players, without taking into account the team’s budget or spend, LAFC spent only $125,000 DPGA while amassing a league-leading 88 Total Goals Added over the course of the season. LAFC also paid its players $127,000 less on average than their xComp, saving the team a total of over $1,000,000 on salary costs throughout the year.

Identifying Undervalued Players in Europe - The DePo Salary and Transfer Fee Models

These themes stay consistent in Europe, and Alphonso Davies can be used again as an example. His contract value for the 2019-2020 season with Bayern Munich was €588,000. Throughout the season, he amassed a whopping 7.4 estimated Goals Added in the Bundesliga. Based on his on-field performance during that campaign, DePo predicts that he should have been compensated €1,400,000, roughly €800,000 more than his contract was worth. Bayern Munich also underpaid the Vancouver Whitecaps for Alphonso’s services. Bayern purchased Alphonso Davies in 2018 for $22,000,000, though he currently has an xMarketValue of €68,000,000. This one good piece of business will see Bayern make at least a $50,000,000 return on investment if they decide to sell Alphonso Davies.

Chelsea, who recently had a very active transfer window, did not misplace their funds either. After Timo Werner’s 2019-2020 campaign with RB Leipzig, Chelsea signed him for about €60,000,000. The DePo xMarketValue Model predicts that Werner should have gone to Chelsea for nearly €80,000,000 based on his Bundesliga performances in the season prior. This signals that Chelsea got a great deal for a player with a high future ceiling. 

Another signing which was good business for the receiving team was Liverpool’s signing of Diogo Jota from Wolves. It comes as no surprise to anyone in the analytics community that Will Spearman and Liverpool’s team of data scientists consistently put into place good processes to identify talent, and this time was no different. TransferMarkt listed Jota as being worth about €28,000,000 prior to the start of the 2020-2021 Premier League campaign. Liverpool paid €45,000,000 for the Wolves striker, with a youth prospect going to Wolves for €13,000,000. This saw Liverpool net spend €32,000,000 on Jota, which was still €5,000,000 less than his DePo xTransfer Value of €37,000,000.

A team-level analysis is also possible for Europe’s Top 5 Leagues. The “Moneyball team of Italy,” Atalanta, who reached the Champions League Quarter-Finals in 2019-2020, paid an average of nearly €1,000,000 less than expected for talent that put up similar Goals Added numbers to the world’s best.

DePo is Another Tool, Not an End All Be All

The DePo models can help level the playing field between teams with low and high budgets when it comes to player recruitment. DePo, however, should not be used as an “end all be all” valuation methodology, but rather as another tool that helps identify players to sign. If a team strongly believes that a given player is worth more than what DePo projects, and signs them anyway, that is an informed decision. For example, a club may identify certain skills/traits in a player that they will be better equipped to extract value from due to their manager/system/surrounding personnel. The same can be said for the players that are not signed because the opposing team’s valuation of the player was too far off from what DePo suggested. 

There is more to a signing than just projected value based on on-field performance. How are they going to interact in the locker room? How are they as a person? Will they buy into the club’s philosophy? These are all questions that DePo cannot answer, but they should still be taken into account when signing a target. In some cases, the answers to these questions are worth more than the valuation alone. Coaches’ opinions matter. Front Office opinions matter. The DePo projections, combined with the expert opinion of a coach and recruitment team, can allow clubs to make the most informed decision possible about whether or not to sign a potential target, and how much to pay them once they are signed.

Where Can I Find This Information?

We would be doing a disservice to teams and fans across the globe if we did not release this information for free. You can find the Salary, xComp, Current Market Value, and xMarketValue for players in Europe’s Top 5 Leagues at the hyperlink below. You will also be able to find the Salaries and xComp for MLS players at the same link. Just use the filters to find exactly what you are looking for!

Alphonso: A Front End For DAVIES

Conclusion

The DePo Models can be used to help clubs make informed decisions on signing potential targets by estimating monetary value from on-field contributions. Combined with the Alphonso Front End, teams and fans alike can access a wealth of information about a player all in one centrally located place. This framework eases decision making and consolidates information into a single source. Additionally, DePo can save teams a wealth of money during a time in the world where cash flow is a problem. With a small short-term budget for a data scientist and a data stream, teams will be able to save millions in the long run and make informed and knowledgeable decisions.

Read below for an in-depth mathematical discussion on how DePo works!

The DePo Salary model is composed of two stages. The first involves a Random Forest model, where Salary Groups (defined manually on real salary data) are used as classes to be “predicted.” The second stage uses average (real) salaries within each Salary Group in combination with the class probabilities generated from the random forest model in the first stage to calculate the final xComp. We built separate models for MLS and European leagues to account for differences in the relationship between on-field performance and salary. This is necessary for a few reasons, but the major ones are the salary cap and related rules in MLS and the more complex relationship between age and salary in MLS due to older players coming into the league from Europe.

Salary Groups are determined using separate cutoffs for MLS and European leagues for similar reasons. For MLS players, the groups were determined partially by using the league rules for salary allocation: Bench Players (less than $200k), GAM ($200k - $700k), TAM ($700k - $1.75mil), DPs ($1.75mil - $3.5mil), and Elite DPs (greater than $3.5mil). For European leagues, the groups were Bench Players (less than €2mil), Minutes Getters (€2mil - €6mil), Starters (€6mil - €10mil), Captains (€10mil - €20mil), and Elite Players (greater than €20mil). We formed these groups based on how we believed front offices think about allocating salary in terms of on-field performance. We then slightly varied these cutoff values in early iterations of the model to determine which cutoffs yielded xComps that most closely matched Salaries in the final model predictions. However, these values were not subjected to a more rigorous parameter tuning treatment, and so it is possible that model performance might be slightly improved by changing these cutoffs in the future.
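As an illustration, the cutoffs quoted above can be turned into a simple lookup. This is a sketch rather than the code we actually ran: the function name `salary_group` is hypothetical, and the text does not say which group a salary sitting exactly on a cutoff belongs to, so sending it to the higher group is an assumption.

```python
from bisect import bisect_right

# Cutoffs and group names as quoted in the text (MLS in dollars, Europe in euros).
MLS_CUTOFFS = [200_000, 700_000, 1_750_000, 3_500_000]
MLS_GROUPS = ["Bench Player", "GAM", "TAM", "DP", "Elite DP"]

EUROPE_CUTOFFS = [2_000_000, 6_000_000, 10_000_000, 20_000_000]
EUROPE_GROUPS = ["Bench Player", "Minutes Getter", "Starter", "Captain", "Elite Player"]


def salary_group(salary, cutoffs, groups):
    """Map a real salary to its Salary Group label.

    Assumption: a salary exactly on a cutoff falls into the higher group
    (this is what bisect_right does); the article leaves this unspecified.
    """
    return groups[bisect_right(cutoffs, salary)]


print(salary_group(72_500, MLS_CUTOFFS, MLS_GROUPS))        # Davies, 2018
print(salary_group(650_000, MLS_CUTOFFS, MLS_GROUPS))       # Elis
print(salary_group(588_000, EUROPE_CUTOFFS, EUROPE_GROUPS)) # Davies, 2019-2020
```

These labels are the classes the random forest is then trained to predict from performance data.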

As mentioned above, these Salary Groups were then used as classes in random forest classification models. The variable inputs to the random forest models differ across MLS and European league models. In European leagues, which are free of salary caps and related rules, salary groups could be estimated fairly accurately using a single performance measure (DAVIES goals added) along with age, league and minutes played. MLS salary group estimation proved more difficult; we ended up including a number of performance-related variables in addition to goals added, which allowed the model to learn which types of players are more likely to be awarded DP designations (a simpler model closer to the European model could not classify any DPs correctly). We also did not include age in the MLS model, as older incoming players from Europe make the relationships between age, salary, and level of play much more complex in MLS than in Europe. Finally, we included the specific MLS season as a predictor in the MLS model to account for inflation as well as changes in salary rules across seasons.

In sum, the MLS model included season, a number of performance measures, and minutes played; the European model included predicted goals added, age, league, and minutes played. The classification accuracy of these models was between 65% and 70%. It’s important to note here that we expected errors in classification resulting from these models, and further that these errors would be meaningful in the next step. For example, a player whose true salary is in the ‘Elite Player’ range but who performs poorly should be classified by our model into a lower Salary Group, as their performance does not match their actual salary. In addition, the model compares performance in a given season to salary in a given season with no regard to when the salary was awarded to the player, leading to inevitable mismatches when performance changes significantly after the salary was awarded (which we hoped the difference between salary and xComp would reflect). Class “probabilities,” or weights, were then generated from these models fit on the full dataset to be used in the following step.

In step two, we first calculated the average salary within each Salary Group. Messi and Ronaldo, the ever-present outliers, were excluded from these averages because they broke everything (more on the model’s performance for the very highest earners below). We further broke down these averages by either position (for MLS) or playstyle (for European leagues). The position/playstyle difference across models was mainly a result of data availability—while ASA lists detailed positions for all players (Wing, Attacking Mid, etc.) we had only general positions (D/M/F) from FBref for European leagues. As a stand-in for more specific positions, we used the play-style clusters generated by our DAVIES model for European players. We used general categories rather than specific categories (e.g. Attacker rather than Playmaker), as the general categories are far more stable season-to-season, avoiding major valuation changes if a player adapted to a new specific style. We also broke these averages down by season in the MLS model as MLS salary rules change year-to-year. 
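The group averages from this step might be computed along the following lines. This is a minimal sketch with invented rows: the player names and salaries are placeholders, and `group_averages` is a hypothetical helper, not our production code.

```python
from collections import defaultdict
from statistics import mean

# Invented rows: (player, salary_group, playstyle, salary in euros).
rows = [
    ("Player A", "Elite Player", "Attacker", 24_000_000),
    ("Player B", "Elite Player", "Attacker", 26_000_000),
    ("Outlier X", "Elite Player", "Attacker", 70_000_000),
    ("Player C", "Starter", "Defender", 7_000_000),
    ("Player D", "Starter", "Defender", 9_000_000),
]


def group_averages(rows, excluded=frozenset()):
    """Average salary per (salary group, playstyle), skipping excluded players."""
    buckets = defaultdict(list)
    for player, group, style, salary in rows:
        if player not in excluded:
            buckets[(group, style)].append(salary)
    return {key: mean(values) for key, values in buckets.items()}


averages = group_averages(rows, excluded={"Outlier X"})
with_outlier = group_averages(rows)
print(averages[("Elite Player", "Attacker")])
print(with_outlier[("Elite Player", "Attacker")])
```

Comparing the two results shows why a single extreme salary is left out: it drags the whole group average far from what a typical player in that group earns. For the MLS model, the key would also include the season.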

Then, for each player, we began by multiplying the weight of each Salary Group for each player by the average for that player’s position in that Salary Group. For example, a player classified with a 90% weight for TAM and 10% weight for GAM would have an estimated salary of (0.9*avg TAM + 0.1*avg GAM), where averages are position-specific (and season-specific in the MLS model). However, this alone leads to downward bias in xComp for the players in the top half of a Salary Group; these players would generally be classified as 100% probable for that group and assigned the group average, usually lower than their actual salaries. To somewhat correct for this bias, players with very high probabilities for high salary groups were assigned a weighted average of the average salary in the group and the max salary in the group (average + max / 1.85). In the final model, the xComp of a hypothetical 90% TAM/10% GAM player would be calculated as (0.9*(avg TAM + max TAM/1.85) + 0.1*avg GAM). The bias still remains for the very highest-paid players - for this reason, the most ‘overpaid’ players are these highest-paid players. However, given that these players are compensated for much more than on-field play, and that this approach seemed to work for the vast majority of players, we considered this bias acceptable for the purpose of our model. 
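The calculation above can be sketched as follows, with several loudly stated assumptions. The written formula is ambiguous about grouping; this sketch reads it as (average + max) / 1.85, because that value stays between the group average and the group maximum whenever the average is at most 0.85 times the maximum, whereas average + (max / 1.85) would exceed the group maximum whenever the average is above roughly 0.46 times the maximum. The 0.9 probability threshold, the choice of which groups count as “high,” and all salary figures are hypothetical.

```python
def blended_group_value(avg, max_salary, divisor=1.85):
    """Assumed reading of the correction: (avg + max) / 1.85."""
    return (avg + max_salary) / divisor


def xcomp(weights, stats, high_groups, prob_threshold=0.9):
    """Sum weight * group value, using the blended value for near-certain
    membership in a high salary group.

    weights: {group: class weight from the classifier}
    stats:   {group: (position-specific average salary, group max salary)}
    high_groups and prob_threshold are illustrative assumptions.
    """
    total = 0.0
    for group, weight in weights.items():
        avg, max_salary = stats[group]
        if group in high_groups and weight >= prob_threshold:
            value = blended_group_value(avg, max_salary)
        else:
            value = avg
        total += weight * value
    return total


# Invented (average, max) salaries for one position.
stats = {"GAM": (400_000, 700_000), "TAM": (1_100_000, 1_750_000)}
weights = {"TAM": 0.9, "GAM": 0.1}

uncorrected = xcomp(weights, stats, high_groups=set())
corrected = xcomp(weights, stats, high_groups={"TAM"})
print(f"uncorrected: ${uncorrected:,.0f}, corrected: ${corrected:,.0f}")
```

The corrected figure is higher than the uncorrected one, which is the intended effect: it pulls the estimate for near-certain members of a group up toward the top of that group’s range.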

We then compared these xComps to actual Salaries in both leagues. Final model estimates were, on average, about $188,000 away from actual salaries in MLS and about €1,000,000 away from actual salaries in Europe. The larger error in Europe reflects the larger differences between actual salaries there. Both models accounted for more than 75% of the variance in salaries. We considered these numbers to be close enough to actual salaries for the model to be rooted in reality, but the estimations are by no means perfect. Importantly, though, we expected estimation errors in the final model, and more importantly we expected the errors to be meaningful: a player who does not perform up to salary expectations should have an xComp well below their actual salary, for example. From here, we examined the larger prediction errors to make sure they seemed to point to players that truly played above or below what we might expect from their actual salary.
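For readers who want to run the same kind of check on their own numbers, here is a small sketch. It assumes the error figure is a mean absolute error, which the text above does not actually specify, and it uses invented salaries and xComps for four hypothetical players.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of variance in the actual values accounted for by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Invented data: four hypothetical players.
players = ["P1", "P2", "P3", "P4"]
salaries = [100_000, 300_000, 650_000, 1_500_000]
xcomps = [150_000, 250_000, 825_000, 1_300_000]

mae = mean_absolute_error(salaries, xcomps)
r2 = r_squared(salaries, xcomps)

# Signed gaps (salary minus xComp), sorted from most underpaid to most overpaid.
gaps = sorted(
    zip(players, (s - x for s, x in zip(salaries, xcomps))),
    key=lambda item: item[1],
)
print(f"MAE = ${mae:,.0f}, R^2 = {r2:.3f}")
print("most underpaid:", gaps[0], "most overpaid:", gaps[-1])
```

Sorting the signed gaps is the step that matters for scouting: the extremes of that list are the candidates to inspect by hand.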

In general, the xComp model’s over/under predictions matched our intuition (see the write-up above for examples), though some specific cases indicate possible overfitting during the random forest procedure. Mesut Ozil and Alexis Sanchez during their time in England stood out as being classified as highly paid players when, by performance alone, they probably shouldn’t be. Notably though, the model still identifies both as overpaid (just not by as much as we would have expected). Future models building on our framework might want to test other types of statistical models that assume more linear relationships between performance and salary than our random forest model did in order to avoid cases like this.

Similar problems likely exist in the MLS model, where many more variables were included in the random forest. Because of the large number of performance variables necessary to estimate salary groups in MLS, the relationship between raw performance and xComp in MLS is not as straightforward as in the European leagues. Dollar per Goal Added, however, does provide a much more straightforward interpretation for those who wish to see how raw contribution and salary are related more directly. The Dollar per Goal Added calculation is simple (salary / estimated DAVIES goals added); we used DAVIES goals added here rather than ASA’s actual goals added (g+) for consistency in calculation, as we did not have actual g+ for European players.
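As a quick worked example, Dollar per Goal Added is read here as dollars of salary per goal added, which matches the figures quoted earlier in the article. The goals added inputs are rounded stand-ins (the article says “over 10” for Elis and “nearly 5.0” for Laryea), and raising an error for non-positive goals added is an assumption, since the article does not say how such players are handled.

```python
def dollars_per_goal_added(salary, goals_added):
    """Salary divided by estimated DAVIES goals added."""
    if goals_added <= 0:
        # Assumption: the ratio is not meaningful for zero or negative goals added.
        raise ValueError("goals_added must be positive")
    return salary / goals_added


elis = dollars_per_goal_added(650_000, 10.0)   # rounded stand-in for "over 10"
laryea = dollars_per_goal_added(56_000, 5.0)   # rounded stand-in for "nearly 5.0"
print(f"Elis: ${elis:,.0f} per goal added")
print(f"Laryea: ${laryea:,.0f} per goal added")
```

Lower is better from the club’s point of view: the same on-field contribution is being bought for fewer dollars.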

It’s also important to note that the framework of the xComp model depends on having information about actual salaries; it’s meant to be applied retrospectively rather than to predict future xComp. A model meant to predict future xComp will likely have to tease out performance prior to a new contract vs. performance during an ongoing contract, among other issues not addressed here. However, a model able to predict xComp in future seasons would be very valuable, and we look forward to someone solving the problems that it would require.

Further, the DePo xComp approach is bound to underpredict salaries for the majority of very top earners. However, players earning at this level (think Messi, Ronaldo, Neymar, Mbappe, etc.) are paid for so much more than their on-field play that we felt a model meant to be applied to the other 99% of players is bound to underpredict their value anyway. Interestingly, the MLS model identified that Vela, Zlatan, and Giovinco’s best seasons (arguably the best individual seasons in MLS history) did provide enough on-field value alone to justify their salary despite being top earners in the league. On the whole, our xComp approach involving comparisons of class probabilities to average group salaries seemed to work well despite these limitations.

The DePo transfer fee model is much more straightforward. This model is simply a random forest model fit to TransferMarkt valuations at the end of a season using the performance from the preceding season. The variable inputs to the random forest model were age, days until contract termination, DAVIES goals added, playstyle cluster from our DAVIES model, minutes played, and team Elo rating. In a cross-validation to test the model’s accuracy, the model’s predictions were, on average, about €9,000,000 away from TransferMarkt’s values and accounted for 70% of the variance in those valuations. As in the xComp model, we further evaluated the model on the usefulness of its prediction errors.

Similar to the xComp model, the transfer fee model systematically underpredicted TransferMarkt’s top valued players. Again, we felt that this bias was sensible given the objectives of the model. While DePo aims to estimate xTransferFee based on a player’s performance, actual valuations of the top players in the world reflect much more than just on-field performance. We feel that isolating the estimated monetary value of on-field performance in the form of xComp and xTransferFees from DePo is a valuable tool for decision-making, even if it doesn’t capture these off-field contributions to monetary value.