Introducing Glass Onion by US Soccer, An Identifier Synchronization Tool

By Akshay Easwaran

Every soccer data analysis group has had the same jump-scare: they sign data provider contract #2, and suddenly, they need a solution to link teams, matches, and players across their data ecosystem. Individual data sources are easy to ingest: you can tuck them in their own little schemas, write code specific to their little universes, and rest easy that changes won’t come thick and fast without proper notice from your providers. But with those individual data sources come individual standards for tracking objects (players, teams, or matches). In most (if not all) major American sports, working across providers is easy: just use an object’s single source of truth identifier from the league itself. But as every analyst finds out very quickly, there is no public single source of truth identifier for teams, matches, and players in international soccer.

Thus, often-single-person data analytics departments have a massive data engineering challenge on their hands. For the 30 players on a senior roster and just two data sources, they could probably get away with updating a spreadsheet or database table manually. But what about the players on the second team? Those in the academy? Those in future acquisitions? What happens if players get married and change their names? Suddenly, a once very, very small player universe has gotten very, very, very large indeed, and practitioners must have something in place ASAP for the next window or match-week, come hell or high water. Does it matter if that solution is unwieldy long term? Not to the director of recruitment or high performance: they just care that the metrics they care about show up in the club web app for every player.

Our group at US Soccer (the Data Analytics team in High Performance) ran into this challenge last year when blending insights from our event and tracking data pipelines. In our preliminary research, we came across two pieces of prior art:

  • This 2022 blog post by @UnravelSports kickstarted a discussion on how to do player synchronization “right”, evaluating the use of cosine similarity and emphasizing the use of “higher order objects” to synchronize others: unifying teams unlocks the ability to synchronize matches, which unlocks player-level synchronization.

  • Parma Calcio (who should be commended for being the first organization we’d seen talk about this problem publicly!) offered a solution to this problem for just players using cosine similarity and birth dates.

Blending these two sources of inspiration with our experience at clubs and with various data vendors, we asked ourselves: what does a solution in this space need to do to simplify such a very complex problem across common object types in soccer? We believe an optimal solution should be:

  • Provider-agnostic: there is no single source of truth, so establish some commonality and synchronize everything to everything else

  • Flexible: different data sources have different levels of coverage fidelity

  • Extensible: anyone can add new object types easily

We believe we’ve achieved all three of these criteria with the release of our new open-source project: Glass Onion.

How can I use Glass Onion?

Let’s say you want to build an attacker profile using ASA’s receiving G+ and StatsBomb’s shot xG, and you have these two dataframes of metrics for a given match:

asa_player_df:

asa_player_id | player_name | player_nickname | team_id | team_name | raw_receiving_g+
KAqBNEpWqb | Leonardo Campana | Leonardo Campana | MIA | Inter Miami CF | 0.2458
KXMe8WD1Q6 | Facundo Farías | Facundo Farías | MIA | Inter Miami CF | 0.1069
KXMe8gzPQ6 | Luquinhas | Luquinhas | NYRB | New York Red Bulls | 0.0417
Oa5wdYB9q1 | Elias Manoel | Elias Manoel | NYRB | New York Red Bulls | 0.0563
Oa5wdkVYq1 | Jorge Cabezas | Jorge Cabezas | NYRB | New York Red Bulls | 0.006
OlMl9jG05L | Cory Burke | Cory Burke | NYRB | New York Red Bulls | 0.0019
Pk5Lgax7MO | Robbie Robinson | Robbie Robinson | MIA | Inter Miami CF | 0.0072
eVq3jBjj5W | Tom Barlow | Tom Barlow | NYRB | New York Red Bulls | 0.1236
eVq3y38yMW | Benjamin Cremaschi | Benjamin Cremaschi | MIA | Inter Miami CF | 0.0123
vzqo78BJqa | Omir Fernández | Omir Fernández | NYRB | New York Red Bulls | 0.1843
vzqoJrEk5a | Diego Gómez | Diego Gómez | MIA | Inter Miami CF | 0.0847
vzqowm7qap | Lionel Messi | Lionel Messi | MIA | Inter Miami CF | 0.2935

statsbomb_player_df:

statsbomb_player_id | player_name | player_nickname | team_id | team_name | statsbomb_shot_xg
5503 | Lionel Andrés Messi Cuccittini | Lionel Andrés Messi Cuccittini | MIA | Inter Miami | 0.961997
32878 | Facundo Farías | Facundo Farías | MIA | Inter Miami | 0.080789
36149 | Robbie Robinson Belmar | Robbie Robinson Belmar | MIA | Inter Miami | 0
41824 | Leonardo Campana Romero | Leonardo Campana Romero | MIA | Inter Miami | 0.252315
222155 | Diego Alexander Gómez Amarilla | Diego Alexander Gómez Amarilla | MIA | Inter Miami | 0.088681
225209 | Benjamin Cremaschi | Benjamin Cremaschi | MIA | Inter Miami | 0
12360 | Cory Burke | Cory Burke | NYRB | New York Red Bulls | 0
23304 | Lucas Lima Linhares | Lucas Lima Linhares | NYRB | New York Red Bulls | 0.048729
23859 | Omir Guadalupe Fernandez Mosso | Omir Guadalupe Fernandez Mosso | NYRB | New York Red Bulls | 0.417629
24955 | Tom Barlow | Tom Barlow | NYRB | New York Red Bulls | 0.246331
112990 | Elias Manoel Alves de Paula | Elias Manoel Alves de Paula | NYRB | New York Red Bulls | 0
379652 | Peter Stroud | Peter Stroud | NYRB | New York Red Bulls | 0

You can use Glass Onion to synchronize players like so…

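import pandas as pd

# PlayerSyncableContent and PlayerSyncEngine come from the glass_onion package;
# see the documentation linked at the end of this post for the exact import path.

# Wrap each provider's identifying columns (metrics dropped) for synchronization.
asa_content = PlayerSyncableContent(
    provider="asa",
    data=asa_player_df.drop(["raw_receiving_g+"], axis=1)
)

statsbomb_content = PlayerSyncableContent(
    provider="statsbomb",
    data=statsbomb_player_df.drop(["statsbomb_shot_xg"], axis=1)
)

# Synchronize every pair of provided datasets.
engine = PlayerSyncEngine(
    content=[asa_content, statsbomb_content],
    verbose=True
)
result = engine.synchronize()
result.data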

team_id | player_name | asa_player_id | statsbomb_player_id | provider
MIA | Benjamin Cremaschi | eVq3y38yMW | 225209 | asa
NYRB | Cory Burke | OlMl9jG05L | 12360 | asa
MIA | Diego Gómez | vzqoJrEk5a | 222155 | asa
NYRB | Elias Manoel | Oa5wdYB9q1 | 112990 | asa
MIA | Facundo Farías | KXMe8WD1Q6 | 32878 | asa
NYRB | Jorge Cabezas | Oa5wdkVYq1 | 379652 | asa
MIA | Leonardo Campana | KAqBNEpWqb | 41824 | asa
MIA | Lionel Messi | vzqowm7qap | 5503 | asa
NYRB | Luquinhas | KXMe8gzPQ6 | 23304 | asa
NYRB | Omir Fernández | vzqo78BJqa | 23859 | asa
MIA | Robbie Robinson | Pk5Lgax7MO | 36149 | asa
NYRB | Tom Barlow | eVq3jBjj5W | 24955 | asa

…so you can produce this table with both metrics side by side:

composite = pd.merge(result.data[["player_name", "team_id", "asa_player_id", "statsbomb_player_id"]], asa_player_df[["asa_player_id", "raw_receiving_g+"]], on="asa_player_id")
composite = pd.merge(composite, statsbomb_player_df[["statsbomb_player_id", "statsbomb_shot_xg"]], on="statsbomb_player_id")
composite.sort_values(by="raw_receiving_g+", ascending=False).head()

player_name | team_id | asa_player_id | statsbomb_player_id | raw_receiving_g+ | statsbomb_shot_xg
Lionel Messi | MIA | vzqowm7qap | 5503 | 0.2935 | 0.961997
Leonardo Campana | MIA | KAqBNEpWqb | 41824 | 0.2458 | 0.252315
Omir Fernández | NYRB | vzqo78BJqa | 23859 | 0.1843 | 0.417629
Tom Barlow | NYRB | eVq3jBjj5W | 24955 | 0.1236 | 0.246331
Facundo Farías | MIA | KXMe8WD1Q6 | 32878 | 0.1069 | 0.080789

From there, you can aggregate these numbers up to the competition-season level, create percentiles, etc. This is the power of Glass Onion: it handles the particulars of player synchronization for you, allowing you to focus on the higher order tasks of player evaluation.
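As a rough illustration of that next step, here is a minimal pandas sketch that sums per-match metrics per player and then ranks players as percentiles. It is not part of Glass Onion. The match_id column, the per_match and season names, and all values for the second match ("m2") are made up for the example; only the "m1" rows reuse numbers from the table above.

```python
import pandas as pd

# Per-match rows shaped like the composite table. match_id and the "m2"
# values are invented placeholders, not real data.
per_match = pd.DataFrame(
    {
        "match_id": ["m1", "m1", "m1", "m2", "m2", "m2"],
        "asa_player_id": ["vzqowm7qap", "KAqBNEpWqb", "vzqo78BJqa"] * 2,
        "player_name": ["Lionel Messi", "Leonardo Campana", "Omir Fernández"] * 2,
        "raw_receiving_g+": [0.2935, 0.2458, 0.1843, 0.10, 0.30, 0.05],
        "statsbomb_shot_xg": [0.961997, 0.252315, 0.417629, 0.20, 0.50, 0.10],
    }
)

metrics = ["raw_receiving_g+", "statsbomb_shot_xg"]

# Sum each metric across matches, one row per synchronized player.
season = per_match.groupby(
    ["asa_player_id", "player_name"], as_index=False
)[metrics].sum()

# Percentile rank of each player within this (tiny) player pool.
for metric in metrics:
    season[f"{metric}_pctile"] = season[metric].rank(pct=True)

print(season)
```

Grouping on the synchronized identifier rather than on the name alone keeps two players who share a name from being summed together.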

How does Glass Onion work under the hood?

In general, Glass Onion takes a list of SyncableContent objects, generates all possible combinations of pairs from that list, and uses the logic in a specific SyncEngine subclass to synchronize one pair at a time. The results of all pairs are then merged and deduplicated. 
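The pairwise flow can be sketched in plain Python. This is a toy illustration under stated assumptions, not Glass Onion's implementation: sync_pair and merge_links are hypothetical helpers, the pair logic is reduced to an exact match on an already-normalized name, and "provider_c" with its identifier is an invented third provider.

```python
from itertools import combinations

# Toy provider tables: normalized player name -> that provider's identifier.
providers = {
    "asa": {"lionel messi": "vzqowm7qap", "tom barlow": "eVq3jBjj5W"},
    "statsbomb": {"lionel messi": "5503", "tom barlow": "24955"},
    "provider_c": {"lionel messi": "c-001"},  # hypothetical third provider
}


def sync_pair(left, right):
    """Stand-in for a SyncEngine's pair logic: exact match on the shared key."""
    left_name, left_ids = left
    right_name, right_ids = right
    return [
        {left_name: left_ids[key], right_name: right_ids[key]}
        for key in left_ids.keys() & right_ids.keys()
    ]


def merge_links(links):
    """Fold pairwise links into one record per entity, dropping duplicates."""
    records = []
    for link in links:
        # Existing records that share any (provider, id) with this link.
        overlapping = [
            r for r in records if any(r.get(p) == i for p, i in link.items())
        ]
        merged = dict(link)
        for r in overlapping:
            merged.update(r)
            records.remove(r)
        records.append(merged)
    return records


# Every unordered pair of providers is synchronized; no provider is "the" target.
links = [
    link
    for left, right in combinations(providers.items(), 2)
    for link in sync_pair(left, right)
]
records = merge_links(links)
print(records)
```

With three providers there are three pairs, so the same player can be linked up to three times; the merge step collapses those links into a single record carrying every provider's identifier.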

At release, Glass Onion supports three object types (teams, matches, and players), each with a corresponding SyncEngine subclass. As laid out in this UnravelSports blog post, each object type depends on some other “higher-order object type” to provide unified context and/or limit the search space for synchronization:

  • Teams: competitions and seasons

  • Matches: competitions, seasons, and teams (home/away)

  • Players: teams and matches

The logic involved in synchronizing these objects is as follows (see our Methodology docs for the latest):

  • Teams: 

    • Attempt to join pair simply on team_name (along with competition_id and season_id, if desired).

    • With remaining records, attempt to match via cosine similarity using a minimum threshold of 75% similarity.

  • Matches: 

    • Attempt to join pair using match_date, home_team_id, and away_team_id (along with competition_id and season_id, if desired).

    • Account for matches with different dates across data providers (timezones, TV scheduling, etc.) by adjusting match_date in one dataset in the pair by -3 to 3 days, then attempting synchronization using match_date, home_team_id, and away_team_id again. This process is then repeated for the other dataset in the pair.

    • Account for matches postponed to a different date outside the [-3, 3] day range by attempting synchronization using matchday, home_team_id, and away_team_id.

  • Players:

    • Attempt to join pair using player_name with a minimum 75% cosine similarity threshold for player name. Additionally, require that jersey_number and team_id are equal for candidate pairs that meet the similarity threshold.

    • Account for players with different birth dates across data providers (timezones, human error, etc.) by adjusting birth_date in one dataset in the pair by -1 to 1 days and/or swapping the month and day, then attempting synchronization using birth_date, team_id, and a combination of player_name and player_nickname. This process is then repeated for the other dataset in the pair.

    • Attempt to join remaining records using combinations of player_name and player_nickname with a minimum 75% cosine similarity threshold for player name. Additionally, require that team_id is equal for candidate pairs that meet the similarity threshold.

    • Attempt to join remaining records using "naive similarity": looking for normalized parts of one record's player name (or player_nickname) that exist in another's. Additionally, require that team_id is equal for candidate pairs found via this method.

    • Attempt to join remaining records using combinations of player_name and player_nickname with no minimum cosine similarity threshold. Additionally, require that team_id is equal. 
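Two of the building blocks above, thresholded cosine similarity on names and date-shifted joins, can be sketched with the standard library alone. This is a minimal sketch under assumptions, not Glass Onion's code: the steps above do not say how names are vectorized, so character bigrams over accent-stripped, lowercased text are an assumption here, and normalize, bigrams, cosine_similarity, match_names, and dates_match are hypothetical helper names.

```python
import math
import unicodedata
from collections import Counter
from datetime import date, timedelta


def normalize(name: str) -> str:
    """Lowercase and strip accents so 'Fernández' and 'Fernandez' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def bigrams(name: str) -> Counter:
    """Count character bigrams of the padded, normalized name (an assumed vectorization)."""
    text = f" {normalize(name)} "
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two names' bigram count vectors."""
    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def match_names(left, right, threshold=0.75):
    """Pair each left name with its best-scoring right name, if it clears the threshold."""
    matches = {}
    for name in left:
        best = max(right, key=lambda candidate: cosine_similarity(name, candidate))
        if cosine_similarity(name, best) >= threshold:
            matches[name] = best
    return matches


def dates_match(a: date, b: date, max_shift_days: int = 3) -> bool:
    """True if shifting `a` by -max_shift_days..+max_shift_days days lands on `b`."""
    return any(
        a + timedelta(days=shift) == b
        for shift in range(-max_shift_days, max_shift_days + 1)
    )


print(match_names(["Inter Miami CF"], ["Inter Miami", "New York Red Bulls"]))
print(dates_match(date(2023, 8, 26), date(2023, 8, 27)))
```

In a layered approach like the one described, a function such as match_names would only see the records left over after the exact joins, and the same dates_match idea with max_shift_days=1 covers the birth-date adjustment for players.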

How does Glass Onion achieve our core tenets for an optimal synchronization solution?

  • Provider-agnostic: objects don’t get synchronized to any single source of truth, nor is the first dataset in the list provided to a SyncEngine the target to build off. Glass Onion synchronizes every possible pair of datasets provided to ensure total coverage.

  • Flexible: Glass Onion has robust layers of synchronization logic to evaluate edge case after edge case and ensure that the produced synchronized dataframes are as complete as possible.

  • Extensible: we’ve designed the primary classes of Glass Onion (SyncEngine and SyncableContent) to be as generalizable as possible and easy to subclass, so that anyone else can add bespoke handling for other object types (both in and outside of soccer or sport). 

Identifier synchronization isn’t the most exciting, bleeding-edge work of soccer analytics, but it’s vital to successful data-driven recruitment and match analysis. Solving these sorts of platform-level issues helps the rest of an analytics team “solve soccer”. But traditionally, once someone solves one of these fundamental problems of soccer analytics, they close ranks: they have a competitive advantage now, why risk it?

We want to change that: providing off-the-shelf solutions to problems like this helps everyone level up their game. We see this package as our first step in a robust relationship with the global soccer data ecosystem. One of U.S. Soccer’s strategic pillars is to Grow the Game: our team’s release of software like Glass Onion levels the playing field for hobbyists and practitioners across the ecosystem, lowers the barrier to entry for newcomers to complex, cohesive data analysis, and enables analysts at all levels to do their best work.

Documentation for the package can be found at the US Soccer Federation’s GitHub: https://ussoccerfederation.github.io/glass_onion/ and more information on the methodology can be found here: https://ussoccerfederation.github.io/glass_onion/methodology/

Editor’s Note: If you want to hear Akshay talking about the package, how they made it, and their motivation for doing so - as well as his time so far at US Soccer - head to the ASA show podcast feed, available wherever you get your podcasts.