Introducing Glass Onion by US Soccer, An Identifier Synchronization Tool
Every soccer data analysis group has had the same jump-scare: they sign data provider contract #2, and suddenly, they need a solution to link teams, matches, and players across their data ecosystem. Individual data sources are easy to ingest: you can tuck them in their own little schemas, write code specific to their little universes, and rest easy that changes won’t come thick and fast without proper notice from your providers. But with those individual data sources come individual standards for tracking objects (players, teams, or matches). In most (if not all) major American sports, working across providers is easy: just use an object’s single source of truth identifier from the league itself. But as every analyst finds out very quickly, there is no public single source of truth identifier for teams, matches, and players in international soccer.
Thus, often-single-person data analytics departments have a massive data engineering challenge on their hands. For the 30 players on a senior roster and just two data sources, they could probably get away with updating a spreadsheet or database table manually. But what about the players on the second team? Those in the academy? Those in future acquisitions? What happens if players get married and change their names? Suddenly, a once very, very small player universe has gotten very, very, very large indeed, and practitioners must have something in place ASAP for the next window or match-week, come hell or high water. Does it matter if that solution is unwieldy in the long term? Not to the director of recruitment or high performance: they just care that the metrics they care about show up in the club web app for every player.
Our group at US Soccer (the Data Analytics team in High Performance) ran into this challenge last year when blending insights from our event and tracking data pipelines. In our preliminary research, we came across two pieces of prior art:
This 2022 blog post by @UnravelSports kickstarted a discussion on how to do player synchronization “right”, evaluating the use of cosine similarity and emphasizing the use of “higher order objects” to synchronize others: unifying teams unlocks the ability to synchronize matches, which unlocks player-level synchronization.
Parma Calcio (who should be commended for being the first organization we’d seen talk about this problem publicly!) offered a solution to this problem for just players using cosine similarity and birth dates.
Blending these two sources of inspiration with our experience at clubs and with various data vendors, we asked ourselves: what does a solution in this space need to do to simplify such a very complex problem across common object types in soccer? We believe an optimal solution should be:
Provider-agnostic: there is no single source of truth, so establish some commonality and synchronize everything to everything else
Flexible: different data sources have different levels of coverage fidelity
Extensible: anyone can add new object types easily
We believe we’ve achieved all three of these criteria with the release of our new open-source project: Glass Onion.
How can I use Glass Onion?
Let’s say you want to build an attacker profile using ASA’s receiving G+ and StatsBomb’s shot xG, and you have these two dataframes of metrics for a given match:
| asa_player_id | player_name | player_nickname | team_id | team_name | raw_receiving_g+ |
|---|---|---|---|---|---|
| KAqBNEpWqb | Leonardo Campana | Leonardo Campana | MIA | Inter Miami CF | 0.2458 |
| KXMe8WD1Q6 | Facundo Farías | Facundo Farías | MIA | Inter Miami CF | 0.1069 |
| KXMe8gzPQ6 | Luquinhas | Luquinhas | NYRB | New York Red Bulls | 0.0417 |
| Oa5wdYB9q1 | Elias Manoel | Elias Manoel | NYRB | New York Red Bulls | 0.0563 |
| Oa5wdkVYq1 | Jorge Cabezas | Jorge Cabezas | NYRB | New York Red Bulls | 0.006 |
| OlMl9jG05L | Cory Burke | Cory Burke | NYRB | New York Red Bulls | 0.0019 |
| Pk5Lgax7MO | Robbie Robinson | Robbie Robinson | MIA | Inter Miami CF | 0.0072 |
| eVq3jBjj5W | Tom Barlow | Tom Barlow | NYRB | New York Red Bulls | 0.1236 |
| eVq3y38yMW | Benjamin Cremaschi | Benjamin Cremaschi | MIA | Inter Miami CF | 0.0123 |
| vzqo78BJqa | Omir Fernández | Omir Fernández | NYRB | New York Red Bulls | 0.1843 |
| vzqoJrEk5a | Diego Gómez | Diego Gómez | MIA | Inter Miami CF | 0.0847 |
| vzqowm7qap | Lionel Messi | Lionel Messi | MIA | Inter Miami CF | 0.2935 |

| statsbomb_player_id | player_name | player_nickname | team_id | team_name | statsbomb_shot_xg |
|---|---|---|---|---|---|
| 5503 | Lionel Andrés Messi Cuccittini | Lionel Andrés Messi Cuccittini | MIA | Inter Miami | 0.961997 |
| 32878 | Facundo Farías | Facundo Farías | MIA | Inter Miami | 0.080789 |
| 36149 | Robbie Robinson Belmar | Robbie Robinson Belmar | MIA | Inter Miami | 0 |
| 41824 | Leonardo Campana Romero | Leonardo Campana Romero | MIA | Inter Miami | 0.252315 |
| 222155 | Diego Alexander Gómez Amarilla | Diego Alexander Gómez Amarilla | MIA | Inter Miami | 0.088681 |
| 225209 | Benjamin Cremaschi | Benjamin Cremaschi | MIA | Inter Miami | 0 |
| 12360 | Cory Burke | Cory Burke | NYRB | New York Red Bulls | 0 |
| 23304 | Lucas Lima Linhares | Lucas Lima Linhares | NYRB | New York Red Bulls | 0.048729 |
| 23859 | Omir Guadalupe Fernandez Mosso | Omir Guadalupe Fernandez Mosso | NYRB | New York Red Bulls | 0.417629 |
| 24955 | Tom Barlow | Tom Barlow | NYRB | New York Red Bulls | 0.246331 |
| 112990 | Elias Manoel Alves de Paula | Elias Manoel Alves de Paula | NYRB | New York Red Bulls | 0 |
| 379652 | Peter Stroud | Peter Stroud | NYRB | New York Red Bulls | 0 |
You can use Glass Onion to synchronize players like so…
asa_content = PlayerSyncableContent(
    provider="asa",
    data=asa_player_df.drop(["raw_receiving_g+"], axis=1)
)
statsbomb_content = PlayerSyncableContent(
    provider="statsbomb",
    data=statsbomb_player_df.drop(["statsbomb_shot_xg"], axis=1)
)
engine = PlayerSyncEngine(
    content=[asa_content, statsbomb_content],
    verbose=True
)
result = engine.synchronize()
result.data
| team_id | player_name | asa_player_id | statsbomb_player_id | provider |
|---|---|---|---|---|
| MIA | Benjamin Cremaschi | eVq3y38yMW | 225209 | asa |
| NYRB | Cory Burke | OlMl9jG05L | 12360 | asa |
| MIA | Diego Gómez | vzqoJrEk5a | 222155 | asa |
| NYRB | Elias Manoel | Oa5wdYB9q1 | 112990 | asa |
| MIA | Facundo Farías | KXMe8WD1Q6 | 32878 | asa |
| NYRB | Jorge Cabezas | Oa5wdkVYq1 | 379652 | asa |
| MIA | Leonardo Campana | KAqBNEpWqb | 41824 | asa |
| MIA | Lionel Messi | vzqowm7qap | 5503 | asa |
| NYRB | Luquinhas | KXMe8gzPQ6 | 23304 | asa |
| NYRB | Omir Fernández | vzqo78BJqa | 23859 | asa |
| MIA | Robbie Robinson | Pk5Lgax7MO | 36149 | asa |
| NYRB | Tom Barlow | eVq3jBjj5W | 24955 | asa |
…so you can produce this table with both metrics side by side:
composite = pd.merge(result.data[["player_name", "team_id", "asa_player_id", "statsbomb_player_id"]], asa_player_df[["asa_player_id", "raw_receiving_g+"]], on="asa_player_id")
composite = pd.merge(composite, statsbomb_player_df[["statsbomb_player_id", "statsbomb_shot_xg"]], on="statsbomb_player_id")
composite.sort_values(by="raw_receiving_g+", ascending=False).head()
| player_name | team_id | asa_player_id | statsbomb_player_id | raw_receiving_g+ | statsbomb_shot_xg |
|---|---|---|---|---|---|
| Lionel Messi | MIA | vzqowm7qap | 5503 | 0.2935 | 0.961997 |
| Leonardo Campana | MIA | KAqBNEpWqb | 41824 | 0.2458 | 0.252315 |
| Omir Fernández | NYRB | vzqo78BJqa | 23859 | 0.1843 | 0.417629 |
| Tom Barlow | NYRB | eVq3jBjj5W | 24955 | 0.1236 | 0.246331 |
| Facundo Farías | MIA | KXMe8WD1Q6 | 32878 | 0.1069 | 0.080789 |
From there, you can aggregate these numbers up to the competition-season level, create percentiles, etc. This is the power of Glass Onion: it handles the particulars of player synchronization for you, allowing you to focus on the higher order tasks of player evaluation.
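As a rough illustration of the percentile step, here is a minimal pandas sketch. It reuses the five rows from the composite table above; the `_pct` column suffix is a hypothetical naming choice for this example, not something Glass Onion produces, and ranking players within a single match is only for demonstration.

```python
import pandas as pd

# The five composite rows shown above, keyed by player name.
composite = pd.DataFrame(
    {
        "player_name": [
            "Lionel Messi",
            "Leonardo Campana",
            "Omir Fernández",
            "Tom Barlow",
            "Facundo Farías",
        ],
        "raw_receiving_g+": [0.2935, 0.2458, 0.1843, 0.1236, 0.1069],
        "statsbomb_shot_xg": [0.961997, 0.252315, 0.417629, 0.246331, 0.080789],
    }
)

# Percentile rank of each metric within this sample (1.0 = highest value).
for metric in ["raw_receiving_g+", "statsbomb_shot_xg"]:
    composite[f"{metric}_pct"] = composite[metric].rank(pct=True)

print(composite)
```

In practice you would first aggregate each metric to the competition-season level and rank within a comparable peer group (for example, by position).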
How does Glass Onion work under the hood?
In general, Glass Onion takes a list of SyncableContent objects, generates all possible combinations of pairs from that list, and uses the logic in a specific SyncEngine subclass to synchronize one pair at a time. The results of all pairs are then merged and deduplicated.
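To make the pairwise idea concrete, here is a minimal sketch of the general pattern, not Glass Onion's actual implementation. The three providers, their ID columns, and the `sync_pair` helper are all hypothetical; the helper stands in for the per-pair logic with a plain exact join on `player_name`.

```python
from itertools import combinations

import pandas as pd

# Hypothetical inputs: one small dataframe per provider.
frames = {
    "a": pd.DataFrame(
        {"player_name": ["Tom Barlow", "Cory Burke"], "a_player_id": ["a1", "a2"]}
    ),
    "b": pd.DataFrame(
        {"player_name": ["Tom Barlow", "Cory Burke"], "b_player_id": ["b1", "b2"]}
    ),
    "c": pd.DataFrame({"player_name": ["Tom Barlow"], "c_player_id": ["c1"]}),
}


def sync_pair(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for per-pair logic: exact join on player_name."""
    return left.merge(right, on="player_name", how="inner")


# Every unordered pair of providers: (a, b), (a, c), (b, c).
pairs = list(combinations(frames, 2))
pair_results = [sync_pair(frames[left], frames[right]) for left, right in pairs]

# Merge and deduplicate: stack the pair results, then keep the first
# non-null identifier per provider for each player.
stacked = pd.concat(pair_results, ignore_index=True)
unified = stacked.groupby("player_name", as_index=False).first()
print(unified)
```

Because no provider is treated as the target, a player missing from one source (Cory Burke in provider `c` here) still ends up with identifiers from the other two.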
At release, Glass Onion supports three object types (teams, matches, and players), each with a corresponding SyncEngine subclass. As laid out in this UnravelSports blog post, each object type depends on some other “higher-order object type” to provide unified context and/or limit the search space for synchronization:
Teams: competitions and seasons
Matches: competitions, seasons, and teams (home/away)
Players: teams and matches
The logic involved in synchronizing these objects is as follows (see our Methodology docs for the latest):
Teams:
1. Attempt to join pair simply on team_name (along with competition_id and season_id, if desired).
2. With remaining records, attempt to match via cosine similarity using a minimum threshold of 75% similarity.
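The two team steps can be sketched as follows. This is a simplified illustration under stated assumptions: the text above does not say how names are vectorized, so the sketch assumes character bigram counts, and `match_teams` is a hypothetical helper rather than part of the Glass Onion API.

```python
import math
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count character bigrams in a lowercased name (an assumed vectorization)."""
    cleaned = text.lower().strip()
    return Counter(cleaned[i : i + 2] for i in range(len(cleaned) - 1))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bigram count vectors of two names."""
    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_teams(left: list, right: list, threshold: float = 0.75) -> dict:
    """Hypothetical helper: exact name join first, then thresholded similarity."""
    matches = {}
    remaining = list(right)
    # Step 1: exact join on team name.
    for name in left:
        if name in remaining:
            matches[name] = name
            remaining.remove(name)
    # Step 2: best remaining candidate, kept only if it clears the threshold.
    for name in left:
        if name in matches or not remaining:
            continue
        best = max(remaining, key=lambda candidate: cosine_similarity(name, candidate))
        if cosine_similarity(name, best) >= threshold:
            matches[name] = best
            remaining.remove(best)
    return matches


team_matches = match_teams(
    ["Inter Miami CF", "New York Red Bulls"],
    ["Inter Miami", "New York Red Bulls"],
)
print(team_matches)
```

With the team names from the tables above, "New York Red Bulls" is joined exactly, while "Inter Miami CF" and "Inter Miami" score roughly 0.89 under this bigram assumption and clear the 75% threshold.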
Matches:
1. Attempt to join pair using match_date, home_team_id, and away_team_id (along with competition_id and season_id, if desired).
2. Account for matches with different dates across data providers (timezones, TV scheduling, etc.) by adjusting match_date in one dataset in the pair by -3 to 3 days, then attempting synchronization using match_date, home_team_id, and away_team_id again. This process is then repeated for the other dataset in the pair.
3. Account for matches postponed to a different date outside the [-3, 3] day range by attempting synchronization using matchday, home_team_id, and away_team_id.
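The date-adjustment step can be sketched in pandas as below. This is an illustration only: the `left_match_id` and `right_match_id` columns and the sample fixtures are hypothetical, the sketch shifts dates in one dataset only (the methodology repeats the process for the other), and it omits the matchday fallback.

```python
import pandas as pd

KEYS = ["match_date", "home_team_id", "away_team_id"]


def sync_matches(left: pd.DataFrame, right: pd.DataFrame, max_offset_days: int = 3) -> pd.DataFrame:
    """Join on date and teams, widening the date offset from 0 out to +/- max_offset_days."""
    found = []
    remaining = left.copy()
    # Try offset 0 (exact join) first, then -1, +1, -2, +2, ...
    for offset in sorted(range(-max_offset_days, max_offset_days + 1), key=abs):
        if remaining.empty:
            break
        shifted = remaining.assign(
            match_date=remaining["match_date"] + pd.Timedelta(days=offset)
        )
        hits = shifted.merge(right, on=KEYS, how="inner")
        if not hits.empty:
            found.append(hits[["left_match_id", "right_match_id"]])
            # Only unmatched records move on to the next offset.
            remaining = remaining[~remaining["left_match_id"].isin(hits["left_match_id"])]
    if not found:
        return pd.DataFrame(columns=["left_match_id", "right_match_id"])
    return pd.concat(found, ignore_index=True)


# Hypothetical fixtures: one dated a day apart, one far outside the window.
left = pd.DataFrame(
    {
        "left_match_id": ["L1", "L2"],
        "match_date": pd.to_datetime(["2024-05-04", "2024-05-11"]),
        "home_team_id": ["NYRB", "MIA"],
        "away_team_id": ["MIA", "NYRB"],
    }
)
right = pd.DataFrame(
    {
        "right_match_id": ["R1", "R2"],
        "match_date": pd.to_datetime(["2024-05-05", "2024-05-20"]),
        "home_team_id": ["NYRB", "MIA"],
        "away_team_id": ["MIA", "NYRB"],
    }
)
result = sync_matches(left, right)
print(result)
```

Here L1 and R1 are linked at an offset of one day, while L2 and R2 are nine days apart and would be left for the matchday-based step.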
Players:
1. Attempt to join pair using player_name with a minimum 75% cosine similarity threshold. Additionally, require that jersey_number and team_id are equal for candidate pairs that meet the similarity threshold.
2. Account for players with different birth dates across data providers (timezones, human error, etc.) by adjusting birth_date in one dataset in the pair by -1 to 1 days and/or swapping the month and day, then attempting synchronization using birth_date, team_id, and a combination of player_name and player_nickname. This process is then repeated for the other dataset in the pair.
3. Attempt to join remaining records using combinations of player_name and player_nickname with a minimum 75% cosine similarity threshold. Additionally, require that team_id is equal for candidate pairs that meet the similarity threshold.
4. Attempt to join remaining records using "naive similarity": looking for normalized parts of one record's player_name (or player_nickname) that exist in another's. Additionally, require that team_id is equal for pairs found via this method.
5. Attempt to join remaining records using combinations of player_name and player_nickname with no minimum cosine similarity threshold. Additionally, require that team_id is equal.
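Two of the player steps lend themselves to a short sketch: generating birth date candidates and the "naive similarity" check. Both helpers below are hypothetical illustrations, not Glass Onion functions. The sketch assumes that "normalized" means lowercased with accents stripped, that name parts are whitespace-separated tokens, and it applies the day shift and the month/day swap separately rather than in combination.

```python
import unicodedata
from datetime import date, timedelta


def birth_date_variants(birth_date: date) -> set:
    """Candidate dates: the original, +/- 1 day, and a month/day swap when valid."""
    variants = {birth_date + timedelta(days=delta) for delta in (-1, 0, 1)}
    # A swap only yields a real date when the day can serve as a month.
    if birth_date.day <= 12:
        variants.add(birth_date.replace(month=birth_date.day, day=birth_date.month))
    return variants


def normalize_parts(name: str) -> set:
    """Lowercase, strip accents, and split a name into its parts."""
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return set(unaccented.lower().split())


def naive_match(name_a: str, name_b: str) -> bool:
    """True when every part of one name appears among the parts of the other."""
    parts_a, parts_b = normalize_parts(name_a), normalize_parts(name_b)
    return bool(parts_a) and bool(parts_b) and (parts_a <= parts_b or parts_b <= parts_a)


# A hypothetical birth date recorded as 4 March could appear as 3 April elsewhere.
print(sorted(birth_date_variants(date(2000, 3, 4))))

# Names taken from the ASA and StatsBomb tables above.
print(naive_match("Omir Fernández", "Omir Guadalupe Fernandez Mosso"))
print(naive_match("Luquinhas", "Lucas Lima Linhares"))
```

Under these assumptions, a short name like "Omir Fernández" is found inside its longer form once accents are removed, whereas a nickname such as "Luquinhas" shares no parts with "Lucas Lima Linhares" and falls through to the final step.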
How does Glass Onion achieve our core tenets for an optimal synchronization solution?
Provider-agnostic: objects don’t get synchronized to any single source of truth, nor is the first dataset in the list provided to a SyncEngine the target to build off. Glass Onion synchronizes every possible pair of datasets provided to ensure total coverage.
Flexible: Glass Onion has robust layers of synchronization logic to evaluate edge case after edge case and ensure that the produced synchronized dataframes are as complete as possible
Extensible: we’ve designed the primary classes of Glass Onion (SyncEngine and SyncableContent) to be as generalizable as possible and easy to subclass, so that anyone else can add bespoke handling for other object types (both in and outside of soccer or sport).
Identifier synchronization isn’t the most exciting, bleeding-edge work of soccer analytics, but it’s vital to successful data-driven recruitment and match analysis. Solving these sorts of platform-level issues helps the rest of an analytics team “solve soccer”. But traditionally, once someone solves one of these fundamental problems of soccer analytics, they close ranks: they have a competitive advantage now, why risk it?
We want to change that: providing off-the-shelf solutions to problems like this helps everyone level up their game. We see this package as our first step in a robust relationship with the global soccer data ecosystem. One of U.S. Soccer’s strategic pillars is to Grow the Game: our team’s release of software like Glass Onion levels the playing field for hobbyists and practitioners across the ecosystem, lowers the barrier to entry for newcomers to complex, cohesive data analysis, and enables analysts at all levels to do their best work.
Documentation for the package can be found at the US Soccer Federation’s GitHub: https://ussoccerfederation.github.io/glass_onion/, and more information on the methodology can be found here: https://ussoccerfederation.github.io/glass_onion/methodology/
Editor’s Note: If you want to hear Akshay talking about the package, how they made it, and their motivation for doing so - as well as his time so far at US Soccer - head to the ASA show podcast feed, available wherever you get your podcasts.
