Every soccer data analysis group has had the same jump-scare: they sign data provider contract #2, and suddenly, they need a solution to link teams, matches, and players across their data ecosystem. Individual data sources are easy to ingest: you can tuck them in their own little schemas, write code specific to their little universes, and rest easy that changes won’t come thick and fast without proper notice from your providers. But with those individual data sources come individual standards for tracking objects (players, teams, or matches). In most (if not all) major American sports, working across providers is easy: just use an object’s single source of truth identifier from the league itself. But as every analyst finds out very quickly, there is no public single source of truth identifier for teams, matches, and players in international soccer.

Thus, often-single-person data analytics departments have a massive data engineering challenge on their hands. For the 30 players on a senior roster and just two data sources, they could probably get away with updating a spreadsheet or database table manually. But what about the players on the second team? Those in the academy? Those in future acquisitions? What happens if players get married and change their names? Suddenly, a once very, very small player universe has gotten very, very, very large indeed, and practitioners must have something in place ASAP for the next window or match-week, come hell or high water. Does it matter if that solution is unwieldy in the long term? Not to the director of recruitment or high performance: they just care that the metrics they care about show up in the club web app for every player.

Our group at US Soccer (the Data Analytics team in High Performance) ran into this challenge last year when blending insights from our event and tracking data pipelines. In our preliminary research, we came across two pieces of prior art:

This 2022 blog post by @UnravelSports kickstarted a discussion on how to do player synchronization “right”, evaluating the use of cosine similarity and emphasizing the use of “higher order objects” to synchronize others: unifying teams unlocks the ability to synchronize matches, which unlocks player-level synchronization.
Parma Calcio (who should be commended for being the first organization we’d seen talk about this problem publicly!) offered a solution to this problem for just players using cosine similarity and birth dates.

Blending these two sources of inspiration with our experience at clubs and with various data vendors, we asked ourselves: what does a solution in this space need to do to simplify such a very complex problem across common object types in soccer? We believe an optimal solution should be:

Provider-agnostic: there is no single source of truth, so establish some commonality and synchronize everything to everything else
Flexible: different data sources have different levels of coverage fidelity
Extensible: anyone can add new object types easily

We believe we’ve achieved all three of these criteria with the release of our new open-source project: Glass Onion.

How can I use Glass Onion?

Let’s say you want to build an attacker profile using ASA’s receiving G+ and Statsbomb’s shot xG and you have these two dataframes of metrics for a given match:

    
| asa_player_id | player_name | player_nickname | team_id | team_name | raw_receiving_g+ |
|---|---|---|---|---|---|
| KAqBNEpWqb | Leonardo Campana | Leonardo Campana | MIA | Inter Miami CF | 0.2458 |
| KXMe8WD1Q6 | Facundo Farías | Facundo Farías | MIA | Inter Miami CF | 0.1069 |
| KXMe8gzPQ6 | Luquinhas | Luquinhas | NYRB | New York Red Bulls | 0.0417 |
| Oa5wdYB9q1 | Elias Manoel | Elias Manoel | NYRB | New York Red Bulls | 0.0563 |
| Oa5wdkVYq1 | Jorge Cabezas | Jorge Cabezas | NYRB | New York Red Bulls | 0.006 |
| OlMl9jG05L | Cory Burke | Cory Burke | NYRB | New York Red Bulls | 0.0019 |
| Pk5Lgax7MO | Robbie Robinson | Robbie Robinson | MIA | Inter Miami CF | 0.0072 |
| eVq3jBjj5W | Tom Barlow | Tom Barlow | NYRB | New York Red Bulls | 0.1236 |
| eVq3y38yMW | Benjamin Cremaschi | Benjamin Cremaschi | MIA | Inter Miami CF | 0.0123 |
| vzqo78BJqa | Omir Fernández | Omir Fernández | NYRB | New York Red Bulls | 0.1843 |
| vzqoJrEk5a | Diego Gómez | Diego Gómez | MIA | Inter Miami CF | 0.0847 |
| vzqowm7qap | Lionel Messi | Lionel Messi | MIA | Inter Miami CF | 0.2935 |
        

    
| statsbomb_player_id | player_name | player_nickname | team_id | team_name | statsbomb_shot_xg |
|---|---|---|---|---|---|
| 5503 | Lionel Andrés Messi Cuccittini | Lionel Andrés Messi Cuccittini | MIA | Inter Miami | 0.961997 |
| 32878 | Facundo Farías | Facundo Farías | MIA | Inter Miami | 0.080789 |
| 36149 | Robbie Robinson Belmar | Robbie Robinson Belmar | MIA | Inter Miami | 0 |
| 41824 | Leonardo Campana Romero | Leonardo Campana Romero | MIA | Inter Miami | 0.252315 |
| 222155 | Diego Alexander Gómez Amarilla | Diego Alexander Gómez Amarilla | MIA | Inter Miami | 0.088681 |
| 225209 | Benjamin Cremaschi | Benjamin Cremaschi | MIA | Inter Miami | 0 |
| 12360 | Cory Burke | Cory Burke | NYRB | New York Red Bulls | 0 |
| 23304 | Lucas Lima Linhares | Lucas Lima Linhares | NYRB | New York Red Bulls | 0.048729 |
| 23859 | Omir Guadalupe Fernandez Mosso | Omir Guadalupe Fernandez Mosso | NYRB | New York Red Bulls | 0.417629 |
| 24955 | Tom Barlow | Tom Barlow | NYRB | New York Red Bulls | 0.246331 |
| 112990 | Elias Manoel Alves de Paula | Elias Manoel Alves de Paula | NYRB | New York Red Bulls | 0 |
| 379652 | Peter Stroud | Peter Stroud | NYRB | New York Red Bulls | 0 |
        

You can use Glass Onion to synchronize players like so…

  
    asa_content = PlayerSyncableContent(
        provider="asa",
        data=asa_player_df.drop(["raw_receiving_g+"], axis=1)
    )

    statsbomb_content = PlayerSyncableContent(
        provider="statsbomb",
        data=statsbomb_player_df.drop(["statsbomb_shot_xg"], axis=1)
    )

    engine = PlayerSyncEngine(
        content=[asa_content, statsbomb_content],
        verbose=True
    )
    result = engine.synchronize()
    result.data

  
  

    
| team_id | player_name | asa_player_id | statsbomb_player_id | provider |
|---|---|---|---|---|
| MIA | Benjamin Cremaschi | eVq3y38yMW | 225209 | asa |
| NYRB | Cory Burke | OlMl9jG05L | 12360 | asa |
| MIA | Diego Gómez | vzqoJrEk5a | 222155 | asa |
| NYRB | Elias Manoel | Oa5wdYB9q1 | 112990 | asa |
| MIA | Facundo Farías | KXMe8WD1Q6 | 32878 | asa |
| NYRB | Jorge Cabezas | Oa5wdkVYq1 | 379652 | asa |
| MIA | Leonardo Campana | KAqBNEpWqb | 41824 | asa |
| MIA | Lionel Messi | vzqowm7qap | 5503 | asa |
| NYRB | Luquinhas | KXMe8gzPQ6 | 23304 | asa |
| NYRB | Omir Fernández | vzqo78BJqa | 23859 | asa |
| MIA | Robbie Robinson | Pk5Lgax7MO | 36149 | asa |
| NYRB | Tom Barlow | eVq3jBjj5W | 24955 | asa |
        

…so you can produce this table with both metrics side by side:

  
    import pandas as pd

    # Attach each provider's metric to the synchronized identifiers.
    composite = pd.merge(result.data[["player_name", "team_id", "asa_player_id", "statsbomb_player_id"]], asa_player_df[["asa_player_id", "raw_receiving_g+"]], on="asa_player_id")
    composite = pd.merge(composite, statsbomb_player_df[["statsbomb_player_id", "statsbomb_shot_xg"]], on="statsbomb_player_id")
    composite.sort_values(by="raw_receiving_g+", ascending=False).head()

    
| player_name | team_id | asa_player_id | statsbomb_player_id | raw_receiving_g+ | statsbomb_shot_xg |
|---|---|---|---|---|---|
| Lionel Messi | MIA | vzqowm7qap | 5503 | 0.2935 | 0.961997 |
| Leonardo Campana | MIA | KAqBNEpWqb | 41824 | 0.2458 | 0.252315 |
| Omir Fernández | NYRB | vzqo78BJqa | 23859 | 0.1843 | 0.417629 |
| Tom Barlow | NYRB | eVq3jBjj5W | 24955 | 0.1236 | 0.246331 |
| Facundo Farías | MIA | KXMe8WD1Q6 | 32878 | 0.1069 | 0.080789 |
        

From there, you can aggregate these numbers up to the competition-season level, create percentiles, etc. This is the power of Glass Onion: it handles the particulars of player synchronization for you, allowing you to focus on the higher order tasks of player evaluation.

How does Glass Onion work under the hood?

In general, Glass Onion takes a list of SyncableContent objects, generates all possible combinations of pairs from that list, and uses the logic in a specific SyncEngine subclass to synchronize one pair at a time. The results of all pairs are then merged and deduplicated.
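To make the pairwise idea concrete, here is a minimal standard-library sketch of generating every provider pair and then merging the per-pair matches into deduplicated records. It is not Glass Onion's actual implementation: the function names (`all_pairs`, `merge_pair_results`), the union-find merge, the third `tracking` provider, and the identifiers `A1`, `S1`, and `T1` are all assumptions made up for illustration.

```python
from itertools import combinations


def all_pairs(providers):
    """Return every unordered pair of providers, so no provider is the 'target'."""
    return list(combinations(providers, 2))


def merge_pair_results(pair_matches):
    """Merge per-pair matches into deduplicated groups with a union-find.

    pair_matches: iterable of ((provider_a, id_a), (provider_b, id_b)) tuples.
    Returns a list of dicts mapping provider -> id, one dict per unified object.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in pair_matches:
        parent[find(left)] = find(right)

    groups = {}
    for provider, identifier in list(parent):
        root = find((provider, identifier))
        groups.setdefault(root, {})[provider] = identifier
    return list(groups.values())


# Hypothetical providers and identifiers, for illustration only.
providers = ["asa", "statsbomb", "tracking"]
pairs = all_pairs(providers)

matches = [
    (("asa", "A1"), ("statsbomb", "S1")),
    (("statsbomb", "S1"), ("tracking", "T1")),
    (("asa", "A1"), ("tracking", "T1")),
]
unified = merge_pair_results(matches)
print(pairs)
print(unified)
```

With three providers there are three pairs to synchronize, and the three overlapping pair results collapse into a single record that carries one identifier per provider.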

At release, Glass Onion supports three object types (teams, matches, and players), each with a corresponding SyncEngine subclass. As laid out in this UnravelSports blog post, each object type depends on some other “higher-order object type” to provide unified context and/or limit the search space for synchronization:

Teams: competitions and seasons
Matches: competitions, seasons, and teams (home/away)
Players: teams and matches

The logic involved in synchronizing these objects is as follows (see our Methodology docs for the latest):

Teams:

Attempt to join pair simply on team_name (along with competition_id and season_id, if desired).
With remaining records, attempt to match via cosine similarity using a minimum threshold of 75% similarity.
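The two team steps above can be sketched as an exact join followed by a similarity fallback. This is an illustration under stated assumptions rather than the package's code: the text does not say how names are vectorized, so the character-bigram counts used here are an assumption, as are the function names and the provider-specific team identifiers (`asa-mia`, `sb-mia`, and so on). Only the team names and the 75% threshold come from the article.

```python
import math
from collections import Counter


def bigrams(name):
    """Character-bigram counts of a padded, lowercased name (assumed vectorization)."""
    text = f" {name.lower().strip()} "
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a, b):
    va, vb = bigrams(a), bigrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sync_teams(left, right, threshold=0.75):
    """left/right map team_id -> team_name. Returns a list of (left_id, right_id)."""
    matched = []
    right_remaining = dict(right)
    left_remaining = {}

    # Step 1: simple join on identical team_name.
    for left_id, left_name in left.items():
        exact = next(
            (rid for rid, rname in right_remaining.items() if rname == left_name),
            None,
        )
        if exact is None:
            left_remaining[left_id] = left_name
        else:
            matched.append((left_id, exact))
            del right_remaining[exact]

    # Step 2: best cosine-similarity candidate among leftovers, if above threshold.
    for left_id, left_name in left_remaining.items():
        if not right_remaining:
            break
        best = max(
            right_remaining,
            key=lambda rid: cosine_similarity(left_name, right_remaining[rid]),
        )
        if cosine_similarity(left_name, right_remaining[best]) >= threshold:
            matched.append((left_id, best))
            del right_remaining[best]
    return matched


# Team names come from the tables above; the identifiers are hypothetical.
asa_teams = {"asa-mia": "Inter Miami CF", "asa-nyrb": "New York Red Bulls"}
sb_teams = {"sb-mia": "Inter Miami", "sb-nyrb": "New York Red Bulls"}
team_links = sync_teams(asa_teams, sb_teams)
print(team_links)
```

In this sketch "New York Red Bulls" is linked by the exact join, while "Inter Miami CF" and "Inter Miami" are linked by the fallback because their bigram similarity clears the 0.75 threshold.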

Matches:

Attempt to join pair using match_date, home_team_id, and away_team_id (along with competition_id and season_id, if desired).
Account for matches with different dates across data providers (timezones, TV scheduling, etc.) by adjusting match_date in one dataset in the pair by -3 to 3 days, then attempting synchronization using match_date, home_team_id, and away_team_id again. This process is then repeated for the other dataset in the pair.
Account for matches postponed to a different date outside the [-3, 3] day range by attempting synchronization using matchday, home_team_id, and away_team_id.
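As a rough sketch of the three match steps, the function below tries an exact date, then a date within the [-3, 3] day window, then the matchday. It is a simplification, not the package's code: comparing the absolute date difference stands in for shifting each dataset in turn, and the function name, the row layout, and the match records themselves are hypothetical.

```python
from datetime import date


def find_match(row, candidates, max_offset_days=3):
    """Return (match_id, reason) for the first candidate that satisfies a step."""
    same_teams = [
        c for c in candidates
        if (c["home_team_id"], c["away_team_id"])
        == (row["home_team_id"], row["away_team_id"])
    ]

    # Step 1: identical match_date, home_team_id, and away_team_id.
    for c in same_teams:
        if c["match_date"] == row["match_date"]:
            return c["match_id"], "exact date"

    # Step 2: dates that differ by up to max_offset_days in either direction.
    for offset in range(1, max_offset_days + 1):
        for c in same_teams:
            if abs((c["match_date"] - row["match_date"]).days) == offset:
                return c["match_id"], f"date within {offset} day(s)"

    # Step 3: postponed matches, joined on matchday instead of date.
    for c in same_teams:
        if row.get("matchday") is not None and c.get("matchday") == row["matchday"]:
            return c["match_id"], "matchday"

    return None, "unmatched"


# Hypothetical records from two providers, for illustration only.
provider_b = [
    {"match_id": "b1", "match_date": date(2023, 8, 27), "matchday": 27,
     "home_team_id": "NYRB", "away_team_id": "MIA"},
    {"match_id": "b2", "match_date": date(2023, 5, 10), "matchday": 6,
     "home_team_id": "MIA", "away_team_id": "NYRB"},
]
shifted = {"match_id": "a1", "match_date": date(2023, 8, 26), "matchday": 27,
           "home_team_id": "NYRB", "away_team_id": "MIA"}
postponed = {"match_id": "a2", "match_date": date(2023, 4, 1), "matchday": 6,
             "home_team_id": "MIA", "away_team_id": "NYRB"}

print(find_match(shifted, provider_b))
print(find_match(postponed, provider_b))
```

The first record is recovered through the date window (one day apart), and the second, whose dates are more than a month apart, is only recovered by the matchday step.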

Players:

Attempt to join pair using player_name with a minimum 75% cosine similarity threshold for player name. Additionally, require that jersey_number and team_id are equal for matches that meet the similarity threshold.
Account for players with different birth dates across data providers (timezones, human error, etc.) by adjusting birth_date in one dataset in the pair by -1 to 1 days and/or swapping the month and day, then attempting synchronization using birth_date, team_id, and a combination of player_name and player_nickname. This process is then repeated for the other dataset in the pair.
Attempt to join remaining records using combinations of player_name and player_nickname with a minimum 75% cosine similarity threshold for player name. Additionally, require that team_id is equal for matches that meet the similarity threshold.
Attempt to join remaining records using "naive similarity": looking for normalized parts of one record's player name (or player_nickname) that exist in another's. Additionally, require that team_id is equal for matches found via this method.
Attempt to join remaining records using combinations of player_name and player_nickname with no minimum cosine similarity threshold. Additionally, require that team_id is equal.
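Two of the player steps lend themselves to a short sketch: generating tolerant birth-date candidates and the "naive similarity" check. Both functions below are one possible reading of the description, not the package's code. In particular, treating naive similarity as "every normalized part of the shorter name appears in the longer one" is an assumption, as are the function names and the example birth date. The player names are taken from the tables above.

```python
import unicodedata
from datetime import date, timedelta


def birth_date_candidates(birth_date):
    """Dates within one day of birth_date, plus the month/day swap when valid."""
    candidates = {birth_date + timedelta(days=d) for d in (-1, 0, 1)}
    try:
        candidates.add(date(birth_date.year, birth_date.day, birth_date.month))
    except ValueError:
        pass  # a day above 12 cannot be read as a month
    return candidates


def normalize(name):
    """Strip accents, lowercase, and split a name into its parts."""
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unaccented.lower().split()


def naive_similarity(name_a, name_b):
    """True when every part of the shorter name appears in the longer one."""
    parts_a, parts_b = set(normalize(name_a)), set(normalize(name_b))
    smaller, larger = sorted((parts_a, parts_b), key=len)
    return bool(smaller) and smaller <= larger


# Hypothetical birth date: 4 March 2000 could have been entered as 3 April 2000.
print(sorted(birth_date_candidates(date(2000, 3, 4))))

print(naive_similarity("Omir Fernández", "Omir Guadalupe Fernandez Mosso"))
print(naive_similarity("Lionel Messi", "Lionel Andrés Messi Cuccittini"))
print(naive_similarity("Luquinhas", "Lucas Lima Linhares"))
```

Under this reading, accent differences such as "Fernández" versus "Fernandez" no longer block a match, while a nickname like "Luquinhas" shares no name part with "Lucas Lima Linhares" and has to be resolved by a different step.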

How does Glass Onion achieve our core tenets for an optimal synchronization solution?

Provider-agnostic: objects don’t get synchronized to any single source of truth, nor is the first dataset in the list provided to a SyncEngine the target to build off. Glass Onion synchronizes every possible pair of datasets provided to ensure total coverage.
Flexible: Glass Onion has robust layers of synchronization logic to evaluate edge case after edge case and ensure that the produced synchronized dataframes are as complete as possible
Extensible: we’ve designed the primary classes of Glass Onion (SyncEngine and SyncableContent) to be as generalizable as possible and easy to subclass, so that anyone else can add bespoke handling for other object types (both in and outside of soccer or sport).

Identifier synchronization isn’t the most exciting, bleeding-edge work of soccer analytics, but it’s vital to successful data-driven recruitment and match analysis. Solving these sorts of platform-level issues helps the rest of an analytics team “solve soccer”. But traditionally, once someone solves one of these fundamental problems of soccer analytics, they close ranks: they have a competitive advantage now, why risk it?

We want to change that: providing off-the-shelf solutions to problems like this helps everyone level up their game. We see this package as our first step in a robust relationship with the global soccer data ecosystem. One of U.S. Soccer’s strategic pillars is to Grow the Game: our team’s release of software like Glass Onion levels the playing field for hobbyists and practitioners across the ecosystem, lowers the barrier of entry for newcomers to complex, cohesive data analysis, and enables analysts at all levels to do their best work.

Documentation for the package can be found at the US Soccer Federation’s GitHub: https://ussoccerfederation.github.io/glass_onion/, and more information on the methodology can be found here: https://ussoccerfederation.github.io/glass_onion/methodology/

Editor’s Note: If you want to hear Akshay talking about the package, how they made it, and their motivation for doing so - as well as his time so far at US Soccer - head to the ASA show podcast feed, available wherever you get your podcasts.

American Soccer Analysis

Introducing Glass Onion by US Soccer, An Identifier Synchronization Tool
