Formations are a lie, for the most part. We all know this by now; learning a team’s formation generally tells you very little about how they play. One reason for this is that positions are also a lie. Nani and Diego Rossi are both wingers on paper, but anyone who watches the two knows that they play very differently. We’ve made more specific terms, like inverted winger, to help describe the difference. But what if you hadn’t seen a player play yet? What if you’d like some objective way to define a player’s role beyond just their position? Wouldn’t it be nice if we had a data-driven way of determining a playstyle that we could use to give us an idea of how a winger...wings?

With a clustering analysis that might be possible. In a simple world, a clustering analysis takes a bunch of data and sorts it into groups called clusters. That makes it perfect for what we’re trying to do here - we can, in theory, take a bunch of stats for every MLS outfield player and form groups that represent different roles or playstyles. This would allow us to separate the Rossis from the Nanis. Quantifying defensive contributions with statistics is its own mountain of difficulties, so it’s easiest to start trying to do this with attacking stats.

In the real, not-so-simple world, clustering a dataset ends up being a series of ambiguous decisions. There are a lot of clustering methods and it’s often not clear which is the best for a given problem. On top of that, it’s the analyst’s job to figure out how many clusters to pull out of a dataset and what these clusters actually tell us - neither task is always straightforward. How we go about making these decisions often depends on what we want our clusters to tell us.

Criteria for good attacking role clusters

First and foremost, our clusters need to be interpretable. We should be able to look at each cluster and tell which stats define the role of the players in it. This also means that the roles we define would ideally match our intuition for the roles that most fans and pundits see as contributing to an attack.

Second, our clusters should tell us more than just the player’s on-paper position. Again, if our clusters just tell us ‘yup, that’s a winger’, they’re not helpful. That said, it would also be great if our clusters line up roughly with positions - if we have a cluster that is half CBs and half CAMs, the role of the cluster will likely be hard to interpret as it probably won’t match our intuition for how players contribute to attacks.

Finally, we want our clusters to be relatively stable. If most players end up in a different role from year-to-year, the labels probably aren’t going to be helpful moving forward. To test stability, we can examine two years of data - the 2018 and 2019 MLS seasons - and calculate what percent of players who were in the league both years end up in the same role. However, we don’t want this percentage to be too high - if the roles never change, they aren’t sensitive enough to improvements, declines, position/tactical changes, etc. to be really helpful. The Goldilocks number that’s “just right” here will be somewhat subjective.
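The stability check described above boils down to a simple percentage. Here’s a minimal sketch, assuming each season’s labels live in a dictionary keyed by player name and that role labels are comparable across seasons; the function name and the player data are made up for illustration.

```python
def role_stability(roles_a, roles_b):
    """Share of players present in both seasons who keep the same role.

    roles_a, roles_b: dicts mapping player name -> role label for one season.
    """
    shared = set(roles_a) & set(roles_b)
    if not shared:
        raise ValueError("no players appear in both seasons")
    same = sum(1 for player in shared if roles_a[player] == roles_b[player])
    return same / len(shared)


# Made-up labels, purely for illustration
roles_2018 = {"Player A": "Pivot", "Player B": "Recycler",
              "Player C": "Playmaker", "Player D": "Pure Scorer"}
roles_2019 = {"Player A": "Pivot", "Player B": "Pivot",
              "Player C": "Playmaker", "Player E": "Wide Attacker"}

stability = role_stability(roles_2018, roles_2019)
print(f"{stability:.0%} of returning players kept their role")
```

Players who only appear in one season (Player D and Player E here) are simply ignored, since there’s nothing to compare them against.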

Choosing the data

It’s generally a good idea to normalize counting stats to per 90s; here it’ll help us avoid clustering players into starters vs. bench players. I also went with 1200 minutes in a given season as a cutoff for inclusion in the analysis to avoid any weird patterns resulting from infrequent data for any important stats. Lastly, I was limited to publicly available stats.
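As a rough sketch of those two steps, the snippet below drops players under the minutes cutoff and converts counting stats to per-90 rates. The field names and numbers are hypothetical, and treating exactly 1200 minutes as included is my assumption.

```python
MIN_MINUTES = 1200  # cutoff used in this analysis


def per_90(count, minutes):
    """Convert a season total into a per-90-minutes rate."""
    return count * 90 / minutes


def prepare(players, counting_stats):
    """Keep players at or above the cutoff and add per-90 versions of each stat."""
    rows = []
    for p in players:
        if p["minutes"] < MIN_MINUTES:
            continue  # too few minutes for stable rates
        row = {"name": p["name"]}
        for stat in counting_stats:
            row[stat + "_p90"] = per_90(p[stat], p["minutes"])
        rows.append(row)
    return rows


# Hypothetical season totals
players = [
    {"name": "Starter", "minutes": 2700, "shots": 90, "key_passes": 45},
    {"name": "Sub", "minutes": 450, "shots": 15, "key_passes": 5},
]
prepared = prepare(players, ["shots", "key_passes"])
print(prepared)
```

Note that both hypothetical players shoot at the same per-90 rate; the cutoff removes the substitute because of the small sample, not because the rates differ.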

Aside from those basic guidelines, the final statistics used in the analysis ended up being largely a result of trial-and-error to see what information gives us clusters that meet the criteria above. I decided only to include statistics from open play - including penalties would skyrocket the xG for penalty-takers, including set pieces would inflate crosses for corner takers, etc. Including things like vertical passing distance (more likely for players farther back on the pitch) and miscontrolled passes (more likely for forwards who are under more pressure) allowed us to indirectly include some pitch position information without taking too much focus away from playstyle.

The final clustering ended up using 19 statistics: shots, non-penalty expected goals (npxG), % of possession chains in which a player participated with a shot, key passes (KP), expected assists (xA), % of chains in which a player participated with a KP, xGChain, xBuildup (xB), expected pass percentage (xPass%), Pass%, progressive passing distance, crosses, crosses into the penalty area, successful dribbles, vertical passing distance, miscontrolled passes, times targeted by a pass, % of player passes that were short, and % of player passes that were long.

Data preprocessing and clustering

Preprocessing the data in a couple of ways can really help us define clusters. The first is scaling and centering the data - this is necessary to make sure that some methods don’t consider stats on different scales (for example, percentages vs. counting stats) to be weighted differently. The second is dimension reduction, accomplished here using tSNE. How tSNE works exactly is beyond my expertise to explain, although Eliot McKinley and Cheuk Hei Ho’s article is a good resource if you’d like to learn more. It works well here though, and that’s in part because tSNE involves a perplexity parameter that we can change to fit our needs. Generally speaking, larger perplexity numbers will help focus on macro-level trends and smaller numbers will help focus on micro-level trends. Here, focusing on macro-level trends will give us simple attack/midfield/defense groups, so we can tune the perplexity parameter to help get us closer to focusing on the micro-level trends that yield more interesting and informative clusters.
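To make the scaling and centering step concrete, here’s a small sketch using only the standard library. The two columns are invented numbers; the point is just that a percentage and a per-90 stat end up on the same footing once each is centered on zero and divided by its standard deviation. The tSNE step itself isn’t shown here.

```python
from statistics import mean, pstdev


def standardize(column):
    """Center a list of values on 0 and scale it to unit standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column carries no information
    return [(x - mu) / sigma for x in column]


pass_pct = [70.0, 80.0, 90.0]  # percentage scale (made-up values)
shots_p90 = [0.5, 1.5, 2.5]    # per-90 counting stat (made-up values)

z_pass = standardize(pass_pct)
z_shots = standardize(shots_p90)
print(z_pass)
print(z_shots)
```

The two standardized columns come out identical even though the raw numbers differ by orders of magnitude. For the dimension reduction that follows, a library implementation such as scikit-learn’s TSNE exposes perplexity as a parameter, though that’s only one possible tool.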

How do we know if dimension reduction helps? One way is to measure the clusterability of the data with the Hopkins statistic. The closer to one the Hopkins statistic is, the more amenable to clustering the data is. The tSNE-reduced data was much more amenable to clustering compared to the non-reduced data and data reduced using PCA (another common dimension reduction method).
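For the curious, one common formulation of the Hopkins statistic can be sketched in a few lines. It compares nearest-neighbor distances from random points to the data against nearest-neighbor distances within the data. Several variants of this statistic exist, so treat the version below (distances raised to the number of dimensions, uniform points drawn from the bounding box) as an assumption rather than the exact calculation behind this analysis. The data is synthetic.

```python
import math
import random


def nearest_distance(point, others):
    return min(math.dist(point, o) for o in others)


def hopkins(data, n_samples, seed=0):
    """Hopkins statistic: ~0.5 for uniform data, approaching 1 for clustered data."""
    rng = random.Random(seed)
    dims = len(data[0])
    lows = [min(p[d] for p in data) for d in range(dims)]
    highs = [max(p[d] for p in data) for d in range(dims)]

    # u: distance from uniformly random points to the nearest real point
    uniform = [tuple(rng.uniform(lows[d], highs[d]) for d in range(dims))
               for _ in range(n_samples)]
    u = [nearest_distance(q, data) for q in uniform]

    # w: distance from sampled real points to their nearest *other* real point
    idx = rng.sample(range(len(data)), n_samples)
    w = [nearest_distance(data[i], data[:i] + data[i + 1:]) for i in idx]

    su = sum(x ** dims for x in u)
    sw = sum(x ** dims for x in w)
    return su / (su + sw)


rng = random.Random(42)
# Two tight blobs vs. points scattered evenly over a square
blobs = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
         for cx, cy in [(0, 0), (10, 10)] for _ in range(50)]
scattered = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(100)]

h_blobs = hopkins(blobs, 20)
h_scattered = hopkins(scattered, 20)
print(f"blobs: {h_blobs:.2f}, scattered: {h_scattered:.2f}")
```

The blob data should score much closer to one than the scattered data, which is the same comparison made above between the tSNE-reduced, PCA-reduced, and non-reduced versions of the player data.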

This statistic can also help us decide how many clusters to use. There’s a clear spike at ten clusters here, so that was a natural place to start. However, the ten-group clustering I ended up with included only one role for virtually all strikers, which was not ideal. Expanding to eleven clusters allowed for more specificity in attacking positions without sacrificing much in the way of cluster stability across seasons.

Finally, we have to decide on a clustering method. I compared the results of a number of methods on year-to-year stability, cluster positional diversity, and cluster interpretability. I ended up settling on a hybrid hierarchical/k-means method - this involves first clustering the data using a hierarchical approach, then extracting the centers of the hierarchical clusters and re-clustering the data with a k-means algorithm using those centers. The resulting clusters were the most stable of all the methods I tried (74% of players that were in the league both seasons remained in the same cluster), interpretable, and provided at least two clusters per on-paper position.
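The hybrid idea can be sketched without any libraries. The version below is a toy, not the code behind this analysis: the choice of centroid linkage for the hierarchical step, the function names, and the nine example points are all my assumptions, and the brute-force merging is only practical for small datasets.

```python
import math


def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def hierarchical_centers(data, k):
    """Agglomerative step: merge the two closest clusters until k remain."""
    clusters = [[p] for p in data]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]


def kmeans(data, start_centers, max_iter=100):
    """k-means step (Lloyd's algorithm) seeded with the hierarchical centers."""
    centers = list(start_centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in data
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels, centers


# Three obvious groups of made-up 2D points
data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
        (5.0, 5.0), (5.2, 4.9), (4.9, 5.1),
        (0.0, 5.0), (0.1, 5.2), (0.2, 4.9)]

seeds = hierarchical_centers(data, 3)
labels, centers = kmeans(data, seeds)
print(labels)
```

Seeding k-means this way removes its dependence on random starting centers, which is one plausible reason the hybrid approach would give more stable clusters.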

Turning a cluster into a role

While a clustering analysis will sort players into groups, it’s on us as the analysts to figure out what each of these clusters means. While the tSNE reduction we used is great for helping create clusters, the dimensions it pulls out of the data are essentially uninterpretable, so we can’t use those to help define our clusters.

The simplest, and most thorough, way to interpret these clusters is to examine the distribution of each original statistic within the cluster. This allows us to see exactly which of our original 19 statistics help define each cluster. The strength of this approach is also the problem with it - it’s so much information that it can be hard to express graphically, which is particularly an issue when trying to communicate the features of each role.

To help, we can go back to a dimension reduction method I skimmed over earlier: PCA. PCA reductions give us more interpretable dimensions than tSNE - although those dimensions didn’t help us cluster the data as much as tSNE-reduced dimensions, they can help simplify interpretation.

A standard PCA yields dimensions that are not correlated with each other, but in our case we expect that many dimensions are correlated with each other - for example, dimensions that define a backfield player are probably inversely correlated with dimensions that define a striker. By applying a promax rotation, we end up with dimensions that are allowed to be correlated with each other, yielding 8 dimensions that we’ll use to interpret the clusters. The PCA dimensions, with individual variable contributions simplified a bit for clarity, are:

Finishing - high numbers of shots, npxG, % of chains with a shot
Creating - high numbers of KP, xA, and % of chains with a KP
Dribbling - many successful dribbles
Involvement - targeted often by passes, many progressive passes
Crossing - many crosses and crosses into the penalty area
Buildup - high xB and xGChain
Ball Retention - high pass completion % and expected pass completion %
Verticality - many long passes, few short passes, high vertical passing distance, many progressive passes

It’s important to keep in mind moving forward that these dimensions were not directly used to create the clusters, they’re solely for interpretation and visualization purposes. It’s also good practice to interpret the clusters using the more thorough raw data approach to make sure the ‘story’ the raw data tells us about the clusters matches the story the PCA dimensions tell us - while I did do this and I’ll discuss some specific variables in the description of each role, I won’t graph each individual variable for each cluster in this article for the sake of brevity.

The Roles

Pure Scorer

As the name suggests, these players are on the pitch to score goals. Their passing profiles suggest they are generally the farthest forward on the pitch and often under pressure from defenders - they play an above average percentage of short passes and backwards passes, a below average percentage of long passes, and they often miscontrol passes. Despite being so far forward, they don’t create all that often. Instead, they rate the highest (as a group) in npxG/90. These players are almost entirely center forwards, although a small number of very high-scoring, infrequently crossing wingers end up here as well.

Notable players: Josef Martinez, Chris Wondolowski, Raul Ruidiaz, Jozy Altidore

Notable exceptions: FC Cincinnati’s Allan Cruz is a clear standout, usually listed as a midfielder and not scoring as much as the rest of the players in this cluster. However, his passing profile fits that of an attacking-third player and he doesn’t create much. Cruz did lead Cinci in goals last year from the midfield while registering zero assists, so maybe the label isn’t a fluke.

Hybrid Scorer

These players are somewhere between a Wide Attacker and a Pure Scorer, although they tend to prioritize scoring over creating. Like Pure Scorers, they aren’t targeted by passes incredibly often, and they tend to play backwards passes and miscontrol passes often. However, they tend to cross more often and complete noticeably more dribbles than Pure Scorers. These players tend to be either a) wingers who cross less and shoot more than most wingers or b) strikers that take on defenders and create more than most strikers. Most players who you’d think of as inverted wingers get classified here.

Notable players: Zlatan, Diego Rossi, Jordan Morris, Jesus Ferreira

Notable exceptions: Ulises Segura, Lucas Venuto, Corey Baird and Juan Agudelo all profile like Hybrid Scorers despite not scoring very often. Alberth Elis’ 0.29 xA/90 is unusually high for the group, so he may be a borderline Playmaker.

Playmaker

The smallest and arguably most impactful group, these are players that spearhead the attack in the final third. Along with the deeper-lying Pivots, Playmakers are the most frequent target for their teammates’ passes. Playmakers play more long passes than the scorer groups and Wide Attackers, likely indicating a propensity for dropping into the middle third. They primarily create for others, exhibiting the highest KP and xA numbers of all roles. They also register well above-average in shots and npxG, although not as high as the players they typically create for. These players are generally CAMs, although some high-output wingers that see a lot of the ball (like Nani) will cluster here as well.

Notable players: Nico Lodeiro, Diego Valeri, Kaku

Notable exceptions: The exception to all rules, Carlos Vela shoots (and scores) much more often than other Playmakers. The sole forward in the group, Wayne Rooney was classified as a Playmaker rather than a scorer, probably due to his relatively vertical passing.

Wide Attacker

These players profile similarly to Playmakers, but with fewer touches and less progressive passing. Generally wingers, but sometimes attacking midfielders, these players also tend to get a good number of crosses into the box (hence the ‘Wide’ in Wide Attacker). They tend to create more, cross more and score less than either scorer role, but don’t create as much or see as much of the ball as a Playmaker. They average the riskiest passes of the four roles discussed so far, likely due to their propensity for crossing. This is the closest we get to a ‘classic’ winger.

Notable players: Nico Gaitan, Paul Arriola, Roland Lamah

Notable exceptions: Aleksandar Katai’s high shot and npxG numbers suggest he might be more accurately called a scorer, but his frequent crossing classifies him as a Wide Attacker. Memo Rodriguez’s 0.40 npxG/90 is also an outlier in this group.

Pivot

The passing profile of a Pivot suggests they operate deeper in the midfield - they’re above average in vertical pass distance and long pass % (reflected in scoring higher in the ‘Verticality’ dimension) and below average in miscontrolled passes. They typically aren’t as far back as a defender, though - these are usually CMs or CDMs. Despite usually being removed from the goalmouth action, Pivots receive more passes and progress the ball more than any other group - this is the type of midfielder that you expect every buildup chain to flow through, as evidenced by the high score on the ‘Buildup’ dimension.

Notable players: Michael Bradley, Jackson Yueill, Jonathan Osorio, Marky Delgado, Jonathan Dos Santos

Notable exceptions: LAFC’s ridiculous 2019 goal-scoring rate breaks this cluster. The LAFC pair Lee Nguyen and Mark-Anthony Kaye played many more KP/90 than other Pivots, who normally contribute more indirectly to goals. Another LAFC player, Latif Blessing, also clusters here despite averaging almost 0.5 npxG/90 - an abnormally high number for a Pivot.

Recycler

The vertical passing profile of a Recycler is similar to a Pivot, indicating they operate in similar spaces. However, compared to Pivots, these players miscontrol fewer passes, don’t progress the ball as much, and don’t rate as highly in xB - their job is primarily to keep possession with high-percentage passes rather than launch attacks.

Notable players: Kellyn Acosta, Wil Trapp

Notable exceptions: Darlington Nagbe - although his xGChain and xA are abnormally high for the typical Recycler, his xB is not (likely why he isn’t labeled as a Pivot). This suggests that he contributes more directly to chance creation than a typical player with his passing/shooting profile might.

Support Attacker

The contribution of a Support Attacker is somewhere between a Pivot and a Playmaker, but with far less frequent involvement than either. This is the type of player that chips in with goals and assists occasionally, but it’s probably not their primary contribution to the team. They don’t shoot like scorers, cross like Wide Attackers, create like Playmakers, or contribute to goal buildup like Pivots, but all of these stats are often slightly above-average. This cluster is perhaps the hardest to interpret because no individual dimensions stand out - players in this cluster might benefit the most from the inclusion of defensive information to help define their overall role further.

Notable players: Paxton Pomykal, Benny Feilhaber, Marc Rzatkowski, Djordje Mihailovic

Notable exceptions: Julian Gressel registers an abnormally high xA for this cluster. While Support Attackers are not generally the focal point of an attack, Federico Higuain clusters here despite being targeted by teammates’ passes very frequently - perhaps he’s better thought of as a low-output Playmaker.

Crossing Specialist

This cluster is virtually exclusively fullbacks. As the name suggests, the main job of these players is to whip crosses in. While a small handful might also chip in with shots (see notable exceptions), most don’t shoot the ball much. Their xPass% tends to be on the lower side, probably due to the low likelihood of completing those crosses.

Notable players: Graham Zusi, Jorge Villafana, Edgar Castillo

Notable exceptions: Ronald Matarrita, Jorge Moreira, Anton Tinnerholm and Hassani Dotson contribute an above-average number of shots despite clustering as Crossing Specialists. Diego Polenta ended up here despite being a CB who doesn’t really cross, and I couldn’t really tell you why. There’s always one.

Wide Support

Wide Support players are also generally fullbacks, but they don’t cross the ball nearly as often. They’re usually below average in direct contributions to goals, almost never registering shots and playing few key passes. However, they manage higher percentage passes, fewer miscontrolled passes and better xB numbers than Crossing Specialists - this pattern suggests they serve to retain possession and contribute to buildup in wide areas up the pitch.

Notable players: Ryan Hollingshead, Steven Beitashour

Notable exceptions: Nick Lima and Justin Morrow create a decent number of chances directly, which is uncharacteristic of this role. This suggests they either find ways to create chances from wide positions aside from crosses or are very efficient crossers.

Ball-Playing Defender

These players are almost exclusively CBs. Their high vertical pass distance, relatively high expected pass percentage, frequent long passes and infrequent miscontrolled passes tell us that these players tend to operate in the backfield where they aren’t frequently under defensive pressure. However, these players tend to register decent xB numbers and progress the ball very well while maintaining a high expected pass percentage. These are backfield players who can play passes that don’t risk losing possession, but also frequently contribute to build up play.

Notable players: Bastian Schweinsteiger, Walker Zimmerman, Matt Besler

Notable exceptions: Seattle’s Gustav Svensson gets labeled as a Ball-Playing Defender despite generally playing as a defensive midfielder on paper.

Backfield Outlet

Like Ball-Playing Defenders, these players are almost exclusively CBs and their passing profile reflects that. However, compared to Ball-Playing Defenders, Backfield Outlets don’t progress the ball much, their xB/90 is very low, and they aren’t targeted often by other players’ passes. Their role is likely to stay open as an outlet in the backfield, then to retain possession with high percentage passes. These are mostly players we think of as defenders that aren’t particularly skilled on the ball. If a team’s starting CBs both share this role, that team probably is not building attacks from the back.

Notable players: Aaron Long, Ike Opara, Justen Glad

Notable exceptions: Andy Rose and Danny Wilson are the only midfielders that appear in this cluster.

What can we do with this information?

Like a Paddy’s Pub stress ball egg, it’s a jumping off point. For one, we could fine-tune these clusters (possibly by including defensive and/or positional data) and use them to calculate role-adjusted stats. For instance, Darlington Nagbe’s Key Pass and xA numbers might not be spectacular in general, but they’re off the charts considering he’s otherwise profiled as a Recycler. This would probably create questions about high performance within a role vs. misclassification of the player, but that’s a problem for another day.
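One simple way to picture a role-adjusted stat is a z-score computed within each role rather than across the whole league. The sketch below is only one possible definition, and every name and number in it is invented for illustration (these are not real player values).

```python
from statistics import mean, pstdev


def role_adjusted(players, stat):
    """z-score of `stat` relative to other players who share the same role."""
    by_role = {}
    for p in players:
        by_role.setdefault(p["role"], []).append(p[stat])
    adjusted = {}
    for p in players:
        values = by_role[p["role"]]
        sigma = pstdev(values)
        adjusted[p["name"]] = (
            0.0 if sigma == 0 else (p[stat] - mean(values)) / sigma
        )
    return adjusted


# Hypothetical xA per 90 values
players = [
    {"name": "Recycler 1", "role": "Recycler", "xA_p90": 0.04},
    {"name": "Recycler 2", "role": "Recycler", "xA_p90": 0.05},
    {"name": "Recycler 3", "role": "Recycler", "xA_p90": 0.06},
    {"name": "Recycler 4", "role": "Recycler", "xA_p90": 0.15},
    {"name": "Playmaker 1", "role": "Playmaker", "xA_p90": 0.15},
    {"name": "Playmaker 2", "role": "Playmaker", "xA_p90": 0.30},
    {"name": "Playmaker 3", "role": "Playmaker", "xA_p90": 0.45},
]

adjusted = role_adjusted(players, "xA_p90")
for name, z in adjusted.items():
    print(f"{name}: {z:+.2f}")
```

In this toy data, “Recycler 4” and “Playmaker 1” have the same raw xA per 90, but the first is well above the norm for their role while the second is well below it.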

We could also start exploring the variation within each role by, wait for it, clustering the clusters. The 19 stats used here were chosen to define general roles across the entire pitch, but we can use the existence of these roles to home in on which stats are most important for each and use them to generate clusters within the roles.

Unlike a stress egg, these clusters are also useful as-is; we can use them to get a general idea of how a player or team might play. Has your team signed a midfielder you’ve never heard of? Their attacking role label might help you anticipate what they’ll contribute. Got a friend who’s looking to get into MLS but the only teams they know are LAFC and Atlanta? Role distributions can help describe the attacking style of a team. Let’s play a quick game of name that 2019 MLS team:

While the formation itself won’t help us figure out the team, the labels might. We can tell that the CAM is the focal point of the attack as the main Playmaker. The goals will be coming from the left wing and center forward, although the center forward will probably provide more goals as the Pure Scorer of the team. The Support Attackers at the right wing and left CM occasionally chip in with a goal or assist here and there. The right CM will stay back and direct the buildup, as will the right CB. The left CB won’t contribute much to buildup and the fullbacks will provide width with crosses into the box.

Give up? It’s Seattle. Lodeiro is the Playmaker, Morris and Ruidiaz are the scorers.

Let’s do one more - another 4-2-3-1 team with a very different approach:

In contrast to the last team, the CAM is a Support Attacker and probably isn’t the focal point of the attack. There is no out-and-out Playmaker or Pure Scorer that the attack is focused around, but there are two Hybrid Scorers and a Wide Attacker in the front three. The attacking load is likely spread around these players, with the two Hybrid Scorers doing more shooting and the Wide Attacker doing more creating. The midfield is interesting; there’s no Recycler, suggesting all three CMs contribute somewhat significantly to the attack. The left CM is a Pivot, so he’ll likely stay farther back than the other two Support Attackers, who will occasionally get far enough forward to provide a key pass or shot. The CBs are both proficient on the ball, so this team is likely good at building from the back, and neither fullback crosses much. Any guesses?

The unique midfield pattern is what Bobby Warshaw dubbed the ‘Triple Pivot’ of 2019 FC Dallas. All three midfielders contribute to buildup, with Acosta acting as the deepest-lying general of the three (labeled here as the true ‘Pivot’ of the team). The versatile Jesus Ferreira is a Hybrid Scorer up front, with Barrios providing service (and some goals) from the wing.

Let’s wrap up with some limitations. The clearest limitation is that we’ve learned nothing about how players contribute to defense, obviously a very important part of many players’ overall roles in their team. As mentioned earlier, including defensive data can help fine-tune the process to give us even better clusters, although it might be challenging to select the defensive stats to use.

It’s also somewhat disappointing that we didn’t get more separation among strikers. I suspect that including more information about aerial ability and position while receiving a pass (back-to-goal vs. through balls maybe) might help separate out some more traditional target men, as would reclustering within our Pure Scorer and/or Hybrid Scorer roles on specifically striker-relevant variables. Thankfully, we don’t have to look too far to see what that last one might look like - Sam Goldberg’s article from a few months ago did almost exactly this to identify a small number of top strikers in the league. Notably, they aren’t all classified as Pure Scorers here.

Finally, these clusters are intended to give a rough idea of what roles exist in an attack and how players within those roles contribute to it. While the roles we ended up with seem to make sense for most players, some individual players like those listed as notable exceptions might contribute more (or less) than their label suggests. These cases indicate definite room for improvement. After embarking on this journey, though, I’d go so far as to say that any clustering analysis that seeks to apply hard labels to individual players is bound to not apply to every player in the league - there will likely always be players that don’t fit neatly into a single label. An approach that allows players to occupy multiple roles to varying degrees (such as soft clustering) might help solve that problem to some extent.
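To give a flavor of what soft labels could look like, here’s a sketch that borrows the membership formula from fuzzy c-means: a player’s membership in each role depends on how close they sit to that role’s center relative to the others. The role centers and the player’s coordinates are made up, and this is not something I ran on the actual data.

```python
import math


def soft_memberships(point, centers, m=2.0):
    """Fuzzy c-means style memberships of one point; they sum to 1.

    m > 1 controls fuzziness (values near 1 approach hard labels).
    """
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0 for d in dists):
        # The point sits exactly on a center: give it full membership there
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]


# Hypothetical role centers in a 2D reduced space
role_names = ["Playmaker", "Support Attacker"]
role_centers = [(1.0, 0.0), (0.0, 1.0)]
player = (0.75, 0.25)  # made-up coordinates

memberships = soft_memberships(player, role_centers)
for name, share in zip(role_names, memberships):
    print(f"{name}: {share:.0%}")
```

Instead of forcing this hypothetical player into one box, the output describes them as mostly a Playmaker with a bit of Support Attacker mixed in.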

Even with these limitations, we can get a general picture of how a team/player/winger plays from examining these roles. Plus, the procedure laid out here can be used moving forward with more specific data to help improve on these roles. With labels like these, you don’t even need to watch a team play to be able to argue about them on the internet.

American Soccer Analysis

Defining Roles: How Every Player Contributes to Goals