The Poundshop Messi Machine
Introducing a player-similarity tool to help navigate the transfer market on a budget
(If you’re reading this in your email inbox, Substack tells me it’s too long to be displayed in full - sorry about that. And if you’re reading this in any format but are on a tight schedule, the short version is that I’ve made this.)
Perhaps it’s just a textbook case of Twitter warping my perception of reality, but it feels like the buzz, clamour and occasional full-blown moral panic that surrounds the summer transfer window has snowballed over the last couple of years to the point where its level of public engagement seems to dwarf that even of the sport itself at times.
For the uninitiated, from May through August plus January’s slightly less popular afterparty, football’s sizeable sect of transfer-zealots observe their annual ritual of flooding Twitter with demands that their club “announce” the arrival of whichever messianic saviour-figure happens to be in vogue that day. Last year, the whirlwind surrounding Arsenal’s supposed pursuit of Houssem Aouar became so overwhelming that I muted the word ‘announce’ on Twitter to try to swerve at least its most incessant elements (a policy which, despite a promising period of initial success, proved bitterly flawed as the new season kicked off and I was left frantically googling dozens of these new, mysterious faces who’d arrived in the league under my self-imposed cloak of announcement-darkness).
If I was a real writer, this might be my cue to leap earnestly into an evocative thinkpiece, perhaps describing it all as a “phenomenon” and using words like “cultural spectacle” and “zeitgeist” to explain how “the burgeoning cultural spectacle of the modern transfer market” has something to do with the “zeitgeist”.
Maybe one day. For now, though, I’m simply going to throw my hat in the ring and join in the circus.
Behold: The Poundshop Messi Machine.
In my first two pieces here, I presented a cluster analysis which sought to define and investigate eight distinct playing styles by which we could categorise football coaches over the last four seasons. The logical next step, then, was to attempt a similar project on the individual players who make up those teams and see what we can do with that information too. So that’s what I’ve done, and later on I’ll be demonstrating how it can be used to help navigate the transfer market. First, though, I’ll go through some of the mechanics involved behind the scenes.
I won’t go into too much detail, as I did a fair bit of that last time and the principles are essentially the same: I programmed a ‘dimension reduction’ algorithm (called UMAP) to read through a load of stats (taken from, as ever, StatsBomb via FBref) and assign a set of graph coordinates to every outfield player who clocked at least 900 minutes (a.k.a. ten ‘90s’) of playing time across any of the last four seasons, and then trained a clustering model to find distinct groupings within the data. (For those interested, I used a Gaussian Mixture Model this time, as the K-Means technique I had applied to the coaches tended to produce somewhat untidy results.)
As an initial step, I instructed the model to separate the 5,948 players in the dataset into four more general groups - or ‘macro clusters’ as I’ll slightly pompously refer to them from now on - which were then themselves divided into further, more specific clusters to give us a total of 11. We won’t be worrying too much about the macro clusters, but before we speed off into the distance, let’s have a quick glance in their direction.
There’s no real need to dig too much deeper here, as the cluster names alone do a good enough job of explaining themselves - ‘Wide D/M’ is shorthand for ‘Wide Defender / Midfielder’, and the other three should hopefully sound pretty familiar.
(Edit: I forgot to mention here that all defensive stats are possession-adjusted)
So, as mentioned above, the clustering process was then performed again on each of these four macro groups to create our final selection, with all of them being split into three further clusters other than the Defenders, who were divided into just two. The reason for this elongated process is that directing the model to search for anything more than five or six groups right from the off made for a fairly uneven spread of clusters across the graph/map (often this would result in clumps of small clusters in the busier, less clearly delineated areas of the map as the model saw no immediate need to break down the Defenders or Wide D/Ms any further).
Anyway, here, then, are our 11 clusters:
There’s a lot of information there, so, if you’re in a rush, here’s a quick line or two on each of them:
Defensive Defender - For fans of tackles, clearances and long passes; not for fans of any part of the pitch outside the defensive third.
Adventurous Defender - All of the above, but more midfield action and partial to the occasional ball-carry etc too.
Defensive Wide D/M - Essentially just remove the ‘M’ from D/M here and you get the picture. A cautious full-back, if you like.
Controlling Wide D/M - They play in wide areas, but aren’t quite as hell-bent on repeatedly flying up the wing to pump in crosses as the following cluster. There’s a healthy dollop of Guardiola full-backs in here if that paints a clearer picture for you, and see the high pass completion and short passing rates as well. Interestingly, you get the occasional unexpected name in here, too - Christian Eriksen (Inter 20/21) and Emile Smith Rowe (Arsenal 20/21) are both lurking on the outskirts, for example. You can’t get it 100% right 100% of the time, I suppose.
Direct Wide D/M - Cross, run, smash, grab. Generally speaking, these are what we think of as attacking full-backs, wing-backs and a handful of more ‘traditional’ touchline-hugging wingers too.
Midfield Aggressor - What it says on the tin, really. Plays in the middle of the pitch and likes to do things like tackles, headers and pressures, and keeps the passing fairly simple.
Deep Playmaker - Controls the game from slightly deeper in midfield. Not afraid to hit a longer ball or switch of play to drive the team from defence to attack, but tends not to bomb on and join in the fun in the final third and penalty area for the most part.
Advanced Playmaker - There’s sometimes little to separate the midfield playmakers, especially around their shared border, but generally speaking they do their playmaking somewhat further up the pitch and are more aggressive with their ball-carrying.
Attacking Creator - Like the Advanced Playmaker, but typically camped out in the final third. Home to plenty of ‘classic number 10s’ as well as some at the more attacking end of the wide-midfielder scale.
All-Round Forward - Also a perfectly adept creator, but with more shots and dribbles thrown in the mix. Ranging from inside-forwards like Mohamed Salah or Sadio Mané, to the kind of central striker, like Harry Kane or Romelu Lukaku, who’s also happy dropping a little deeper to help make the play too.
Line-Leader - The classic number 9. Takes shots, scores goals, rarely moves from the penalty area and piles the pressure on defenders when out of possession.
So what does all this have to do with transfers? Well, having categorised each individual player in this way, if you then take a step back to cast your eye over the club level once again, you’ll see before you a nice starting point for a little squad-building project. Looking at the ‘demographics’ of each club, i.e. what the spread of clusters looks like across their playing staff, you can begin to piece together an idea of which areas of their squad may be lacking - or indeed which are perhaps overpopulated.
On top of that, we can then return to the previous project we undertook and make use of the coaching clusters we found too. With those added into the equation, it’s possible to paint a picture of what kind of players teams tend to have when trying to play a particular style of football. And with the help of the Poundshop Messi Machine, paint a picture we will.
Using a club close to my heart by way of example, we can see that Arsenal’s squad last season was stocked full of Deep Playmakers (3), but the shelves were very much bare when it came to their Advanced counterparts. On top of that, though they did count one Attacking Creator amongst their ranks, that was 31-year-old Willian who, I’m sorry to say, probably shouldn’t be the club’s long, medium, or short-term solution here. For teams in the Relentless cluster, which one presumes is Mikel Arteta’s desired style of football, the average count is 1.42 for the Deep Playmaker, 1.79 for the Advanced Playmaker, and 1.83 for Attacking Creators, while in the Positive group, where Arteta currently finds himself, it’s 1.27, 1.37 and 1.54.
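For anyone who'd like to see that comparison spelled out, here's a minimal sketch of the 'demographics' check. It assumes the squad is just a list of (player, cluster) pairs; the `squad_gaps` helper and the placeholder player names are my own inventions for illustration, not how the site actually does it. Only the cluster counts and the Relentless averages come from the paragraph above.

```python
from collections import Counter

# Hypothetical squad list of (player, cluster) pairs. The names are
# placeholders; only the counts per cluster match the article.
arsenal_squad = [
    ("Player A", "Deep Playmaker"),
    ("Player B", "Deep Playmaker"),
    ("Player C", "Deep Playmaker"),
    ("Willian", "Attacking Creator"),
]

# Average number of players per cluster for teams in the Relentless
# coaching cluster, as quoted in the text.
relentless_avg = {
    "Deep Playmaker": 1.42,
    "Advanced Playmaker": 1.79,
    "Attacking Creator": 1.83,
}


def squad_gaps(squad, style_avg):
    """Return squad count minus style average for each cluster.

    Negative numbers suggest an under-stocked cluster, positive
    numbers an over-stocked one.
    """
    counts = Counter(cluster for _, cluster in squad)
    return {
        cluster: round(counts.get(cluster, 0) - avg, 2)
        for cluster, avg in style_avg.items()
    }


gaps = squad_gaps(arsenal_squad, relentless_avg)

# Print the most under-stocked clusters first.
for cluster, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{cluster}: {gap:+.2f}")
```

Swapping in the Positive averages (1.27, 1.37 and 1.54) gives the same comparison against the group Arteta currently sits in.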
The numbers, then, suggest that Arsenal’s midfield might be a little unbalanced and their attack lacking a creative flourish - a suggestion which fits pretty comfortably alongside the public perception of where the team needs to improve.
So, while the club could definitely do with bringing at least two players in to fill these gaps, let’s turn our attention to just the one and agree, for the sake of argument, that Arsenal should prioritise an Attacking Creator (or ‘AC’ from now on) to strengthen their squad for the coming season - what then? First, let’s return to the heatmap to look at what characterises the AC, and then from there, see if we can use that to find the cream of the attack-creating crop.
Aside from their rates of final-third touches, you can see that the AC scores reasonably highly in terms of switches, through balls, progressive passes, passes into the penalty area, and a few things to do with ball-carries. On top of that, there are, of course, other metrics out there that weren’t involved in the clustering process by which we can judge players. In this case, things like assists and xA (expected assists, a stat which counts ‘the xG which follows a pass that assists a shot - this indicates a player's ability to set up scoring chances without having to rely on the actual result of the shot or the shooter's luck/ability’, to use FBref’s definition) will be of particular use in measuring their actual productive output.
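To make the difference between assists and xA concrete, here's a small sketch that follows the FBref definition quoted above quite literally: xA adds up the xG of every shot a player's pass set up, whether or not the shot went in. The shot log, its layout and the two helper functions are hypothetical, made up purely for illustration.

```python
# Hypothetical shot log: (player whose pass set up the shot, or None,
# xG of the shot, whether the shot was scored).
shots = [
    ("Creator", 0.30, False),
    ("Creator", 0.10, True),
    (None, 0.05, False),
    ("Other", 0.50, True),
    ("Creator", 0.20, False),
]


def expected_assists(shot_log, player):
    """Sum the xG of every shot set up by the player's pass."""
    return sum(xg for passer, xg, _ in shot_log if passer == player)


def assists(shot_log, player):
    """Count only the set-up shots that were actually scored."""
    return sum(1 for passer, _, scored in shot_log if passer == player and scored)


creator_xa = expected_assists(shots, "Creator")
creator_assists = assists(shots, "Creator")

print(f"xA: {creator_xa:.2f}, assists: {creator_assists}")
```

In this made-up log the creator's best chance (0.30 xG) was missed, so the assist count undersells the quality of the chances they laid on, which is exactly the gap xA is meant to cover.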
And indeed, the assists and xA columns are exactly where we’ll head first - in our case, the task of figuring out the position’s best and brightest is actually pretty simple. As shown in the screenshot above, I’ve taken the executive decision on your behalf to hit the “toggle percentile ranks” button, and I’ll quickly explain what that means before moving on.
Though many of the metrics used here are essentially represented in terms of percentages anyway (% of touches in/passes into X area of the pitch, number of X per 100 touches etc.), it can be hard to judge what a “good” score is for some of them. Take the metric regarding through balls as a % of passes - is 1.5% a high or low score? It’s not really the kind of stat you’ll see out in the wild anywhere, so there’s no immediate frame of reference. Transposing everything on to the same 0-100 scale, then, sometimes makes life easier - and you can of course always return to the ‘raw’ figures when taking a more detailed look further down the line. Sticking with that 0-100 scale for now, though, the numbers here show a player’s rank, in percentage form, compared to all other members of their macro cluster for a given stat - so in this case, an AC scoring 100 for a metric means they do more of that thing than any other Attacker, including All-Round Forwards and Line-Leaders.
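As a rough illustration of that 0-100 scale, here's one simple way a percentile rank within a macro cluster could be computed. The exact formula behind the site's button isn't spelled out here, so treat this as an assumption: each player's rank is the share of the other players in the group with a strictly lower value. The player labels and through-ball figures are invented.

```python
def percentile_ranks(raw):
    """Map each raw value to a 0-100 rank within its group.

    The rank is the percentage of the *other* players in the group
    whose value is strictly lower, so the group leader scores 100
    and the player with the lowest value scores 0.
    """
    n = len(raw)
    if n < 2:
        # A group of one has nobody to be compared against.
        return {name: 100.0 for name in raw}
    ranks = {}
    for name, value in raw.items():
        below = sum(1 for other in raw.values() if other < value)
        ranks[name] = 100.0 * below / (n - 1)
    return ranks


# Hypothetical 'through balls as % of passes' figures for five players
# who all belong to the same macro cluster.
through_ball_pct = {"A": 0.4, "B": 1.5, "C": 0.9, "D": 2.1, "E": 0.2}

ranks = percentile_ranks(through_ball_pct)

for name in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{name}: raw {through_ball_pct[name]}%, rank {ranks[name]:.0f}")
```

In this toy group, the 1.5% from earlier lands at a rank of 75, which is a lot easier to read at a glance than the raw figure.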
Moving on, as you can see, after sorting the data by xA, we immediately stumble upon an incredibly appetising list of players. Which is why I picked the Attacking Creator as an example: again, it’s fairly straightforward to judge the quality of players in attacking clusters, whereas it can take a little more brain power to perform the same task over other areas of the pitch - and my brain isn’t in a very powerful mood today. Pick out pretty much any one of these players and you’ll find they rate highly across the majority of metrics listed earlier, as well as a couple of others I’ve chucked into the mix relating to pressures and tackles. (I have a low tolerance threshold for boredom, and, as such, I’d prefer not to have to sit through yet another season of watching a club who rank in the bottom half of the table for high-pressing metrics. Perhaps an AC with a proven history of defensive enthusiasm could help Arsenal leapfrog the likes of Burnley here, at least. Otherwise, I may be forced to request a January loan move to Leeds. This is an ultimatum.)
As lovely a list as it is, though, deep down, we have to admit to ourselves that it’d be a waste of time to stop our search right this second and get firing out those “announce di maria/de bruyne/müller” tweets. Let’s not forget, of course, that earlier this year Arsenal and the 11 other Founding Clubs were left with no option but to pursue a breakaway Super League after all their wealth was violently seized by UEFA communists to fund a failed annexation of the western-Uzbek desert-lands. And Messi’s just become unavailable anyway. Our search, then, must now turn its gaze away from the alluring lights and sparkling riches of the big city, and towards the untapped potential and discounted office space of the up-and-coming satellite town.
While we could now set off down the data table in search of a more budget-friendly option, there’s 659 names in there and this article can’t go on for that much longer (spoiler: it might). Luckily, there’s another route we can take. Cast your mind back to the clustering process and remember that the word ‘map’ made a couple of appearances there. Just as I was able to use a map to ascertain that western-Uzbek desert-lands do indeed border UEFA’s Kazakh territory, using a dimension reduction algorithm to flatten down a long and wide array of player data into a plottable 2D form allows us to measure the distance between each point on a graph and find every player’s statistical lookalikes.
One problem with that, though, is that the map used earlier didn’t take into account stats like xA or xG as the clustering process was designed not to find groups of players who are similarly good, but similar stylistically. Those things would come in handy now, though, so I ran the algorithm again, this time including xG and xA, and produced a new ‘map’ with which we can find players who are alike both in style and in their productive output.
The next step, then, is to plug a few of the names from earlier into the machine and see if it gives us anything to chew on. To help further, I’m going to filter results by age and by season too. Arsenal’s three signings so far this window - Ben White (23), Sambi Lokonga (21) and Nuno Tavares (21) - are all pretty young, and so I’m going to say this is the kind of age bracket we should be looking at - up to and including 25, to be exact. In terms of seasons, I’ve opted for excluding 17/18 and 18/19 to narrow the list down and speed things up - it’d also be fine to leave them all in place, but we may as well keep things within recent memory for now. Here goes:
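Before the results, here's a minimal sketch of what a lookup like this could look like under the hood: take a player's coordinates on the 2D map, throw out anyone who fails the age and season filters, and sort whoever's left by straight-line distance. The `PlayerSeason` class, the `lookalikes` function, the coordinates and the placeholder names are all hypothetical; the real machine may well do things differently.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PlayerSeason:
    name: str
    season: str
    age: int
    x: float  # first map coordinate
    y: float  # second map coordinate


def lookalikes(target, pool, max_age=25, seasons=("19/20", "20/21"), top_n=3):
    """Return the nearest player-seasons to the target on the 2D map.

    Candidates must be a different player, aged max_age or under,
    and from one of the allowed seasons.
    """
    candidates = [
        p for p in pool
        if p.name != target.name and p.age <= max_age and p.season in seasons
    ]
    candidates.sort(key=lambda p: math.dist((target.x, target.y), (p.x, p.y)))
    return candidates[:top_n]


# Invented map coordinates for a handful of placeholder players.
target = PlayerSeason("Star Creator", "20/21", 29, 0.0, 0.0)
pool = [
    target,
    PlayerSeason("Close Veteran", "20/21", 31, 0.1, 0.1),     # too old
    PlayerSeason("Young Lookalike", "19/20", 22, 0.3, 0.4),
    PlayerSeason("Old Season Match", "17/18", 23, 0.2, 0.0),  # season excluded
    PlayerSeason("Distant Prospect", "20/21", 20, 3.0, 4.0),
    PlayerSeason("Near Prospect", "20/21", 25, 1.0, 0.0),
]

shortlist = [p.name for p in lookalikes(target, pool)]
print(shortlist)
```

Note that the closest point on the map overall (the veteran) never makes the list, because the filters are applied before the distances are compared.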
As you can see, the results are all pretty much identical no matter whose name we enter, which is nice and helpful. Now, as you’ll see, every time you plug a player into the machine, a corresponding ‘pizza chart’ of percentile ranks is also produced (with thanks to the excellent mplsoccer package) to help visualise the data contained in the big old table we were sifting through before, and you can customise this by changing the metrics you want to look at as well as overlaying a second player on top to compare their scores (as long as they’re both members of the same macro cluster). At the top of all but one of the doppelganger-shortlists is Christopher Nkunku (Leipzig 19/20), so we should definitely add him to the comparison pile. A little further down, recurring characters like Daichi Kamada (Frankfurt), Adam Ounas (Nice and Crotone but owned by Napoli) and Aleksandr Golovin (Monaco) would definitely be worth looking at too, but I can’t be bothered.
The 3-capacity shortlist I’ve just decided we’ll be limited to, then, will be Christopher Nkunku (Leipzig 19/20), Yusuf Yazıcı (Lille 20/21) and my old nemesis Houssem Aouar (Lyon 20/21). Here’s how they stack up against some of the bigger names:
My first course of action will be to immediately exclude Houssem Aouar, owing particularly to his lack of passes into the penalty area and attacking third pressures, and I do so with sincere apologies to those who missed out on this shortlist just so I could give him another mention. When it comes to the other two, meanwhile, it looks like a pretty close-run thing, as is made particularly apparent when comparing them directly:
As you can see, we also have data for Christopher Nkunku’s 20/21 season with Leipzig. His latest season, it’s worth noting, saw him move into the All-Round Forward cluster. This shouldn’t be seen as an issue, as the clusters don’t actually exist in the real world, of course, and it’s hardly uncommon for players to play different types of roles across their career anyway. In fact, it’s tempting to look at the (perceived, at least) tweak in playing style as a positive; as a sign that he’s adaptable as a player and can clearly succeed in both camps, judging by those xA numbers.
If we open a new browser tab and give the two of them a google, we can see that both signed for their current clubs in summer 2019, so not much difference there either. If we go to their Transfermarkt pages, however, we can see that they estimate Nkunku’s current market value at £38.7m. For Yazıcı, though, they reckon it’s £18m. Jackpot. Add on to that the fact that (and I’m so sorry about the absolute vulture I’m about to shapeshift into) Lille, like many French clubs at the moment, are in a bit of a sticky situation financially, and a deal for Yazıcı starts to sound far more achievable than one for the highly-rated former PSG player.
So there we have it. There’s a few weeks still until the transfer window SLAMS shut, during which I’ll be sure to revisit my Poundshop Messi Machine to have a look at other clubs or players of particular interest - and feel free to give me some ideas for that too. There’s also tonnes of ways I know the whole site can work better (for instance I’ve spent most of the last 1-2 weeks trying, but so far failing, to figure out a smooth, user-friendly way to add filters for all the metrics across the data table) so this is by no means the finished article. (Update: data table now comes with more filters than anyone could ever ask for.)
This article’s finished, though. I hope you enjoy fiddling about with all the buttons and sliders, and don’t forget - you heard it here first: #AnnounceYazıcı
(#ButIfYouCanAffordNkunkuMaybeAnnounceHimInstead)