Learning about football data: Intuition, experience and tomatoes

10th February 2021

By Mark Thompson

With statistics being used in football more than ever, there are many articles out there telling you how to use them. These are useful, of course, but reading how-to guides isn’t always how we learn. Just like goalkeepers learn to be aware of their surroundings after seeing moments like Dion Dublin’s peekaboo goal past Shay Given, we can get better by making a note of others’ mistakes.

In this case, we can learn from my mistakes, and pitfalls I’ve encountered (and hopefully mostly avoided) when working with football data. By the end of this, we’ll have picked up some tricks to avoid these pitfalls, as simple as goalkeepers checking whether there’s anyone behind them before they drop the ball on the floor.

Check the spread

One problem with using data is how easy it is, and the false sense of security that ease can provide. Statistics that sum up a player or team’s performance over a season are tremendously useful – our brains can’t easily do that – but they can flatten context out.

A big influence is the fixture schedule, but a harder-to-spot one can be freak outlier games. If a player gets two tap-ins into an open goal or a team wins, say, 9-0, that can skew the averages, particularly early in the season.

A pretty extreme example of this was Romelu Lukaku’s start to the 2018/19 league campaign. With two goals inside the six-yard box, he picked up chances worth 1.88 expected goals in one match against Burnley (by Twenty3’s model). What’s more, this game was in early September.

Learning about football data: Romelu Lukaku's shot map for Manchester United against Burnley on 2 September 2018.

It meant that, even after ten games of the season, the Belgian’s rate of non-penalty expected goals per 90 minutes was 0.45… but just 0.27 if you took out the Burnley game.
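A quick way to run this kind of check yourself is to work out a per-90 rate twice: once with every game, and once with the biggest single game taken out. Below is a minimal sketch in Python. The function names and the match numbers are made up for illustration (they are not Lukaku's real match log, and not Twenty3's method), and it assumes you have xG and minutes for each match.

```python
def per_90(xg_by_match, minutes_by_match):
    """Total xG divided by total minutes, scaled to 90 minutes."""
    total_minutes = sum(minutes_by_match)
    if total_minutes == 0:
        raise ValueError("no minutes played")
    return 90 * sum(xg_by_match) / total_minutes


def per_90_without_biggest_game(xg_by_match, minutes_by_match):
    """The same rate, recomputed after dropping the match with the highest xG."""
    biggest = max(range(len(xg_by_match)), key=lambda i: xg_by_match[i])
    xg_rest = [x for i, x in enumerate(xg_by_match) if i != biggest]
    minutes_rest = [m for i, m in enumerate(minutes_by_match) if i != biggest]
    return per_90(xg_rest, minutes_rest)


# Made-up numbers for illustration only: five full matches, one of them a freak game.
xg = [0.2, 0.3, 1.8, 0.1, 0.4]
minutes = [90, 90, 90, 90, 90]

with_all = per_90(xg, minutes)
without_top = per_90_without_biggest_game(xg, minutes)
print(f"all games: {with_all:.2f}, without biggest game: {without_top:.2f}")
```

If the two numbers are close, the average is probably telling a fair story. If they are far apart, as in the Lukaku example, it is worth looking at what happened in that one game before quoting the season figure.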

To believe or not to believe

Sometimes something won’t stack up between what you’re seeing yourself when watching games and what data seems to be suggesting. In these cases, it can be tempting to disregard one or the other. However, quite often the ‘truth’ will be somewhere in the nuanced middle between those two binary options.

Take Tottenham Hotspur’s run of good results against big opposition in late November and early December 2020. Across the three matches against Manchester City, Chelsea, and Arsenal, Spurs had chances worth 0.81 expected goals and conceded chances worth 4.43 expected goals. Despite this, they kept three clean sheets, scored four goals, and took seven points from a possible nine.

The eyes may have said that this was José Mourinho back to his best; the stats may have suggested that this was an undeserved streak of success.

Learning about football data: Tottenham's shots for and against maps vs. Manchester City, Chelsea and Arsenal.

However, it’s an established pattern that teams play differently when winning: they are less likely to take shots and more likely to allow them. Spurs went ahead in the fifth minute against Man City and the 13th against Arsenal; meanwhile, a draw away at Chelsea may have suited them well enough to have a similar effect to leading. That goes a long way to explaining Spurs’ low expected goals figures and, to a degree, how many chances their opponents got.

As well as that, the model may not fully account for the unusually high number of players Mourinho had behind the ball, and the graphic above also shows how many of the shots they faced were from distance. It might be a little lucky that Spurs didn’t concede from any of the shots around the six-yard box, but it’s unlikely that it was either pure luck or pure bad modelling that led to this disparity between xG and goals conceded.

That stretch of games was much more nuanced than the eyes or the stats being wrong. In this particular example, I think that it would be fairest to say that the data provided a cautionary note; that while there were reasons why the stats didn’t reflect reality, the stats were also a reminder that there’s another way those matches could have gone.
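One way to put the game-state point to the test is to split a team's expected goals by whether they were winning, drawing or losing when each shot was taken. Here is a rough sketch, assuming you have the minute and xG of every shot plus the minutes of the goals. The function names and the match below are hypothetical, and working in whole minutes is a simplification: anything in the same minute as a goal is counted at the score before that goal.

```python
def game_state_at(minute, goals_for, goals_against):
    """Return 'winning', 'drawing' or 'losing' for a team at a given minute.

    goals_for and goals_against are lists of the minutes in which goals came.
    A goal only changes the state for events in later minutes.
    """
    scored = sum(1 for g in goals_for if g < minute)
    conceded = sum(1 for g in goals_against if g < minute)
    if scored > conceded:
        return "winning"
    if scored < conceded:
        return "losing"
    return "drawing"


def xg_by_game_state(shots, goals_for, goals_against):
    """Sum the xG of (minute, xg) shots under each game state."""
    totals = {"winning": 0.0, "drawing": 0.0, "losing": 0.0}
    for minute, xg in shots:
        totals[game_state_at(minute, goals_for, goals_against)] += xg
    return totals


# Hypothetical match: our team scores in minute 5, then faces these shots.
shots_faced = [(3, 0.05), (20, 0.10), (55, 0.30), (80, 0.08)]
faced = xg_by_game_state(shots_faced, goals_for=[5], goals_against=[])
print(faced)
```

If most of the xG a team concedes comes while they are already ahead, that is a hint that the raw totals are partly a product of game state and not just of being outplayed.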

What’s a tomato?

It took a long time for tomatoes to get a good rep in Europe. When they were first brought over from the Americas in the colonial age, the Europeans didn’t know what to do with them. It didn’t help matters that, particularly in the more northerly areas, it seems that people were trying tomatoes without waiting for them to fully ripen.

Things can be like that with football data too. Even if a statistic seems like it’s straightforward to understand, there might be something about how best to use it that only becomes apparent with familiarity.

Defensive stats are a particularly good example of this, often used in the same ‘most equals best’ way that things like shots are. If players are ordered from most to fewest defensive actions, though, the names near the top of the list tend to be full-backs and defensive midfielders. The former group might be particularly puzzling: should teams move them to a more central area?

Learning about football data: A ranking graphic showing the Premier League players with the most successful defensive actions per 90.

Probably not. The nature of being a full-back — a defensive player by the touchline — means that opponents have fewer options of where to go. Play also tends to go in natural U shapes around the edges of a team’s defensive block, ending up on the wings where it feels less dangerous to potentially lose the ball than in the centre.

As a result, the way that the sport is set up can be a big part of why full-backs make so many defensive actions, and the same may be true for the other players at the top of defensive actions lists. Knowing that might affect the way you choose to use these statistics, and all it took was simply checking which players and what positions are at the top of the list.
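That check takes only a few lines if the data is in a list: sort by the statistic, take the top few, and count the positions. A small sketch follows. The function name, the tuple layout and the rows are all placeholders for illustration, not real players or real figures.

```python
from collections import Counter


def positions_in_top(players, n):
    """Count positions among the n players with the most actions per 90.

    players is a list of (name, position, actions_per_90) tuples.
    """
    ranked = sorted(players, key=lambda p: p[2], reverse=True)
    return Counter(position for _, position, _ in ranked[:n])


# Placeholder rows, not real players or real figures.
players = [
    ("Player A", "full-back", 11.2),
    ("Player B", "centre-back", 8.1),
    ("Player C", "defensive midfielder", 10.4),
    ("Player D", "full-back", 10.9),
    ("Player E", "winger", 5.3),
    ("Player F", "centre-back", 8.7),
]

top_positions = positions_in_top(players, 3)
print(top_positions.most_common())
```

If one or two positions dominate the top of the list, the statistic is probably measuring role as much as quality, and it may be fairer to compare players only against others in the same position.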

Checking for outliers and sorting lists isn’t something that needs doing before every use of stats, partly because doing them just a couple of times helps to develop an instinct. Like in cooking, experience helps you to know when something’s off.

It took a few centuries of tomato-growing before Italy came up with a decent Pomodoro sauce. Hopefully, this blog will mean you’ll be able to get an intuition about football stats a little quicker.

In this article, we used Wyscout data and all graphics were produced in the Twenty3 Toolbox.
