Published On: August 3rd, 2021

The data you need to win the Olympics if you go NUTS

When everybody’s moved by the contagious joy of two athletes making history by agreeing to share an Olympic gold medal, the data analyst thinks: “two gold medals for the same competition? is this going to break my dashboard?”

“Can we have two gold? ?”

“Let’s make history, man”#GOLD #Tokyo2020

— Giuseppe Famà (@FamaNelMondo) August 1, 2021

In my case, I was worried it would break my parsing script. Fortunately, it didn’t, so after a quick check I could share their joy with relief. And then go back to wondering about the fairness of the Olympic medal table.

Armchair sports fans traditionally like to peddle with the Olympic medal table, asking themselves the customary questions… is it fair that the ranking is usually based on who gets the most gold medals? What if the ranking was instead based on some sort of weighted average of the medals? If — say — a gold medal is worth twice as much as a silver, and a silver twice as much as bronze, then would my favourite flag wave at the top of the ranking?

But just changing the value of medals brings you only so far, and the medal table may still look disagreeable. What else can be done to shake it up a bit further?

Being mildly annoyed by the excessive waving of national flags, I decided it would be nice to set up a medal table based on the number of medals won by regions, not by countries.

So… there we go.

Going NUTS, and beyond

How do we attribute a medal to a region? There’s no perfect approach, but place of birth of athletes should be meaningful enough in most cases. So all that is needed is find the place of birth of all medalists, geocode it, associate it with administrative entities of reasonable size, et… voilà!

So here’s how I went about it, considering that I wanted to have data unencumbered by copyright to share the fun.

  1. get all the Olympic medalist by country, sport, and event from the List of 2020 Summer Olympics medal winners available on Wikipedia
  2. parse all the tables on that page to extract the relevant information, including links to the Wikipedia page of each medalist (ultimately, this proved to be the most painful part)
  3. query the Wikipedia API to get the Wikidata ID of each medalist
  4. proceed in the much more data-friendly Wikidata, and get the place of birth of each medalist
  5. get from Wikidata the coordinates of the place of birth of each medalist
  6. match the coordinates to the administrative units where they are located. Data from Wikidata may have some inconsistencies due to different administrative subdivisions around the world, so on top of the ones included in Wikiata I did the geo-matching with NUTS regions — a standardised classification of administrative entities defined by the European Union (here’s a list of countries covered by NUTS, and here’s the geographic dataset for download)
  7. (optional) get from Wikidata all other sorts of data about the medalists or the place where they were born

The dataset

The resulting dataset is now available on GitHub. You can also check out the script used to retrieve the data.

How’s the quality of the dataset? Not too bad, in particular, for Europe, as the place of birth of most medalists is recorded in Wikidata. Globally, Wikidata has recorded the place of birth for about 80 per cent of medalists. It’s a good starting point.

The good news is that you can join the fun! :-)

Of course, you can contribute by adding to Wikidata the missing information, which is often available online.

And then… see who you can place at the top of the medal table by playing around with the data.

You can find the dataset, the code and all details about data parsing in this repository.

If you want to have a quick look at the data based on place of birth, why not head on to our online tool, latlon2map, that makes it easy to explore quickly all tabular data that have a longitude and latitude column. Upload the csv file there, set the right latitude and longitude columns, and you’re good to go.

Screenshot of the web interface of latlon2map

As for me, tired with all the data processing, I took a lazy if controversial approach. I decided that I will count each medal and each medalist the same. Bronze and gold look just as shiny to me. The official table counts just one medal for a team win, but I think that each piece of metal is a piece of joy for the person who brings it home: if eight people row together to get a medal and each can place a medal around their neck, then that’s eight medals for me. And then… let’s see how things go by focusing only on NUTS regions in Europe.

So… here’s my medal table for NUTS2:

Table 1: Total medals by place of birth of medalist at the 2020 Summer Olympics / NUTS2

So what sport is it that folks in Ile-de-France are so good at? Mostly judo and fencing, it appears.

Table 2: Medals by sport for medalists born in Ile-de-France
Map 1: Olympics medals by NUTS2 region

And what if we adjust by population… would the ranking change significantly?

Table 3: Medals per million residents of the NUTS2 region where each medalist at the 2020 Summer Olympics was born / NUTS2

I like this one. I like lazy summer afternoons.

If you want to play around with this dataset or expand on it, you can download it as a .csv file from this repository, where you find also a detailed description of the procedure used to generate it.

P.S. The data in this post have been last updated on: 2021–08–03 15:14:28.