Published On: March 6th, 2023

How we used OpenStreetMap and Wikidata to map street names across Europe – Part 2

We have been toying with the idea of doing a large-scale open-ended analysis of street names across Europe for quite some time.

Back in 2019, I quickly built an R package that facilitated retrieving the names of streets in a municipality, tried to guess which of them were named after people and what those people's gender was (simplistically operationalised as “sex at birth”), and provided a basic interface to check the data (here’s a walkthrough of the process).

We never really pursued the idea in this format, for a number of reasons. Among them is the fact that, in the following years, a number of initiatives used slightly different approaches to arrive at the same results.

We felt that, to give more context, more than the gender dimension needed to be taken into account. For example, the extremely small number of streets dedicated to women in most cities is striking, but in the case of Italy it is sometimes even more striking how many of those women were religious figures or Catholic saints.

We decided to work in this direction, and eventually released the first version of Mapping Diversity, covering more than 20 cities across Italy (the main city of each region).

In terms of data, its key features were: street names taken from OpenStreetMap; tentative automatic matching of street names to the Wikidata identifier of the person a given street is dedicated to; a manual check limited to:

  • confirming whether a street was dedicated to a human or not
  • confirming whether that person was a woman
  • if so, matching the street to the relevant Wikidata identifier (which makes it possible, among other things, to show date of birth and profession, and often to include a small portrait in the interactive visualisation)
  • and assigning it to one of a small number of predefined categories (e.g. religious figure, cultural figure, etc.)

This was time-consuming (have you ever considered that Rome, for example, has more than 16,000 named streets? Consider manually checking all of them), but overall, not devastatingly so. Automatic matching functioned reasonably well and, besides confirming whether the subject was a human or not, one had to give more thought to the matching only if the subject was a woman (each woman had to be matched to a Wikidata identifier and a category). But since so few streets are dedicated to women… in terms of data cleaning, it was not too bad. Besides, we worked on the assumption that if a street was dedicated to a given person in one city, then the same match would apply in all Italian cities: not a 100 per cent safe bet, but we felt it was acceptable to get a few individuals wrong out of tens of thousands of matches, as this greatly reduced the number of streets to check (each city in Italy will have a street dedicated to “Giuseppe Garibaldi”, but we manually checked it only once).
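The reuse of manual checks across cities described above can be sketched roughly as follows. This is a minimal Python illustration, not the project's actual code: the function names, the normalisation rule, the shape of the lookup table, and the placeholder identifier are all assumptions made for the example.

```python
def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so trivially different spellings share a key."""
    return " ".join(name.casefold().split())


def apply_known_matches(street_names, checked):
    """Split street names into those already checked elsewhere and those still to review.

    `checked` maps a normalised street name to the Wikidata identifier
    confirmed manually in another city (hypothetical structure).
    """
    matched, to_review = {}, []
    for name in street_names:
        key = normalise(name)
        if key in checked:
            matched[name] = checked[key]
        else:
            to_review.append(name)
    return matched, to_review


# Placeholder identifier for illustration, not a real Wikidata QID.
checked = {normalise("Via Giuseppe Garibaldi"): "Q_PLACEHOLDER"}
matched, to_review = apply_known_matches(
    ["Via  Giuseppe Garibaldi", "Via Roma"], checked
)
```

Here only "Via Roma" would be left for a human to review, which is where the time saving comes from; the price is that a homonym in another city would silently inherit the wrong match.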

The resulting output got reasonable visibility and helped stir some much-needed debate. But that was in Italy, and we like Europe-wide stories. Also, we felt the debate needed to go further than the gender dimension, and more explicitly consider who all these men who have streets dedicated to them were.

We understood that this would entail more complexity and more work but, building on our first experience with Italy, we moved forward in the context of our work for the European Data Journalism Network. As we will see, things turned out to be much more complicated and time-consuming than we initially imagined, because many things change considerably from country to country, and sometimes from region to region. As a consequence, even in cases where in principle there would be a more efficient, more effective, or more comprehensive solution for a given city or country, we mostly stuck to a standardised approach for all of Europe, as this allowed us to have a more consistent data pipeline.

Main data sources and data processing features

Street names

The main point of reference for openly licensed geographic data is OpenStreetMap, and indeed OpenStreetMap is the source we used for street names (thank you, OpenStreetMap contributors!). OpenStreetMap has overall rather comprehensive and up-to-date data for the main European cities, and offers both the names of streets and their exact location on a map. Many cities around Europe distribute a list of street names in some format, but less often in a format that makes it possible to draw them on a map.

However, even if OpenStreetMap is a global project, local peculiarities emerge again and again, due to differences related to language and to local practices, as well as to different standards self-defined by communities of mappers in different countries.

Dealing with multilingual names can be a mess. In OpenStreetMap there is both a generic name tag and the possibility to include specific language versions. There is a set of rules on how to deal with all sorts of issues related to multilingualism, but these can be country-specific, and are often left incomplete.

We relied on the generic name tag, assuming it would be in the main language used in a given country, and dealt with exceptions either on a case-by-case basis (e.g. as we did with the relatively small Bozen/Bolzano in South Tyrol) or more systematically, as we did with the Brussels municipalities (where the name tag includes both the French and the Dutch name of the street separated by a hyphen, with the order of languages depending on the municipality).
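To illustrate the more systematic handling of the Brussels case, here is a minimal Python sketch of how a bilingual name tag could be split. It is not the project's actual code: the function name, the assumption that the separator is a hyphen surrounded by spaces, and the per-municipality language order passed in as a parameter are all hypothetical.

```python
from typing import Optional


def split_bilingual_name(
    name: str,
    language_order: tuple[str, str],
    separator: str = " - ",
) -> Optional[dict[str, str]]:
    """Split a name such as "French name - Dutch name" into its two versions.

    `language_order` says which language comes first in a given municipality.
    Returns None when the name does not split cleanly into exactly two parts,
    so that such cases can be reviewed by hand.
    """
    parts = [part.strip() for part in name.split(separator)]
    if len(parts) != 2 or not all(parts):
        return None
    return dict(zip(language_order, parts))


french_first = split_bilingual_name("Rue de la Loi - Wetstraat", ("fr", "nl"))
# Hyphens inside a name are left alone because they are not surrounded by spaces.
not_bilingual = split_bilingual_name("Grand-Place", ("fr", "nl"))
```

Splitting only on a spaced hyphen keeps names such as "Grand-Place" intact, but it would miss tags written without spaces around the separator; those would need a manual check.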

There is no commonly agreed standard on how street names should precisely be written. The broad expectation is that street names are written in the same way as they are written in the real world, but then there are a number of best practices, some of them country-specific. For example, if the full name of a person is known, then this should be used rather than only the first letter of the first name. But such things may be inconsistently applied within a given country, never mind across countries. Such inconsistencies complicate not only matching a given street to the person or entity it is dedicated to, but also matching it with other relevant datasets that may exist at city, region, or country level.

All things considered, for better or worse, we stuck with the primary name tag for all street names.

Next, we wanted to retrieve all streets in a given city or municipality.

Administrative boundaries

OpenStreetMap has records for administrative boundaries, but, as quite clearly stated in the relevant documentation, these are used inconsistently across countries:

“While admin_level=2 is almost always a de-facto independent country, and admin_level=4 is usually equivalent to a “province”, higher values vary in meaning between countries. A data consumer looking for municipalities corresponding to “city”, “town” or “village” boundaries will find these tagged anywhere from admin_level=4 to admin_level=10”.

In principle, we could have checked what the practice is in each country, seen whether it is indeed applied consistently across that country, seen whether the boundaries are matched to a consistent identifier, and taken it from there.

Instead, we opted for the more practical solution of relying on a dataset of Local Administrative Units (LAUs) distributed by Eurostat, which can be used provided the following copyright notice is included in the output: © EuroGeographics for the administrative boundaries.

This made it possible to use an already standardised dataset, with a consistent identifier that can be used to match it to other data sources. Even so, the output did not always make sense. For example, there are a few cities (including Brussels, Lisbon, and Dublin) that are composed of a number of municipalities. In such cases it may be relevant to keep the data at the municipality level, but for many readers having the full urban area would make more sense. In such cases, we fell back on the relevant NUTS region or on another aggregation level existing nationally (e.g. for Lisbon).

With standardised administrative boundaries, and with country files with all streets in a country taken from OpenStreetMap (as distributed by Geofabrik), it is straightforward to crop the national or regional data with the boundaries of the given municipality, drop all streets that have no name (often, service streets) and get a set of street names that can easily be drawn on a map.
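As a rough illustration of this cropping step, the following pure-Python sketch keeps named streets that have at least one vertex inside a municipal boundary. It is a simplification made for the example, not the project's pipeline: a real workflow would more likely rely on a geospatial library for a proper geometric intersection, and the data structures (`ways` as dictionaries with `name` and `coords`, the boundary as a single ring of points without holes) are assumptions.

```python
Point = tuple[float, float]  # (x, y), e.g. longitude and latitude


def point_in_polygon(point: Point, polygon: list[Point]) -> bool:
    """Ray casting: count how many polygon edges a horizontal ray from the point crosses."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def named_streets_in_boundary(ways: list[dict], boundary: list[Point]) -> list[dict]:
    """Drop unnamed ways, then keep those with at least one vertex inside the boundary."""
    kept = []
    for way in ways:
        if not way.get("name"):
            continue  # unnamed ways (often service streets) are dropped
        if any(point_in_polygon(p, boundary) for p in way["coords"]):
            kept.append(way)
    return kept


# Toy data: a square "municipality" and three ways.
boundary = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
ways = [
    {"name": "Via Roma", "coords": [(2.0, 2.0), (3.0, 2.0)]},
    {"name": None, "coords": [(4.0, 4.0), (5.0, 4.0)]},
    {"name": "Via Lontana", "coords": [(20.0, 5.0), (21.0, 5.0)]},
]
streets = named_streets_in_boundary(ways, boundary)
```

Testing vertices only means that a street crossing the municipality without having a vertex inside it would be missed, and that streets straddling the border are kept whole rather than clipped; a true intersection avoids both issues.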

This is also how we derived a unique identifier for each street. In OpenStreetMap, streets do not have a unique identifier (unless they are part of what in OSM parlance is called a relation). In our case, the identifier became a combination of the full street name itself with the gisco_id of the municipality: all streets with exactly the same name spelling within a given municipality are considered to be a single street.
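The identifier logic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: the "::" separator, the field names, and the sample gisco_id and way values are all made up for the example.

```python
from collections import defaultdict


def street_id(gisco_id: str, name: str) -> str:
    """Combine the municipality identifier with the exact street name spelling."""
    return f"{gisco_id}::{name}"


def group_segments(segments: list[dict]) -> dict[str, list[int]]:
    """Group OpenStreetMap way ids under one identifier per (municipality, name) pair."""
    streets = defaultdict(list)
    for segment in segments:
        key = street_id(segment["gisco_id"], segment["name"])
        streets[key].append(segment["way_id"])
    return dict(streets)


# Hypothetical values: two segments of the same street, plus a homonym elsewhere.
segments = [
    {"gisco_id": "XX_00001", "name": "Via Roma", "way_id": 101},
    {"gisco_id": "XX_00001", "name": "Via Roma", "way_id": 102},
    {"gisco_id": "XX_00002", "name": "Via Roma", "way_id": 201},
]
streets = group_segments(segments)
```

Because the name is used verbatim, two spellings of the same street within a municipality would end up as two separate streets, while two genuinely distinct streets sharing a name would be merged into one.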

Some countries or cities have unique identifiers for streets (e.g. Czechia). Some cities, mostly larger ones, have each of their streets recorded as a separate Wikidata entity. Still, due to differences in spelling, word order, the presence or absence of honorific titles, and the like, achieving a full match between such datasets is not straightforward, and may need custom solutions for each relevant city. Again, rather than developing custom country- or city-specific solutions, we went for a solution that could be consistently applied across our datasets.

Now comes the difficult part.

Matching streets to the entity they are named after

Potentially, there are a nu