Don't miss the train: Methodology

What we did and how we did it

This article tries to answer a seemingly straightforward question: how easy is it for citizens in Europe to travel by train, and what explains differences within countries? To answer it, we looked at two measures: the distance to the nearest train station, and the proportion of people who are well connected to the railway network (less than 10,000 steps to a station) versus poorly connected to it (at least 30,000 steps to a station).

Distances to a train station tell us how far someone must travel, while the other measurement gives us an idea of how many people must rely on a car, bus or taxi in order to get to the train station. The cut-off point of 30,000 steps is arbitrary but is based on the assumption that you’re not likely to walk that distance just to get to a train.
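The second measure can be illustrated with a minimal sketch. It assumes each population grid square has already been given a population count and a distance in steps to its nearest station; the function name and the sample figures are hypothetical, and this is not the code used in the analysis.

```python
WELL_CONNECTED_MAX_STEPS = 10_000    # "well connected": less than 10,000 steps
POORLY_CONNECTED_MIN_STEPS = 30_000  # "poorly connected": at least 30,000 steps


def connection_shares(cells):
    """Return (share_well_connected, share_poorly_connected).

    cells: iterable of (population, steps_to_nearest_station) pairs,
    one pair per population grid square (hypothetical input shape).
    """
    total = well = poorly = 0
    for population, steps in cells:
        total += population
        if steps < WELL_CONNECTED_MAX_STEPS:
            well += population
        elif steps >= POORLY_CONNECTED_MIN_STEPS:
            poorly += population
    if total == 0:
        raise ValueError("no population in input")
    return well / total, poorly / total


# Hypothetical grid squares: (population, steps to nearest station)
cells = [(600, 2_500), (300, 15_000), (100, 42_000)]
print(connection_shares(cells))  # (0.6, 0.1)
```

People between the two thresholds count towards the total but fall into neither group, which is why the two shares need not add up to one.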

We haven’t found any official or unofficial source that contains data for European countries on distances to train stations or numbers on how many people live near a train station. This meant that we needed to gather and create this data ourselves.

Step by step, this is what we did:

1. In July 2019, we downloaded all train stations from the European Register of Infrastructure (RINF, access requires login). The total number of stations downloaded was 29,511.

The Register of Infrastructure did not contain data for Ireland and Switzerland. The data quality for Romania and the Netherlands was very poor. We ended up excluding Romania, Ireland and Switzerland from the analysis. For the Netherlands we collected extra stations from Wikipedia. We also added more stations from other sources for Italy and Germany.

2. In October 2019, we searched for the names of the train stations in the HaCon Fahrplan-Auskunfts-System (HAFAS) and combined the two data sources when they pointed to the same train station. This step is roughly equivalent to searching for a train station using Deutsche Bahn’s booking system. Since we wanted to look up thousands of stations, we used this client, which connects to the HAFAS public transport APIs.

To put it simply, we searched for a station name from RINF, the HAFAS client returned all similar station names in the database, and we selected the one that lay closest to the RINF station. The right match wasn’t always obvious. For every station in RINF that HAFAS didn’t know, we manually checked whether it was indeed a train station. Some were, and some weren’t. We only moved forward with countries where we could either map or explain at least 90 per cent of RINF stations. The following countries were removed in this step: Estonia, Latvia, Lithuania, Greece, Spain, Norway and Slovakia. This means that HAFAS data for these countries was especially poor at the time.
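The "closest candidate" rule can be sketched as follows. This is an illustration under stated assumptions, not the code used in the analysis: the candidate records are plain dictionaries with hypothetical names and coordinates (no real HAFAS client call is shown), the distance uses an equirectangular approximation that is only adequate over short distances, and the 500-metre limit is the one mentioned in the FAQ.

```python
from math import cos, hypot, radians

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumption of this sketch)
MAX_MATCH_DISTANCE_M = 500  # matching limit mentioned in the FAQ


def approx_distance_m(lat1, lon1, lat2, lon2):
    """Equirectangular approximation of the distance in metres.

    Good enough for points a few kilometres apart; not for long distances.
    """
    x = radians(lon2 - lon1) * cos(radians((lat1 + lat2) / 2))
    y = radians(lat2 - lat1)
    return EARTH_RADIUS_M * hypot(x, y)


def match_station(rinf_lat, rinf_lon, candidates):
    """Return the candidate closest to the RINF station, or None.

    candidates: list of dicts with 'name', 'lat' and 'lon' keys
    (a hypothetical shape for the name-search results).
    None means the station needs a manual check.
    """
    if not candidates:
        return None

    def distance(candidate):
        return approx_distance_m(rinf_lat, rinf_lon, candidate["lat"], candidate["lon"])

    best = min(candidates, key=distance)
    return best if distance(best) <= MAX_MATCH_DISTANCE_M else None


# Hypothetical search results for one RINF station located at (52.5250, 13.3690)
candidates = [
    {"name": "Example Hbf", "lat": 52.5251, "lon": 13.3694},
    {"name": "Example Ost", "lat": 52.5100, "lon": 13.4350},
]
match = match_station(52.5250, 13.3690, candidates)
print(match["name"])  # Example Hbf
```

Returning None rather than the nearest candidate regardless of distance keeps doubtful cases visible, so they can be checked by hand.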

3. We only included countries where HAFAS had full coverage on train timetables.

We estimated coverage by simulating five known train routes per country and checking whether HAFAS knew that you could travel them by train (routes can be seen here). We moved forward with the analysis for a country only if the database identified all five of its routes as travelable by train.

For all 16 countries that remained after the previous step, we found that the API correctly identified all of the routes as available by train.
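The pass rule for this step is strict: one missed route is enough to drop a country. A minimal sketch, using hypothetical country labels and made-up results rather than our actual test data:

```python
ROUTES_PER_COUNTRY = 5  # number of known routes tested per country


def has_full_coverage(route_found_by_train):
    """True only if all tested routes were found as train journeys.

    route_found_by_train: list of booleans, one per tested route.
    """
    return (
        len(route_found_by_train) == ROUTES_PER_COUNTRY
        and all(route_found_by_train)
    )


# Hypothetical results: True if the lookup returned a journey by train
results = {
    "Country A": [True, True, True, True, True],
    "Country B": [True, True, False, True, True],
}
kept = [country for country, found in results.items() if has_full_coverage(found)]
print(kept)  # ['Country A']
```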

4. We categorised train stations as relevant or not relevant. The total number of relevant train stations was 22,852.

By our definition, a relevant train station is one from which you can travel by train (changes to other trains are allowed) to the capital city of a given country. A station is not relevant if you have to drive or take the bus, even for only part of the route, to get to the capital.

We categorised each station as relevant or not relevant, depending on the response from HAFAS when simulating 25 journeys from each station to the train station in the country’s capital. We double-checked and, when necessary, manually (re)categorised all stations that, according to the data from HAFAS, were not relevant. We categorised stations on Sicily and Sardinia as relevant because partners suggested that the train was a viable option to travel to the capital from these islands, even though you also need to take a ferry. We manually added more stations for Italy (+116), the Netherlands (+11) and Germany (+320) that were missing from either RINF or HAFAS. 
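The definition can be expressed as a small rule. The sketch below rests on several assumptions that are not spelled out above: each simulated journey is reduced to a list of transport modes per leg, a station counts as relevant if at least one simulated journey uses only allowed modes, and the ferry exception is passed in as a flag. The mode labels and function names are hypothetical.

```python
RAIL_MODES = {"train"}  # hypothetical mode label for rail legs


def journey_is_rail_only(leg_modes, allow_ferry=False):
    """True if every leg of the journey uses an allowed mode.

    leg_modes: list of mode strings, one per leg, e.g. ["train", "bus"].
    allow_ferry: treat ferry legs as acceptable (island exception).
    """
    allowed = RAIL_MODES | ({"ferry"} if allow_ferry else set())
    return bool(leg_modes) and all(mode in allowed for mode in leg_modes)


def is_relevant(journeys, allow_ferry=False):
    """Assumed rule: relevant if at least one simulated journey to the
    capital consists only of allowed modes (changes between trains are fine)."""
    return any(journey_is_rail_only(legs, allow_ferry) for legs in journeys)


# Hypothetical simulated journeys from three stations to the capital
print(is_relevant([["train", "bus"], ["train", "train"]]))           # True
print(is_relevant([["bus", "train"]]))                               # False
print(is_relevant([["train", "ferry", "train"]], allow_ferry=True))  # True
```

A False result here would only be a starting point: stations flagged as not relevant were still checked by hand.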

5. For each country, we found the nearest relevant train station for each person in the country.

We drew as-the-crow-flies lines between all squares in a 1x1 km population grid and all relevant train stations in a given country and, for each square, classed the station at the end of the shortest line as the closest train station.
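A minimal sketch of this nearest-station search, assuming grid squares are represented by the latitude and longitude of their centre and that distance is measured with the haversine (great-circle) formula. Both are assumptions of the example, as are the station names and coordinates; the analysis itself may have used a different distance calculation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption of this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle ("as the crow flies") distance in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_station(cell_lat, cell_lon, stations):
    """Return (name, distance_km) for the station closest to a grid square.

    stations: list of (name, lat, lon) tuples for one country.
    Compares the square against every station (brute force).
    """
    return min(
        ((name, haversine_km(cell_lat, cell_lon, lat, lon)) for name, lat, lon in stations),
        key=lambda pair: pair[1],
    )


# Hypothetical stations and one hypothetical grid-square centre
stations = [("Station A", 55.0, 12.0), ("Station B", 56.0, 10.0)]
name, distance_km = nearest_station(55.1, 12.1, stations)
print(name)  # Station A
```

Comparing every square with every station is simple but slow for large countries; a spatial index would give the same result faster.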

General problems with the quality of our data

We have not found any open international or regional source, private or otherwise, that contains an exhaustive list of train stations across Europe. If you are interested in the subject of train stations you are, as we see it, left with four alternatives: HAFAS (HaCon Fahrplan-Auskunfts-System), RINF (European Register of Infrastructure), national authorities or any of the crowdsourced lists that can be found online.

HAFAS is a booking system developed by the privately owned Siemens subsidiary Hannover Consulting. The upside of HAFAS is that quite a few big carriers in European countries use it on their booking websites. The downside is that it’s proprietary and the company doesn’t publish its list of stations, so you have to create your own. Doing this was not a viable option for us and, as we realised, the quality of the data in HAFAS is sometimes poor. Not only does the system lack stations for entire regions in some countries, but the location of a train station is sometimes completely off.

The European Register of Infrastructure is maintained by the European Railway Agency and every member state (as well as Norway and Switzerland) is supposed to report stations (and other railway-related information) to the database. RINF is, as far as we understand it, the most exhaustive official list of stations across Europe. We determined that RINF is the best we can do short of approaching all states individually. With hindsight, we noticed that private railways were less likely to be included in RINF for some countries. You can read more about RINF here.

Crowdsourced lists are not a bad alternative. But since we wanted to check whether you could travel from a specific station, we relied on names and coordinates to match each station to another source (in our case HAFAS), which is why we determined that an official source was more appropriate.

Another issue with our data is that the population grid that we use to represent people in Europe is from 2011. Undoubtedly, populations have increased and countries have become more urbanised since then, but it’s the latest grid available.

Notes

  • The following places have been excluded from the analysis, even though they belong to or are connected to countries that we've analysed: Corsica (France), Bornholm (Denmark), Northern Ireland (United Kingdom), Isle of Wight (United Kingdom), Orkney (United Kingdom), Shetland (United Kingdom), Western Isles (United Kingdom), Åland (Finland), Azores (Portugal), Madeira (Portugal) and Gotland (Sweden). We excluded these areas because they are not connected to the mainland where the end destination is. There are other populated islands that are included in our results, for example Lampedusa (Italy) and Heligoland (Germany).
  • None of the stations on Denmark's new Letbanen light rail were in the European station data when we conducted the research.

FAQ

How did you calculate steps?

Steps = distance in km / 1.6 * 1975 (source)
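As a worked example, the formula can be written as a pair of conversion functions. Reading 1.6 as kilometres per mile and 1975 as steps per mile is our interpretation of the formula; the constant and function names are only illustrative.

```python
KM_PER_MILE = 1.6      # rounded factor from the formula above
STEPS_PER_MILE = 1975  # steps figure from the formula above


def km_to_steps(km):
    """Convert kilometres to steps: km / 1.6 * 1975."""
    return km / KM_PER_MILE * STEPS_PER_MILE


def steps_to_km(steps):
    """Inverse conversion: steps * 1.6 / 1975."""
    return steps * KM_PER_MILE / STEPS_PER_MILE


print(round(steps_to_km(10_000), 1))  # 8.1
print(round(steps_to_km(30_000), 1))  # 24.3
```

With this formula, the 10,000-step threshold corresponds to roughly 8.1 km and the 30,000-step threshold to roughly 24.3 km.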

How do you determine if an area is urban, rural or intermediate?

We have mapped all population squares to NUTS3 regions and then mapped each square to the corresponding NUTS3 typology as defined by Eurostat here.

Do we know that the European Register of Infrastructure contains all the stations across Europe?

No. In fact, we know that the register was missing a lot of Romanian and Dutch stations when we exported the data. We also know that the RINF data contained a lot of stations that are no longer in use. Mapping the RINF data to HAFAS allowed us to exclude as many false positives as possible. We manually added more stations from other sources for the Netherlands, Germany and Italy, but we don’t know for sure that we have managed to find all relevant stations. On the contrary, it's very likely that we’ve missed some. Please let us know if you notice any gaps.

Is a RINF station always mapped to the correct HAFAS station?

No. But it is mapped to a HAFAS station with a similar name that lies within 500 metres of it, which usually means that it is the same station. Mismatches exist, but they are few and are not expected to affect the analysis.

Do we know that HAFAS has correct and updated timetables for all stations?

No. Our assumption is that if HAFAS says that you can travel from a station by train, then HAFAS is correct. We also assume that if HAFAS says that you can’t travel from a station by train, that response may be false. That is why we manually checked and (when necessary) recategorised all stations that had such responses from HAFAS.

How do we measure distance?

As the crow flies (in a beeline).

Some people live on the border with a neighbouring EU country. Why don’t we allow for a train station across the border to be relevant?

The data doesn’t allow for a cross border analysis since we don’t have data for enough countries.

Are suburban trains and stations included (e.g. S-Bahn, Pendeltåg etc.)?

We haven’t categorised trains or stations by type of traffic (transnational, national, regional or suburban). This means that in some cases suburban trains are included and in some cases they aren’t. Since we’ve simulated journeys to any station in the capital, we’ve determined that it’s safe to assume that suburban trains will mostly affect distances in NUTS3 regions where the capital is.

Is it possible that you categorised a train station as not relevant because of temporary suspensions in traffic?

Yes, but it’s unlikely. For every station that was initially categorised as not relevant, we double-checked the classification by searching for a journey to the capital on the national booking website with different departure dates. In doing so, we noticed for example that Denmark had a lot of railway work that was skewing our data, which we remedied by recategorising the stations as relevant. Longer suspensions in traffic (more than one month) may however have implications that we’re unaware of, as some booking sites don’t allow you to search for a journey further ahead than that.

Design

Sheldon.studio took care of the design of the main article of the investigation together with OBC Transeuropa. Here, Sheldon's Matteo Moretti shares some reflections on the choices that drove their project.

Wednesday 18 December 2019

Source/s:

Journalism++