Search4EU: Building a journalist’s ideal librarian
This piece details the work of the Datactivist team with the European Data Journalism Network (EDJNet) to master and use Aleph to build a search engine for public documents, regardless of their format or language.
The repository containing the code for the Search4EU project is available under a Creative Commons license. We are also assembling a user guide as a wiki for the repository, for those who would like to use Aleph for their own crawling and indexation projects.
EDJNet was the perfect place to explore this tool. We also thank and praise the wonderful team at OCCRP, who built this game-changing tool and welcomed, listened to and helped us all along: your work is gold.
Sylvain discovered the Aleph stack at the 2018 Open Government Partnership summit in Tbilisi (Georgia). After a pretty classic presentation on open source investigation, he sat at the table of a digital nomad/dev working at the time for OCCRP in Sarajevo, who flipped her laptop open to the black interface he would later become familiar with: a search engine to query several databases covering pretty much every country in the world.
In the meantime, Mathieu was working on the best ways to explore administrative acts. Because of legal obligations (for example, the need to use the signed version of a city council’s decision to make it valid), most of those were published as image-based PDF files, effectively preventing them from being indexed and searched. This trove of documents, with the data and information it contained, was technically locked.
We were both cooperators at Datactivist, an open data consulting company based in Provence, France, and we were discussing our research at the Ecomotive café, down the great white stairs of the Saint-Charles station in Marseille. It was only after a couple of hours that we realised that one of us might hold the key to the safe the other was trying to open.
Using Aleph to index a batch of files
Our first attempt consisted of using Aleph as a local search engine: feeding it with batches of PDF files and letting it sort them out, reading their content and connecting the dots to help our research through them… a sort of home-made librarian.
This is in fact Aleph’s original purpose. OCCRP’s own Aleph instance searches more than 300 public datasets and leaks, from the German company registry to the blog posts of the assassinated Maltese reporter Daphne Caruana Galizia or WikiLeaks publications. It is, by itself, an open-air gold mine.
We already had a batch of files to dig into, so we decided to take the easy path: throw them at Aleph and see how it chewed through them.
We decided to go slow by using Aleph locally first. It meant installing Aleph, updating it and launching it from our command line to get a local URL that could only be accessed from our computer.
We had already scraped documents from our local governments: registers of administrative acts of the Marseille city council, and reports of local government meetings from several cities in Seine-Saint-Denis.
The operation is quite simple: you open the command line and you point Aleph at what you want it to “ingest”. In this case, a folder containing all the PDF files. Not complicated but not very efficient either: without the extra metadata explaining what was in the file, the only intel Aleph could retrieve was what it could find in the written content of the documents… and it was pretty messy.
In practice: the Marseille registers use a two-column layout. This left Aleph wondering: where does one decision start and where does it end? Where is the date? Where is the title? In the end, our batch was ingested, but the nutrients for investigation were badly missing…
Aleph is not a file cabinet, it’s a search engine. And without metadata, a search engine is as messy as the photo folder in your smartphone (when you haven’t sorted it in two or three years).
We thought at first that the information contained in the documents would be sufficient for Aleph to self-organize. That was foolish. Not because Aleph is bad at it, but because documents come in so many shapes and sizes that you can’t predict the metadata and data that will eventually be extracted from them!
We needed metadata. And, to gather some, we needed to use the full Aleph stack and start looking at Memorious. To learn how to do it yourself, check the Aleph section of our user guide once it’s out.
Using Memorious to extract public documents
We decided to turn to Memorious, the “crawling” tool of the Aleph stack, to automatically retrieve big batches of documents from public websites. In practical terms, Memorious allows you to manage small programs (the crawlers) that will follow a given path to retrieve documents. These crawlers can be defined using a set of existing functions, whose behavior can be customized through various parameters, as a pipeline of stages: generating seed URLs, connecting to services, following links, cleaning content, storing documents, etc. Of course, you can also write your own functions in Python (and that’s what Mathieu did, on several occasions, to fill the gaps or fine-tune crawlers).
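To make the idea of a custom function more concrete, here is a minimal sketch of the kind of Python stage that can be slotted into a crawler. It assumes the Memorious convention of calling each stage with a context and a data dictionary and handing results to the next stage through context.emit. The function name keep_pdf_links, the "url" key and the FakeContext stand-in are hypothetical, introduced only so the sketch runs without Memorious installed; this is not the code Mathieu wrote.

```python
from urllib.parse import urlparse


def keep_pdf_links(context, data):
    """Hypothetical custom stage: pass on only items whose URL ends in .pdf."""
    url = data.get("url", "")
    # Look at the path only, so query strings such as ?download=1 are ignored.
    if urlparse(url).path.lower().endswith(".pdf"):
        context.emit(data=data)


class FakeContext:
    """Stand-in for the crawler context, used here only to try the stage out."""

    def __init__(self):
        self.emitted = []

    def emit(self, data=None, rule="pass"):
        # Record what the stage would have handed to the next stage.
        self.emitted.append(data)


def run_stage(stage, items):
    """Feed each item to the stage and collect everything it emits."""
    context = FakeContext()
    for item in items:
        stage(context, item)
    return context.emitted


if __name__ == "__main__":
    sample = [
        {"url": "https://example.org/acts/2020-01.pdf"},
        {"url": "https://example.org/about.html"},
    ]
    print(run_stage(keep_pdf_links, sample))
```

In a real crawler, a function like this would sit between the stage that follows links and the stage that stores documents, so that only the files worth keeping travel further down the pipeline.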
We wanted to harness the power of Memorious to help our colleagues from Civio investigate access to mental healthcare in Europe. And we aimed too high for a start: retrieving all the public decisions regarding health from the Spanish national and provincial legislatures available online.
In practice, it meant designing one crawler per website that would generate the URL for a decision, retrieve the page, extract from the layout the part of the page on the decision, clean it and store it. Easier said than done.
We first asked our colleague from Civio where to look. She provided us with a long list of websites, mainly national and local authorities: the Boletín Oficial del Estado (the Spanish official journal), its equivalent in each autonomous community and city, the Spanish Congress website, etc.
We decided to focus first on the official gazettes, as they represented the most promising source of documents with the widest variety.
The first difficulty we encountered was that each website had its own structure, which was reflected in the shape of the URLs of the decisions. The place of publication for the deliberations of the Junta de Andalucía, the Boletín Oficial de la Junta de Andalucía (which we ended up calling the Boja), labelled each URL with the year and day of publication. Decisions are published almost every day. Fortunately, Memorious had this covered: with init methods, we could generate the pieces of the URL, stepping one day or one year up or down, and paste them together to look for valid URLs. We had to do it for each website. And, here, we deeply regretted our Spanish classes.
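The date-based URL generation described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Memorious configuration we used: the URL pattern below is hypothetical (the real gazette URLs are not reproduced here), as are the names URL_TEMPLATE and candidate_urls.

```python
from datetime import date, timedelta

# Hypothetical URL pattern, for illustration only.
URL_TEMPLATE = "https://example.org/boletin/{year}/{month:02d}/{day:02d}"


def candidate_urls(start, end, template=URL_TEMPLATE):
    """Yield one candidate URL per calendar day from start to end, inclusive."""
    current = start
    while current <= end:
        yield template.format(
            year=current.year, month=current.month, day=current.day
        )
        current += timedelta(days=1)


if __name__ == "__main__":
    for url in candidate_urls(date(2019, 12, 30), date(2020, 1, 2)):
        print(url)
```

Since decisions are published almost every day rather than every day, some of the generated URLs will not lead anywhere, so whatever fetches them has to tolerate missing pages and move on.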
But this part was the easy one. Once we got to the page itself, we needed to target the right link to follow to get to the docs. And that’s where things got complicated: some of those sites generate their pages on request (dynamic pages). We hit a wall several times, trying to point at something that we “saw” but that wasn’t available at a given URL. We couldn’t point at a place, we had to trace the path to it from a starting page. Memorious can do that, but you have to become a first-class GPS device to get it where you want.
Back from this Spanish expedition, we came to understand how important it is to master every step of the pipeline, and how versatile that pipeline is: the many options offered by the existing functions, and the level of detail with which you can specify what you keep and what you stash, proved essential for running crawlers on websites as complex as those.
The second important thing: we would have greatly benefited from a good guide in this mess! At the crossroads of legal terminology and local parameters, those pages may look like an amusement park to an experienced hiker of the Spanish legislative path. For us, it was a foggy maze of exotic references and strange labels. Knowing your way in a website is not something to learn while programming a crawler: it’s a prerequisite.
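Tracing a path from a starting page, rather than pointing straight at a final URL, boils down to reading the links a page actually offers and choosing the next one to follow. The sketch below shows that single step with the Python standard library only. It is not Memorious’s own parsing stage, and the names LinkCollector and find_links, as well as the sample URLs, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute_url, link_text) pairs from the <a> tags of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they were found on.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_links(html, base_url, label):
    """Return the URLs of links whose visible text contains the label."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return [url for url, text in parser.links if label.lower() in text.lower()]


if __name__ == "__main__":
    page = '<a href="sumario/12.html">Sumario del día</a> <a href="/ayuda">Ayuda</a>'
    print(find_links(page, "https://example.org/boletin/2020/", "sumario"))
```

Repeating this step page after page, each time picking the link by its label rather than by a guessed address, is one way to picture the “GPS” work described above.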
We’ll compile the step-by-step process of Memorious installation, crawler setup and running in the Memorious section of our user guide.
Combining Memorious and Aleph to create a better search engine for public websites
We had now experienced the ingestion process as well as the crawling. Combining the two would allow us to design a custom document search engine, updating itself and indexing documents on the fly. Our original intent, in a way. But, for this, we needed two things:
a source of documents narrow enough to be manageable but diverse enough to cover several topics and countries at once to be useful to as many journalists as possible;
a source that didn’t already feature a search engine or that featured a search engine that didn’t fully meet the needs of journalistic methodology (we didn’t want to make a duplicate internal search engine for a public database).
And so, we decided to use the Aleph stack to improve the search on one of the most European topics of all: competition.