Published On: March 31st, 2025

CASTARTER: an R Package for Text-Mining

Data journalism often means using a wide variety of tools to make the most of available data. With this in mind, our partners are always looking for new, more efficient ways to work with data, and their efforts often turn into something tangible that all data journalists can access. That’s what Giorgio Comai (OBC Transeuropa) did by creating a text-mining R package, which has now been released in a new, enhanced version.

The tool is now much more flexible and easier to use in a broader range of situations. It can extract an arbitrary number of textual and non-textual fields from web pages or from other formats returned by APIs (e.g. JSON, XML).
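To give a feel for what extracting an arbitrary set of fields from an API response involves, here is a minimal sketch in Python. It is not castarter's R interface: the `extract_fields` helper, the field map, and the sample JSON are all made up for illustration, and the package itself may work quite differently.

```python
import json

# Hypothetical API response; the content is invented for this example.
API_RESPONSE = """
{"title": "Budget vote",
 "author": {"name": "A. Rossi"},
 "tags": ["eu", "budget"],
 "stats": {"words": 812}}
"""

# Each output field is described by the path of keys leading to it.
# The user decides how many fields to list; "subtitle" is deliberately absent
# from the response to show how missing fields are handled.
FIELDS = {
    "title": ["title"],
    "author": ["author", "name"],
    "word_count": ["stats", "words"],
    "subtitle": ["subtitle"],
}


def extract_fields(document, fields):
    """Return one flat record with a value (or None) for every requested field."""
    record = {}
    for name, path in fields.items():
        value = document
        for key in path:
            if isinstance(value, dict) and key in value:
                value = value[key]
            else:
                value = None  # field not present in this response
                break
        record[name] = value
    return record


record = extract_fields(json.loads(API_RESPONSE), FIELDS)
print(record)
```

Describing fields as data rather than hard-coding them is what lets the same routine handle both textual fields (title, author) and non-textual ones (word count).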

Version 0.3 also addresses two major pain points of the text-mining process. First, a sitemap-based workflow makes it quicker to retrieve and update datasets based on multiple sections of a website, with reduced load on the server. Second, automatic content and metadata extraction based on the `readability` library has been introduced. All these changes aim to make the user experience much smoother and the tool fully functional in a large number of situations.
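The reasoning behind a sitemap-based workflow can be sketched in a few lines of Python. This is only an illustration of the general idea, not castarter's implementation: the `urls_to_fetch` function, the sample sitemap, and the record of earlier downloads are assumptions made for the example. A sitemap lists every page together with its last modification date, so one request is enough to decide which pages of a given section are new or changed.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap; a real workflow would download it once from the website.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/news/a</loc><lastmod>2025-03-01</lastmod></url>
  <url><loc>https://example.com/news/b</loc><lastmod>2025-03-20</lastmod></url>
  <url><loc>https://example.com/about</loc><lastmod>2025-03-25</lastmod></url>
</urlset>"""


def urls_to_fetch(sitemap_xml, section_prefix, downloaded):
    """List URLs in one section that are new or modified since the last download.

    `downloaded` maps each URL already in the dataset to the ISO date
    on which it was last retrieved.
    """
    root = ET.fromstring(sitemap_xml)
    pending = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = entry.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS)
        if loc is None or not loc.startswith(section_prefix):
            continue  # outside the section of interest
        previous = downloaded.get(loc)
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        if previous is None or lastmod > previous:
            pending.append(loc)
    return pending


already = {"https://example.com/news/a": "2025-03-05"}
pending = urls_to_fetch(SITEMAP_XML, "https://example.com/news/", already)
print(pending)
```

Only the pages returned by the function need to be requested, which is where the reduced server load comes from: unchanged pages and pages outside the chosen section are never downloaded again.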

To make the tool easier to use and to show its potential, we published a dedicated workflow that provides a robust and repeatable method for systematically collecting and organizing textual data from web sources. It is designed to acquire large volumes of specific online content efficiently, even for users without extensive programming knowledge.

💻 Check out Castarter 0.3 here.

🔎 Explore the section dedicated to the text mining workflow here.