Bridging the Gap between Data and Journalism

26 June 2015, by Štefan Emrich

Although data-driven journalism (DDJ) has been around as a buzzword for several years, it is still a fairly young discipline. It is also an increasingly important one: we live in a world in which understanding complex socio-economic and ecological phenomena is becoming ever more important for making well-informed decisions. Traditionally, journalists play an important role in this endeavor by uncovering hidden patterns and relationships to inform, enlighten, and entertain.

With the ever-growing amount and availability of data, it becomes crucial for journalists to use elements of data science in their work. Consequently, DDJ involves computer-supported, data-based reasoning as well as interactive visualization. Although DDJ has received attention from different communities and found its way into a number of well-known news organizations such as the New York Times and the Guardian, the majority of journalists still face significant obstacles that hamper the use of data in their work.

Specifically, three main gaps can be characterized: the usage gap (usable systems), the technology gap (dealing with heterogeneous data), and the workflow gap (facilitating DDJ in daily workflows). Firstly, journalists are often not trained in programming and data analysis, which impedes the use of currently available tools that require sophisticated technical expertise. Secondly, journalistic inquiry almost always demands weaving together complex, heterogeneous data sources. Most available analysis techniques focus on specific data structures and cannot deal with such sources, which is also a major challenge in Visual Analytics (VA). Thirdly, journalists are supported by IT infrastructure and follow a specific workflow in the news production process, under tight time and resource constraints. DDJ is neither well covered by this workflow nor supported by the IT systems behind it.

The goal of the VALiD project is to bridge these gaps by (1) following a user-centered and problem-driven research process, (2) designing techniques to support data journalists in dealing with complex heterogeneous data, and (3) developing a set of guidelines and best practices for DDJ workflows.

In the past, “data” may have been understood primarily as tables and numbers. This definitely no longer holds true. In the digital age, all media have turned into data: videos, pictures, sound, and text. This of course opens up new opportunities for data journalists, as they can draw information from a wide range of heterogeneous data. At the same time, it poses a huge technological challenge.

Because heterogeneous data is such a large field with many different facets, VALiD will focus specifically on two types of data: (1) textual data over time and (2) dynamic networks combined with quantitative flows. These two types were chosen because they are of special interest to state-of-the-art investigative journalism (e.g. “offshore-leaks”). They will therefore be at the core of the project over the next three years.
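
To give a rough idea of what "dynamic networks combined with quantitative flows" can mean in practice, here is a minimal Python sketch. It is not taken from the VALiD code base: the Flow record, the period labels, and the agency and outlet names are all invented for illustration. It simply assumes that each record says how much moved from one node to another in a given period.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One time-stamped, weighted edge: `amount` moves from source to target."""
    period: str    # hypothetical label format, e.g. "2014-Q1"
    source: str
    target: str
    amount: float


def flows_by_period(flows):
    """Group flows into one weighted edge list per period (a network snapshot)."""
    snapshots = defaultdict(lambda: defaultdict(float))
    for f in flows:
        snapshots[f.period][(f.source, f.target)] += f.amount
    return {period: dict(edges) for period, edges in snapshots.items()}


def total_outflow(flows, source):
    """Sum everything a single node sends out, per period."""
    totals = defaultdict(float)
    for f in flows:
        if f.source == source:
            totals[f.period] += f.amount
    return dict(totals)


# Invented example records, for illustration only.
example = [
    Flow("2014-Q1", "Agency A", "Outlet X", 100.0),
    Flow("2014-Q1", "Agency A", "Outlet Y", 50.0),
    Flow("2014-Q2", "Agency A", "Outlet X", 75.0),
    Flow("2014-Q2", "Agency B", "Outlet X", 20.0),
]
```

Calling flows_by_period(example) yields one edge list per period, so the network can be compared from one snapshot to the next, while total_outflow(example, "Agency A") follows a single actor over time. Real data would of course be larger and messier than these four records.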




Help us develop our code on GitHub:

netflower – Visual exploration of flows in dynamic networks
mtdb2 – Visual exploration of media transparency data