Talking about data

In the week before Christmas, I headed off to Exeter University in England to attend two different workshops. The first (15-16 December) was called ‘Knowledge/Value and Dark Data’, and was the fifth and final workshop in that series. This particular workshop was organized by Kaushik Sundar Rajan (University of Chicago), Gail Davies, Sabina Leonelli and Brian Rappert (all based at Exeter). The second (17-19 December) was the first exploratory workshop for an ERC-funded project led by Sabina Leonelli on ‘data-intensive science’. Each workshop had about 25 participants; some – including me – showed great stamina and participated in both. Participants included philosophers, historians, geographers, sociologists, economists and anthropologists of science. The first workshop was organized around a series of pre-circulated papers touching on different aspects of ‘dark data’. The second was indeed more exploratory: the invited participants presented examples from past and ongoing research about what data and data intensity/intensiveness mean in the sciences they study, including high energy physics, archaeology, climate science, the biomedical and life sciences, and economics. I will not discuss each workshop and presentation in turn, but rather pull out a few themes that I found particularly interesting.

dark data

The first is, of course, what exactly is meant by ‘dark data’. In my invited commentary at the end of the workshop, I distinguished between four types:

  1. Data that are inaccessible to current ways of knowing – in other words, data we can no longer access because we don’t have the right equipment, methods or techniques.
  2. Data that are deliberately kept out of view, such as for military intelligence, or competitive advantage. Of course, such data are accessible to those who work with them on a daily basis.
  3. Data about taboo or very sensitive topics that may be difficult to generate or record, such as data about historic and present child abuse.
  4. Taken-for-granted data that we perhaps don’t recognize as data, due to disciplinary or cultural assumptions.

None of these suggest essential qualities of the data. At the very least we need to pay attention to temporality, as data become inaccessible over time. Several contributors also reminded us to pay attention to the work and technologies that are needed to make data more or less visible. Enormous effort is needed to make data amenable to analysis.

data overload

Some of these issues returned in the second workshop, which started with different people giving examples of data intensity in the sciences in which they worked or which they studied, including genetics, oncology, plant biology, economics, physics, sociology, archaeology, climate science, and science and technology studies (STS) (the list is deliberately broad). In the opening session, Sabina Leonelli defined data as the product of research activities that are used to make knowledge claims, and which can be circulated and re-used. One could quibble with such a definition (and many did – such is the nature of academic debate), and some parts apply more closely in some sciences than in others, but it helped to start the very interesting discussion that continued over three days. One interesting theme concerned ‘stewardship’ of data, a matter of particular importance to archaeologists, who take seriously their responsibility to the past and whose data are often part of the historical record and/or of continued religious or social significance to particular groups of people.

In other instances, such as where personal privacy, integrity and well-being are involved, destroying data collected for medical or social research might best serve ethical concerns. We heard about some of the extremes of data abundance, such as the ATLAS project at CERN, whose researchers cannot possibly capture all of the data generated in their experiments and so rely on computer algorithms to select the ‘interesting cases’ for subsequent analysis. This resonated with other sciences, in which, increasingly, medical researchers ‘are not testing drugs but are testing algorithms’. Whether computer models are being used to generate data or to reduce them, the importance of digital technologies in data-intensive sciences was inescapable. This suggests that understanding what the technology can and cannot do becomes ever more important, in all disciplines, not only in ‘big science’.