CEDAR

Project team
Ashkan Ashkpour, Albert Meroño Peñuela, Christophe Guéret, Andrea Scharnhorst, Reinier de Valk, Frank van Harmelen, Kees Mandemakers, Stefan Schlobach, Onno Boonstra

CEDAR (Census Data Research)
From fragment to fabric – Dutch census data in a web of global cultural and historic information1 is a multidisciplinary national research project. It is funded by the Royal Netherlands Academy of Arts and Sciences2 (KNAW) as part of the Computational Humanities Programme3. Its participants are Data Archiving and Networked Services4 (DANS), the VU University Amsterdam5, the International Institute of Social History6 (IISH) and the Erasmus University Rotterdam7.
The overall goal of CEDAR is to provide easier access to the Dutch historical census data by harmonizing and linking them [7]. The intended research audience of the project comprises historians and other humanities scholars interested in historical statistical information. To understand the importance of CEDAR, one has to know that for decades the Centraal Bureau voor de Statistiek8 (CBS, the Dutch Central Statistical Office), DANS and others have worked to make the Dutch historical censuses more readily available to the wider public as well as for research. Research use in particular is still hampered by the lack of comparability across time (harmonization). In the Netherlands, census sources go back to 1795. Up to 1971, a census was carried out in each decade, with different questions and levels of detail in the collected information. The remaining primary sources are books in which census summary tables, containing aggregated census information, were published. Those books have been scanned, and a later data entry project represented these tables as Excel files9. Both images and Excel files have been partially indexed and made available via a content management system that offers browsing and some search capabilities. However, the digital representation of the Dutch historical censuses in this form is not machine readable and, compared with current Big Data efforts in the Humanities, quite outdated. We use RDF to represent the Excel files in the Semantic Web, and to harmonize and link them to and from other datasets.
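
To make the idea of representing spreadsheet cells as RDF more concrete, the following is a minimal sketch and not the project's actual conversion pipeline. It turns one table cell into N-Triples using the RDF Data Cube vocabulary [2]. The `http://example.org/census/` namespace, the table identifier, the dimension names and the cell value are all assumptions made up for illustration.

```python
# Illustrative sketch only: one spreadsheet cell becomes one qb:Observation.
QB = "http://purl.org/linked-data/cube#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
EX = "http://example.org/census/"  # hypothetical namespace, not the CEDAR one


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def cell_to_ntriples(table_id, row, column, value, dimensions):
    """Describe a single cell as an RDF Data Cube observation.

    `dimensions` maps a (hypothetical) dimension name to the label found
    in the row or column headers of the source table.
    """
    obs = f"<{EX}observation/{table_id}/{column}{row}>"
    lines = [
        f"{obs} <{RDF_TYPE}> <{QB}Observation> .",
        f"{obs} <{QB}dataSet> <{EX}dataset/{table_id}> .",
        f'{obs} <{SDMX_MEASURE}obsValue> "{int(value)}"^^<{XSD_INTEGER}> .',
    ]
    # Sort the dimensions so the output is deterministic.
    for name, label in sorted(dimensions.items()):
        lines.append(
            f'{obs} <{EX}dimension/{name}> "{escape_literal(label)}" .'
        )
    return lines


# Made-up cell: column C, row 12 of a hypothetical 1889 table.
triples = cell_to_ntriples(
    "TABLE_1889", 12, "C", 431, {"municipality": "Amsterdam", "sex": "F"}
)
for line in triples:
    print(line)
```

At this stage the header labels are kept as plain literals; replacing them with links to shared classifications is the harmonization step.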

CEDAR seeks to answer fundamental questions about social history in the Netherlands and the world in automatic, web-scalable and reproducible ways. More concretely, the aim of CEDAR is to publish the Dutch historical censuses (1795-1971) in the Semantic Web, to build generic harmonization practices and tools that deal with the changing nature of our data, and to use this dataset as a starting point for a semantic data-web of socio-historical information. With such a web we can more easily answer questions across heterogeneous and once disconnected sources.
Sometimes, census data alone are not sufficient to answer these questions. CEDAR exploits Web standards10 to make census data interlinkable with other hubs of historical socioeconomic and demographic information. When integrated, these hubs can better support the historical research cycle. We have already connected our data to external classification systems and datasets: AMCO (Amsterdamse Code), to harmonize municipalities over time; HISCO, for the occupational titles found in our dataset; ICONCLASS, for art-related occupations; DBpedia, for inter-domain concept descriptions; and Dutch Ships and Sailors, to enrich census data with ship trade registers.
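
As an illustration of why links to a stable classification help, here is a small sketch of harmonizing municipality labels over time. It is not the CEDAR implementation: the codes `M001` and `M002`, the spellings and the counts are invented for the example and are not real AMCO codes or census figures.

```python
from collections import defaultdict

# Hypothetical mapping: (census year, label as printed) -> stable code.
# In practice this role would be played by a classification such as AMCO.
MUNICIPALITY_CODES = {
    (1859, "'s Gravenhage"): "M001",
    (1889, "Den Haag"): "M001",
    (1859, "Amsterdam"): "M002",
    (1889, "Amsterdam"): "M002",
}


def harmonize(records, mapping):
    """Group (year, label, count) records into one time series per code.

    Returns the series and the list of (year, label) pairs that the
    mapping does not cover, so gaps stay visible instead of being dropped
    silently.
    """
    series = defaultdict(dict)
    unmapped = []
    for year, label, count in records:
        code = mapping.get((year, label))
        if code is None:
            unmapped.append((year, label))
            continue
        series[code][year] = series[code].get(year, 0) + count
    return dict(series), unmapped


# Made-up records: the same place appears under two spellings.
records = [
    (1859, "'s Gravenhage", 100),
    (1889, "Den Haag", 150),
    (1889, "Rotterdam", 70),  # not in the mapping on purpose
]
series, unmapped = harmonize(records, MUNICIPALITY_CODES)
print(series)
print(unmapped)
```

Once both spellings resolve to the same code, the two census years form a single comparable series, while the unmapped label is reported for a domain expert to review.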

This broad aim unavoidably touches upon many interdisciplinary research areas and audiences. Publishing socio-historical data on the Web in a semantically rich and consistent manner poses fundamental challenges for Knowledge Representation and Reasoning, a key field in Artificial Intelligence (AI). The deployment of tools and methods to achieve these goals in a reproducible and efficient way is closely related to Software Engineering and Computing. On the other hand, Social History, located at the crossroads between history and the social sciences, produces fundamental research questions about social change and suggests domain-specific models, harmonizations and standards11 for socio-historical data. The interplay between Computing and the Humanities (the basic components of the Digital Humanities) in CEDAR works in two ways: (a) we use AI and Computing to bring infrastructure, scale, formalism and reproducibility to Social History issues; and (b) we use Social History to inspire AI and Computing with new algorithms, practices, methods and tools.

Specific Challenges
The historical Dutch censuses have been collected for almost two centuries, with different information needs at different times [1]. Census bureaus are notorious for changing the structure, classifications, variables and questions of the census in order to meet the information needs of a society. Not only do variables change in their semantics over time, but the classification systems in which they are organized also change significantly, making it extremely cumbersome to use the historical censuses for longitudinal analysis. The structure of the spreadsheets and the changing characteristics of the census currently do not allow comparisons over time without extensive manual input from a domain expert. Even when the data are converted into Web structured data, harmonization across all years remains a prerequisite for greater use of the census by researchers and citizens. The goal of CEDAR is to integrate the Dutch historical censuses in these spreadsheets using Web technologies and standards; to publish the result of this integration as five-star Linked Open Data; and to investigate how semantic technologies can improve the research workflow of historians. Concretely, the main contributions of the dataset are:

– it is the first historical census data made available as LOD, integrated and Web-enabled from heterogeneous sources;
– it is released together with auxiliary resources, like historical classification schemes and integration mappings;
– it is linked to other datasets in the LOD cloud to improve its exposure and richness;
– it is built upon a transparent and structured harmonization workflow.

Additionally, the Dutch historical censuses Linked Open Data comes with the following features:

– Historical statistics on two centuries of Dutch history, fully compliant with RDF Data Cube [2]
– Standardization and harmonization procedures encoded using Open Annotations [3]
– A harmonized and curated subset of the data for seven census years between 1859 and 1920
– Full tracking of provenance for all activities and consumed/produced entities, following PROV [4]
– Dereferenceable URIs12
– A human browseable web front-end13
– Dataset live statistics14
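
The combination of Open Annotation [3] and PROV [4] listed above can be pictured with a small sketch. It is an assumption-laden illustration rather than the dataset's actual modelling: the `oa:` and `prov:` terms are the standard vocabulary terms, but every `http://example.org/...` URI, the function name and the timestamp are hypothetical.

```python
from datetime import datetime, timezone

OA = "http://www.w3.org/ns/oa#"
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def annotate_cell(cell_uri, code_uri, rule_uri, annotation_uri, when):
    """Return (subject, predicate, object) triples for one harmonization step.

    The annotation targets a source cell and has a standard classification
    code as its body; the rule application is recorded as a PROV activity.
    """
    return [
        (annotation_uri, RDF_TYPE, OA + "Annotation"),
        (annotation_uri, OA + "hasTarget", cell_uri),
        (annotation_uri, OA + "hasBody", code_uri),
        (annotation_uri, PROV + "wasGeneratedBy", rule_uri),
        (rule_uri, RDF_TYPE, PROV + "Activity"),
        (rule_uri, PROV + "used", cell_uri),
        (rule_uri, PROV + "endedAtTime", when.isoformat()),
    ]


# All URIs below are placeholders invented for this example.
step = annotate_cell(
    cell_uri="http://example.org/census/cell/TABLE_1889/B12",
    code_uri="http://example.org/classification/occupation/baker",
    rule_uri="http://example.org/census/rule-run/1",
    annotation_uri="http://example.org/census/annotation/1",
    when=datetime(2015, 1, 1, tzinfo=timezone.utc),
)
for triple in step:
    print(triple)
```

Keeping the rule application as a separate activity means a later reader can ask which rule produced a given harmonized value and which source cell it consumed.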

Publishing the Dutch historical censuses as five-star Linked Open Data has a deep impact on the methodology that historians and social scientists have traditionally followed to study this dataset [1]. Due to the limitations of the old formats, the dataset could not be used to its full potential. To date, most research based on the historical Dutch censuses has focused on specific comparable years [5]. To unlock the full potential of the historical censuses, researchers have identified harmonization of the data as a key aspect, which we implement using Open Annotation. Previously, researchers who wanted to know, for example, the number of houses under construction in the Netherlands per municipality between 1859 and 1920 had to consult 59 different Excel tables and carry out laborious data transformations. Moreover, keeping track of the provenance of all performed operations was cumbersome and relied on data munging and delicate assumptions. With explicit harmonization rules and links to standard classifications for occupations, municipalities, religions and house types, researchers can now answer such queries in a fraction of the time needed to dig manually through disparate Excel tables.
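
The houses-under-construction question above could be phrased as a single SPARQL query over harmonized observations. The sketch below only builds the query and the HTTP request; it does not contact any server. The `ex:` dimension names (`houseType`, `municipality`, `year`), the `ex:underConstruction` value and the endpoint URL are assumptions for illustration, and the real dataset's property names may differ.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Doubled braces are literal braces; {start} and {end} are filled in below.
QUERY_TEMPLATE = """\
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX ex: <http://example.org/census/dimension/>

SELECT ?municipality ?year (SUM(?value) AS ?houses)
WHERE {{
  ?obs a qb:Observation ;
       ex:houseType ex:underConstruction ;
       ex:municipality ?municipality ;
       ex:year ?year ;
       sdmx-measure:obsValue ?value .
  FILTER(?year >= {start} && ?year <= {end})
}}
GROUP BY ?municipality ?year
ORDER BY ?municipality ?year
"""


def build_query(start, end):
    """Fill the census year range into the query template."""
    return QUERY_TEMPLATE.format(start=int(start), end=int(end))


def build_request(endpoint, query):
    """Prepare (but do not send) a SPARQL protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


query = build_query(1859, 1920)
# Placeholder endpoint; substitute the address of an actual SPARQL endpoint.
request = build_request("http://example.org/sparql", query)
print(query)
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out here so the example runs without network access.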

As five-star Linked Open Data, the census dataset is open for longitudinal analysis, especially for the study of change. The change over time in the structure of classifications, the meaning of variables and the semantics of concepts, known as concept drift [6], is of major interest for historical research and a fundamental topic to explore. A set of tools built on top of the dataset is already available. For instance, social historians of the NLGIS project query the endpoint to get historical census data and plot it on a map. The dataset adds to other initiatives that publish census data on the Web as RDF Data Cube [2]. To the best of our knowledge, ours is the first effort to publish censuses with historical characteristics as Linked Data.

References
1. Ashkan Ashkpour, Albert Meroño-Peñuela, Kees Mandemakers. “The Aggregate Dutch Historical Censuses: Harmonization and RDF”. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 48(4), pp. 230-245, 2015.
2. Richard Cyganiak, Dave Reynolds, and Jeni Tennison. The RDF Data Cube Vocabulary. Tech. rep. http://www.w3.org/TR/vocab-data-cube/. World Wide Web Consortium (W3C), 2013.
3. Robert Sanderson, Paolo Ciccarese, and Herbert Van de Sompel. Open Annotation Data Model. Tech. rep. http://www.openannotation.org/spec/core/. W3C, 2013.
4. Yolanda Gil, Simon Miles. PROV Model Primer. Tech. rep. https://www.w3.org/TR/prov-primer/. W3C, 2013.
5. Onno Boonstra et al. Twee Eeuwen Nederland Geteld. Onderzoek met de digitale Volks-, Beroeps- en Woningtellingen 1795-2001. The Hague: DANS en CBS, 2007.
6. Shenghui Wang, Stefan Schlobach, and Michel C. A. Klein. “What Is Concept Drift and How to Measure It?” In: Knowledge Engineering and Management by the Masses – 17th International Conference, EKAW 2010. Proceedings. Lecture Notes in Computer Science, 6317, Springer, 2010, pp. 241–256.
7. Albert Meroño-Peñuela, Ashkan Ashkpour, Christophe Guéret, Andrea Scharnhorst, Sally Wyatt. “CEDAR: Linked Open Census Data: Project Statement”. DH Commons Journal, 1(1), 2015. http://dhcommons.org/journal/issue-1/cedar-linked-open-census-data

Key publications
• Meroño-Peñuela, Albert, Ashkan Ashkpour, M. van Erp, Kees Mandemakers, Leen Breure, Andrea Scharnhorst, Stefan Schlobach and Frank van Harmelen. 2014. ‘Semantic technologies for historical research: A survey’ Semantic Web Journal, 6 (6), pp. 539-564, DOI: 10.3233/SW-140158

• Ashkpour, Ashkan, Albert Meroño-Peñuela, and Kees Mandemakers. 2015. ‘The Aggregate Dutch Historical Censuses: Harmonization and RDF’ Historical Methods: A Journal of Quantitative and Interdisciplinary History, 48 (4), pp. 230-245, DOI: 10.1080/01615440.2015.1026009

• Meroño-Peñuela, Albert, Ashkan Ashkpour, Christophe Guéret and Stefan Schlobach. 2015. ‘CEDAR: The Dutch Historical Censuses as Linked Open Data’ Semantic Web Journal. http://www.semantic-web-journal.net/content/cedar-dutch-historical-censuses-linked-open-data-1

• Meroño-Peñuela, Albert, Ashkan Ashkpour, Christophe Guéret, Andrea Scharnhorst and Sally Wyatt. 2015. ‘CEDAR: Linked Open Census Data: Project Statement’ DH Commons Journal, 1 (1). http://dhcommons.org/journal/issue-1/cedar-linked-open-census-data

1 See http://cedar-project.nl/
2 See http://knaw.nl/
3 See http://ehumanities.nl/
4 See http://dans.knaw.nl/
5 See http://vu.nl/
6 See http://socialhistory.org/
7 See http://www.eur.nl/
8 See http://www.cbs.nl/
9 See legacy data at http://www.volkstellingen.nl/ and https://github.com/CEDAR-project/DataDump/
10 See http://www.w3.org/standards/
11 See http://www.clio-infra.eu/
12 See http://lod.cedar-project.nl:8888/cedar/page/harmonised-data-dsd
13 See http://lod.cedar-project.nl/cedar/
14 See http://lod.cedar-project.nl/cedar/stats.html