CEDAR meets the 1st International Workshop on Semantic Statistics

CEDAR had several activities planned in Sydney during the last weeks of October, and the 1st International Workshop on Semantic Statistics (SemStats) was the first of them. As one of the satellite events of the 12th International Semantic Web Conference (ISWC), SemStats aimed to bring together, for the first time, the statistics and Semantic Web communities. Albert Meroño-Peñuela presented a paper on concept drift in the Dutch historical censuses, studying the comparability of statistical time series when they are represented as Linked Data.

Franck Cotton and Raphael Troncy chaired the workshop effectively and kept all presentations on time. The keynote was given by Siu-Ming Tam, head of research at the Australian Bureau of Statistics (ABS), who gave many insights into the hopes that statisticians place in the Semantic Web and semantic technologies, but also into their well-argued fears about their responsibility as data curators and storytellers. Michael Mecham then took the stage to tell us about the adoption of semantic technologies at the ABS through the Concepts-Sources-Method (CSM) framework, stressing the importance that National Statistical Offices (NSOs) attach to trust and to giving data the right interpretation.

After the ABS presentations, it was time for talks about Linked Data. Thomas Bosch presented models for person-level data that use all the well-known vocabularies for statistics in Linked Data (SKOS, XKOS, DCAT, DISCO, DDI and PROV). Franck Cotton gave many insights into the XKOS vocabulary, an extension of SKOS especially intended for classification systems and data comparability. Hideo Sato explained how the statistical office in Japan matches Statistical Linked Data. Laurent Lefort showcased an interesting case study of statistical data in the medical domain, highlighting the very cool feature of nesting data cubes when using the RDF Data Cube vocabulary. Sarven Capadisli showed his work on converting several statistical datasets to RDF Data Cube, emphasising how important SDMX is for statistical data (and how important it is for RDF Data Cube to support it). Peter Haase presented a great technique, inspired by a Google paper, for finding statistical datasets related to the one we are working with (containing data that is similar and complementary at the same time) through a heuristic that tells us to what extent one data source “contextualises” another. Camille Pradel gave a talk about OLAP transformations, in particular on how to translate OLAP operations into SPARQL queries using an algebra that fits the constellation model. Evangelos Kalampokis described an approach to incorporating predictive models into the Semantic Web, representing these models as Linked Data and exploiting social media for predictions.

Albert presenting at ISWC

After this, it was Albert Meroño-Peñuela’s turn to present a paper on extensional concept drift in the Dutch historical censuses. He stressed the importance of the problem of comparability in time series, that is, how to compare data collected at different points in time knowing that the meaning of the same concept may have changed (drifted). The results were received with enthusiasm and, although some statisticians had comments about the formalism, it was agreed that the problem exists and needs to be solved. The discussion also helped to place the problem in the (wider) context of data comparability, which was a recurring theme throughout the day.
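Extensional drift refers to a change in the set of things a concept covers. As a rough illustration only (not the method of the paper), one could measure it as the Jaccard distance between the extensions of the same concept in consecutive census years. The function names, years and occupation labels below are all made up for the example.

```python
def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def extensional_drift(extensions):
    """Map each pair of consecutive years to the drift between the
    concept's extensions in those two years (0.0 = unchanged)."""
    years = sorted(extensions)
    return {
        (earlier, later): jaccard_distance(extensions[earlier], extensions[later])
        for earlier, later in zip(years, years[1:])
    }


# Hypothetical extension of one occupation class in three census years.
occupation_class = {
    1889: {"farmer", "farm labourer", "shepherd"},
    1899: {"farmer", "farm labourer", "gardener"},
    1909: {"farmer", "farm labourer", "gardener"},
}

drift = extensional_drift(occupation_class)
for (earlier, later), score in drift.items():
    print(f"{earlier} -> {later}: drift = {score:.2f}")
```

A score of 0 means the class covers exactly the same members in both years, while a score close to 1 suggests that figures reported under the same label are hard to compare directly.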

The last session of the workshop consisted of the workshop challenge and the prize ceremony. For the challenge, the organisers encouraged participants to apply their methods and techniques to two given RDF Data Cube datasets (the Australian and French census results of 2011 and 2010, respectively). Albert Meroño-Peñuela presented a variation of the concept drift detection technique that uses non-temporal dimensions for data comparability, while Heiko Paulheim summarised the University of Mannheim’s work on massive pattern analysis and comparison of Statistical Linked Data. This great presentation earned Heiko and his colleagues the SemStats challenge prize, while the workshop's best paper award went to Sarven Capadisli for his strong emphasis on reducing the gap between the two communities. Both prizes were iPad devices provided by Datalift, official sponsor of the workshop.

By the end, all the attendees shared the feeling that the outcomes of this 1st International Workshop on Semantic Statistics were very valuable for both communities. From the statisticians’ side, the perception was that semantic technology is very mature, and that publishing and consuming structured statistical data on the Web can lead to statistical analysis of the utmost rigour. From the Semantic Web side, the take-home message was that more work has to be done to ensure data comparability, trust in sources, and the provision of insightful and rigorous interpretations beyond raw data. Overall, it was a great journey of knowledge sharing, interdisciplinary learning, and massive social networking that will (hopefully) be followed up at ISWC 2014.