Tracking down the habitat of folk songs

“And the song of the prince and the princess, did you not know that one yet?”
“Not in so much detail. Did you sing this song yourself?”
“I sang them all.”
“When was this?”
“Well, during work. And at dusk, when we all sat together. We sang a lot then.”
Elshout nodded. He turned his coffee cup around, carefully finished his coffee, put the cup down, and looked at her again.
“Do you still know the melody of the prince and the princess?”
“Yes,” she said, hesitating, “but I cannot sing it anymore.”
“Could you give it a try?” He leafed through her notebook and pushed her own text towards her.
“I don’t know if I still can,” she said shyly.
“It’s going to be fine.” He pushed the buttons of the recorder and held the microphone in her direction.
(Translated from J.J. Voskuil, Het Bureau, Vol. 2.)

This little scene from Voskuil’s epic series of novels, Het Bureau, paints part of the story behind the collection of Dutch folk songs housed at the Meertens Institute: we catch a glimpse of Ate Doornbosch, one of the collectors, coaxing a song from an elderly lady who had sent him a notebook with folk song lyrics she could still remember from her childhood. This might be the song recorded on that day:

Approximately 50 years after this scene, the songs collected by Ate Doornbosch and others are the starting research material of the Tunes & Tales project. With our computational research, we largely focus on the digitized transcriptions of the songs, comparing the notes of related folk songs with each other. The variants seem to have a life of their own: like organisms, they share a large amount of melodic material, while other parts of the melodies can be very distinct. These observations have inspired terminology such as “tune family”, which groups related songs together as stemming from the same ancestor melody; variation is then like the little mutations that occur from one generation to another.

But many of the folk songs in the collection are also the last of their kind: they have not made it to another generation of singers. They have fossilized, as it were, thanks to recording and digitization technology, so they can still be listened to and studied. However, it is unlikely that they are going to be as widely known again as they once were.

This led to the question of whether there is anything we can learn about the habitat of the songs when they were still alive and kicking, and how the conditions had changed by the time the songs were recorded. It is probably not a question easily answered, but the Dutch folk song database contains some information about the singers and the surroundings in which they learned the songs. This can be seen in this small biography of the singer heard in the recording, which states that she was born in Onstwedde (in the Dutch province of Groningen) in 1895, and that her father was a farm hand and later a shopkeeper.

Some of this data, such as occupation, year of birth, and location, is remarkably similar to the information contained (albeit on a much, much larger scale) in the historical census data from the CEDAR project. This made us curious to see if we could connect the two datasets, and whether we could visualize the “habitat” of the folk songs: the singers and their socio-economic environment. Would different generations group together around 1900, as young and old worked in similar occupations, providing the occasions on which singing together and passing on songs could occur? Would this be less common in the middle of the 20th century, as working and living environments changed?

Problems linking the Dutch folk song database to historical counts
The data from the Dutch folk song database, such as given in the example above, contain a wealth of information, but unfortunately, they are not necessarily structured. For some singers, there is a separate “occupation” field, but sometimes this information is part of a longer text, and therefore much harder to process computationally, or it might even be absent altogether. This led us to the decision to start from one case, that of a singer who grew up on a farm in the province of Groningen, and who still lived in the same area at the time of the recording. For Groningen, we would compare two years of statistical counts from the CEDAR dataset: one in which the singer would have been a child or teenager, and one close to the recording date.

The problems in finding suitable historical counts in the Dutch historical census data are related to the changing nature of the census. Throughout its history, different questions, processing methods, variables, classification systems and so forth have been introduced, making it difficult to compare the data across time. The data are also represented at different levels of detail: for some years we have very detailed counts and classifications, which are lacking for other years. Moreover, data may be represented at the provincial or the municipal level.

We chose the historical counts from the years 1899 and 1947 for this reason: they were the most comparable in their representation of occupational and locational data. The singer from our example above was only 4 in 1899, so she was probably not actively learning and singing the folk songs then. The later count of 1909, when she was 14, made it more difficult to access data on a provincial level, however. In 1947, the singer was 52, around the age that some of the adults who taught her songs in her youth might have been.

Figure 1. Multiple Correspondence Analysis factor map on occupation and age of inhabitants of Groningen, 1899.

Figure 2. Multiple Correspondence Analysis factor map on occupation and age of inhabitants of Groningen, 1947.

Mapping discrete dimension values through dimensionality reduction
The data of the Dutch historical censuses are represented following a multidimensional model called RDF Data Cube. The fundamental aspect of this model is that it represents observations as a set of dimensions affecting a measurement. For instance, a valid observation in RDF Data Cube would be a statement such as “12 persons, who happened to be single women, were aged below 20 and worked as diamond polishers in Amsterdam”. 12 persons would be the measurement, while marital status, sex, age, occupation, and location would be dimensions with specific values (“single”, “female”, “below 20”, “diamond polisher” and “Amsterdam”, respectively).
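To make the distinction between dimensions and measurement concrete, here is a minimal sketch in plain Python. It does not use RDF or the actual CEDAR vocabulary: the class `Observation`, the helpers `make_observation` and `total_population`, and the dimension names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One census observation: dimension values plus a single measurement."""
    dimensions: tuple  # sorted (dimension, value) pairs
    population: int    # the measurement


def make_observation(population, **dimensions):
    """Build an observation from keyword arguments (hypothetical helper)."""
    return Observation(tuple(sorted(dimensions.items())), population)


def total_population(observations, **fixed):
    """Sum the measurement over observations matching the given dimension values."""
    return sum(
        o.population
        for o in observations
        if all(dict(o.dimensions).get(k) == v for k, v in fixed.items())
    )


# The diamond polisher example from the text: five dimensions, one measurement.
obs = make_observation(
    12,
    marital_status="single",
    sex="female",
    age="below 20",
    occupation="diamond polisher",
    location="Amsterdam",
)
```

A call such as `total_population([obs], location="Amsterdam")` then aggregates the measurement over every observation that shares the fixed dimension values, which is the kind of slicing a data cube model is meant to support.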

A common problem with multidimensional data is that high dimensionality is almost impossible to represent in a way that humans can explore, analyse and understand. In the example above, a single observation is described by five dimensions plus a measurement. How could the single data point [12 persons, women, single, below 20, diamond polisher, Amsterdam] be represented graphically? Moreover, how could all points of our dataset be represented in a low-dimensional (say, two- or three-dimensional) space so that we could visually explore how these points are organized?

This problem is addressed by dimensionality reduction techniques, and specifically by feature extraction techniques, which aim to transform data in a high-dimensional space into a space of fewer dimensions. A popular technique in this area is multiple correspondence analysis (MCA). MCA, an extension of correspondence analysis to more than two categorical variables, finds orthogonal components (dimensions) and ranks them by the amount of variance in the high-dimensional input data that they account for. As a result, one can usually pick the first two of these dimensions (i.e. the two that best explain the variance of the data) and plot the data points along them, resulting in maps like those shown in Figures 1 and 2.

These maps can then be interpreted by simple exploration. For instance, the points close to the origin of the coordinates represent the values that are seen more often in the dataset. Distances are also meaningful: when values appear close to each other in the map, they often occur together in the data. This makes it easier to explain trends in the data.
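The procedure described in this section can be sketched in a few lines of Python. This is not the code used to produce Figures 1 and 2, only a minimal illustration of the textbook approach: correspondence analysis of a count-weighted indicator matrix, computed with NumPy's singular value decomposition. The function name `mca_coordinates` is hypothetical, and the counts in `toy` are invented for the example; they are not census figures. The sketch assumes every count is positive.

```python
import numpy as np


def mca_coordinates(observations, n_components=2):
    """Sketch of MCA for (count, {dimension: value}) pairs with positive counts.

    Returns category labels, their principal coordinates on the first
    n_components axes, and the share of inertia (variance) per axis.
    """
    labels = sorted({(d, v) for _, dims in observations for d, v in dims.items()})
    col = {label: j for j, label in enumerate(labels)}

    # Count-weighted indicator matrix: one row per observation,
    # one column per (dimension, value) category.
    Z = np.zeros((len(observations), len(labels)))
    for i, (count, dims) in enumerate(observations):
        for d, v in dims.items():
            Z[i, col[(d, v)]] = count

    P = Z / Z.sum()        # correspondence matrix
    r = P.sum(axis=1)      # row masses
    c = P.sum(axis=0)      # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)  # standardized residuals

    # Singular values come back sorted, so axes are ranked by explained inertia.
    _, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    coords = (Vt.T * sigma) / np.sqrt(c)[:, None]
    inertia = sigma ** 2 / (sigma ** 2).sum()
    return labels, coords[:, :n_components], inertia[:n_components]


# Invented toy counts, for illustration only.
toy = [
    (30, {"sex": "female", "age": "14-15", "occupation": "textile"}),
    (5, {"sex": "male", "age": "14-15", "occupation": "textile"}),
    (40, {"sex": "male", "age": "22-35", "occupation": "farming"}),
    (10, {"sex": "female", "age": "22-35", "occupation": "farming"}),
    (20, {"sex": "male", "age": "50+", "occupation": "trade"}),
]
labels, coords, inertia = mca_coordinates(toy)
```

In this toy dataset every person aged 14-15 works in textile, so the two categories have identical profiles and land on exactly the same point of the map. This is the extreme case of the rule that values which often occur together are plotted close to each other.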

Interpreting the MCA visualizations
In Figure 1, we see the MCA from the historical counts for Groningen in 1899. The young ages (encoded with both their age and their approximate years of birth) are all clustered on the left side of the map, as are the women. Workers are mostly clustered in the first quadrant: crafts and industry, but also farming (“Landbouw”), are clustered closely together with men aged 22-35. Older ages cluster with occupations such as trade and insurance, in the fourth quadrant. Only the textile industries form an island on the left of the map, close to ages 14-15. Is our young singer from Groningen not representative of her generation then, as we would have expected young ages to cluster with the farmwork? A more likely interpretation is that a lot of work is invisible in the census, as young people helping out on their parents’ farm might not have been registered as having an occupation as such.

In Figure 2, based on the historical count of 1947, this kind of work is represented as “medew. gezinsleden”, family members working along in a business, a category which clusters most closely with women and young people between 14 and 24 years of age. Men and people aged 25-39 are still most active in farming and manual work, while older workers tend to own their own business (“bedrijfshoofd”). It is hard to see a change in the constellations of the generations from Figure 1 to Figure 2, as the occupation names change (the earlier census having much finer categories). Our hope to see different generations clustered in 1899, and less so in 1947, is not fulfilled.

Discussion and outlook
The fertile grounds for folk song transmission remain an elusive concept: the contact between the generations that we hoped to see in 1899 is not visible because young people seem not to have been registered with an occupation. The predicted decrease in proximity between the generations in 1947 cannot be observed either, as the category “medew. gezinsleden” actually makes visible some proximity that could not be seen in the data of the earlier count.

There is still a lot of potential in linking the Dutch folk song database and the CEDAR data, however. Instead of comparing two different historical counts, it might be an option to compare different occupations in one year. For instance, in the MCA for 1899 we have seen that there is not much overlap between workers in the farming and textile industries. Would the variants of folk songs sung by former textile workers then be melodically different from the variants the former farmers would have sung?

In order to address these and other questions, steps would need to be taken towards linking these datasets together automatically. This could be done by using their geographical (e.g. sdmx:refArea) and temporal (e.g. sdmx:refPeriod) properties together to find the closest census snapshots that provide historical context for a certain query. This would be a generalisation of the work presented above, in which the input queries (e.g. singers based in Groningen born in 1895) work as entry points to the CEDAR census data. As a result, the census data can provide knowledge maps (such as those shown in Figures 1 and 2) describing the historical context of the closest area and time.
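A minimal sketch of such a lookup is given below, under strong simplifying assumptions: each census snapshot is a plain Python dictionary keyed by the property names sdmx:refArea and sdmx:refPeriod, areas are matched by exact name, and periods are plain integer years. The function `closest_snapshot` is hypothetical, and the list of snapshots only contains the three census years mentioned in this post. A real implementation would query the linked data directly.

```python
def closest_snapshot(ref_area, ref_period, snapshots):
    """Return the snapshot for ref_area whose year is nearest to ref_period.

    Returns None when no snapshot covers the requested area.
    On a tie, the snapshot listed first wins.
    """
    candidates = [s for s in snapshots if s["sdmx:refArea"] == ref_area]
    if not candidates:
        return None
    return min(candidates, key=lambda s: abs(s["sdmx:refPeriod"] - ref_period))


# Census years mentioned in the text, simplified to one entry per year.
snapshots = [
    {"sdmx:refArea": "Groningen", "sdmx:refPeriod": 1899},
    {"sdmx:refArea": "Groningen", "sdmx:refPeriod": 1909},
    {"sdmx:refArea": "Groningen", "sdmx:refPeriod": 1947},
]

# Entry point: a singer based in Groningen and born in 1895.
context = closest_snapshot("Groningen", 1895, snapshots)
```

A fuller version would probably look up several periods per singer, for example one near the year of birth and one near the recording date, mirroring the two counts compared in this post.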