Who is the next Sally Wyatt?

Inspired by the workshop Stylometry at the European Summer School for Digital Humanities

If you are thinking about becoming a stylometrist, you should definitely learn from Europe’s finest: Maciej Eder and Jan Rybicki. Together with Mike Kestemont, they have developed a library in R with which anyone with computational stylometric aspirations can perform thorough analyses. You can find more information on the website of their Computational Stylistics Group. This summer I attended one of their workshops in an attempt to master their stylo package. This particular one was held at the European Summer School in Leipzig. In order to demonstrate what I have learned, I will show which one of the eHumanities project members will be the next Sally Wyatt, style-wise.

Workshop Stylometry at the European Summer School
Before I traveled to Leipzig, I was not completely unfamiliar with stylometry in R. Maciej Eder was a visiting fellow at the eHumanities group from April until June 2013. During his stay in the Netherlands, I attended several of his lectures about authorship attribution and the possibilities for related stylistic research questions. This really inspired my own research into the formal characteristics of modern Dutch novels [please read my personal blog], and it made me decide to pack my bags and travel to eastern Germany.

During the two weeks in Leipzig, five other students and I acquired the basic knowledge of stylometry and learned how to perform our own research with the software. Fortunately, Jan and Maciej have developed a GUI, so non-computer scientists like myself do not have to deal with command lines in the mysterious R language. Although it was 40 degrees and we were seated in a room without AC or any sunlight, Maciej and Jan managed to keep everyone attentive, inspired and entertained. What made me even more enthusiastic were the really interesting and publication-worthy results that I found in the corpus that I brought, after only three days! But because I am waiting to get my abstract about these results accepted by the Digital Humanities Conference 2014, I cannot go into further detail about them here. Instead, in order to show what I have learned, I will perform a little stylometry task on publications written by PhD students, post-docs and professors of the eHumanities projects funded by the KNAW: CEDAR, Tunes & Tales, Elite Network Shifts and The Riddle of Literary Quality. The question that I would like to answer is: who is the next Sally Wyatt?

The Sally Wyatt attribution task
We all admire her, the project leader of our eHumanities Group, but we’d rather be her. Who could actually become the next Sally Wyatt? I will answer this question by examining whose writing style is most similar to Sally’s.

1.  Compiling the corpus
The first step of the process was harassing my eHumanities colleagues. Since I am interested in their academic writing style, I asked them to give me a couple of their publications. Because Sally is a native English speaker, the publications all had to be written in English. Everyone was willing to cooperate, but some problems still occurred. First of all, most research within the field of Digital Humanities is performed collaboratively, because very often different skills need to be combined. Therefore, most articles are written with co-authors, and as a result not that many single-authored publications were available. The second problem was that some contributions turned out to be very extensive (the corpus includes a number of theses), whereas the length of a paper is usually limited to a couple of pages. And last but not least, not everyone had published since they joined the eHg, so some contributions date from before that time. This means that if one of these contributions is most similar to Sally’s writing style, the author concerned could have had similar background experience. However, because the number of contributions was already limited due to problem 1, I decided to include them in the analysis anyway.
In the end, I gathered 44 publications: 42 by the project members and two by Sally (see Table 1). Some articles were written by multiple authors, but I decided to only mention the eHumanities members (other authors are referred to as et al.).

Table 1: Corpus

tabel1web

2. Cleaning the content
Before I could start with the stylometric analyses, I had to remove all the content that is not relevant for the stylometric measurements, such as the front and back matter of the publications. I decided that the core text to be analyzed begins at the introduction and ends after the acknowledgements. Title pages, abstracts, notes, references, appendixes, etc. were manually removed.

3.  Unsupervised stylometric analyses
In order to computationally find Sally’s nearest neighbor, I started with an unsupervised stylometric approach. This type of stylometry is based on similarity detection. It is usually used when there is no prior knowledge of the author, clusters, or categories (such as gender or genre). Quite often, however, this information is already available. In that case the unsupervised approach can be used to test the categories and to interpret them. Supervised methods differ in that a set of documents of, for instance, known authorship is used to classify a document of unknown authorship by looking for its nearest neighbor. Of course this attribution task can also be performed on texts of known authorship in order to compare writing styles.

a.   100 Most Frequent Words
The stylo package in R compiles a word frequency list for the entire corpus. In the first analysis, I look at the 100 most frequent words in the corpus. Usually the most frequent words (MFWs) are function words (prepositions, articles, determiners, pronouns) and not content words, because the latter are very topic dependent. However, the list of the 100 MFWs for this corpus shows that 20 out of the 100 words are content words after all. Words like data, research and results are of course very frequently used in academic writing: the, of, and, to, a, in, is, for, that, as, we, are, this, be, on, which, with, by, it, an, not, can, from, have, or, more, these, data, at, research, our, between, has, all, also, one, their, will, but, other, use, information, two, they, model, number, some, such, different, was, there, organizations, text, only, results, than, first, were, time, used, new, been, no, based, systems, both, middle, if, do, most, each, using, into, features, analysis, table, managers, so, about, same, test, when, social, however, what, set, how, structure, melodies, within, level, well, where, methods, study, top, control, its, digital, approach. Figure 1 shows the distances between the publications. The smaller the distance, the more similar the styles. As you can see, most of the publications cluster together by author, except for Jautze_Measuring, Koolen_Conference, Karsdorp et al_Idenfitying, Karsdorp et al_In search, Reinanda et al_Identifying and Ashkpour_Thesis. This is probably due to the fact that not all the contributions are single-authored, which means that the writing styles in some articles are intertwined, whereas those in other publications are not. It could also be the case that some topics require different kinds of function words or academic content words than others, as a result of which not the authors but rather the topics cluster together.
By the way, this first dendrogram suggests that Koolen’s Conference is the nearest neighbor of Sally Wyatt.
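For readers who wonder what happens under the hood, the MFW approach can be sketched in a few lines of Python. This is a minimal illustration and not the stylo implementation: the tokenizer is deliberately naive, the function names are hypothetical, and I assume Burrows's Delta (the mean absolute difference of z-scored word frequencies) as the distance measure, which the analysis above does not specify.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase alphabetic strings only."""
    return re.findall(r"[a-z]+", text.lower())


def most_frequent_words(texts, n):
    """Return the n most frequent words counted over the whole corpus."""
    counts = Counter()
    for text in texts.values():
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]


def relative_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary word in a single text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero for an empty text
    return [counts[word] / total for word in vocabulary]


def delta_distances(texts, n_mfw):
    """Pairwise Burrows's Delta between all texts, based on n_mfw words."""
    vocabulary = most_frequent_words(texts, n_mfw)
    names = list(texts)
    freqs = {name: relative_frequencies(texts[name], vocabulary) for name in names}
    # z-score every word across the texts, so that each word weighs equally
    z = {name: [] for name in names}
    for i in range(len(vocabulary)):
        column = [freqs[name][i] for name in names]
        mu, sigma = mean(column), pstdev(column)
        for name in names:
            z[name].append((freqs[name][i] - mu) / sigma if sigma else 0.0)
    width = len(vocabulary) or 1
    return {
        (a, b): sum(abs(x - y) for x, y in zip(z[a], z[b])) / width
        for a in names
        for b in names
    }
```

Calling delta_distances with a dictionary that maps file names to texts returns a distance for every pair of texts; a cluster analysis such as the one in Figure 1 is then built on top of such a distance table.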

Figure 1: A Consensus Tree showing similarity and distances between the authors based on the frequencies of 100 MFW. The colors and initials indicate the different projects.

Kim's-stylometry-figure-1web

This problem concerning topic could perhaps (partly) be solved by removing the most content-specific words. To this end, the corpus can be culled. When you cull at 100%, for instance, only the words that occur in every single text of the corpus are kept, so all the words that are characteristic for individual texts are removed. Let’s see what happens if we cull the corpus at 100%.

b.   Corpus culled at 100%
Unfortunately, the aforementioned problem is not solved with regard to authorship attribution. First of all, it must be noted that only 29 MFWs are shared by all the publications in the corpus once culling at 100% has deleted the words that are characteristic for individual (or only a couple of) texts. Certain publications suddenly no longer cluster together with their siblings, whereas they did in Figure 1. With regard to style, it is hard to explain why these clusters have changed. But before I continue with other analyses, it must be said that according to this consensus tree, Ashkpour’s Thesis is now the nearest neighbor of Sally Wyatt.
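The culling step itself is easy to sketch. The snippet below is a minimal Python illustration under my own assumptions (a naive tokenizer and a hypothetical function name), not the code that stylo runs: a word survives culling at a given percentage if it occurs in at least that percentage of the texts.

```python
import re


def cull(texts, percent):
    """Return the words that occur in at least `percent`% of the texts.

    texts is a dict mapping a (hypothetical) file name to its raw text.
    """
    words_per_text = [set(re.findall(r"[a-z]+", text.lower())) for text in texts.values()]
    all_words = set().union(*words_per_text)
    threshold = percent / 100 * len(words_per_text)
    return {
        word
        for word in all_words
        if sum(word in words for words in words_per_text) >= threshold
    }
```

With percent set to 0 every word is kept, while 100 keeps only the words shared by all texts, which is how a corpus-wide list can shrink to a handful of function words.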

Figure 2: A Consensus Tree showing similarity and distances between the authors based on the frequencies of 29 MFW. The colors and initials indicate the different projects.

Kim-stylometry-figure-2web

c.   Principal Components Analysis
How are the authors and publications distributed by a Principal Components Analysis (PCA)? The MFWs can be used as variables. Each of these word-variables has its own weighting on the two components of the PCA. According to these word-weightings, the texts are given a score on each component, based on the frequency of each word in these texts. Figure 3 shows how the authors and publications are distributed, and Figure 4 shows the weightings for the word-variables. According to this PCA, Hicks, Kranenburg, Karsdorp, Koolen and Gueret & Scharnhorst are all mapped very close to Sally.
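The relation between word-weightings and text scores can be made concrete with a small sketch. The Python code below is an assumption-laden illustration and not what stylo does internally: it works on the covariance matrix of the word frequencies, finds the two components with simple power iteration, and uses hypothetical function names. Each row of the input is one text, each column the frequency of one word.

```python
def center(rows):
    """Subtract each column's mean so that every word-variable has mean zero."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of the word-variables."""
    n, p = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def multiply(matrix, vector):
    """Matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def leading_eigenvector(matrix, iterations=500):
    """Power iteration: multiply and renormalize until the direction settles."""
    vector = [1.0 + 0.01 * i for i in range(len(matrix))]
    for _ in range(iterations):
        product = multiply(matrix, vector)
        norm = sum(x * x for x in product) ** 0.5
        if norm == 0.0:  # no variance left to explain
            break
        vector = [x / norm for x in product]
    return vector


def pca_two_components(rows):
    """Return (loadings, scores) for the first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    p = len(cov)
    loadings = []
    for _ in range(2):
        v = leading_eigenvector(cov)
        eigenvalue = sum(a * b for a, b in zip(v, multiply(cov, v)))
        loadings.append(v)
        # deflation: remove the component just found before looking for the next
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(p)] for i in range(p)]
    # a text's score on a component is the weighted sum of its centered frequencies
    scores = [
        [sum(x * w for x, w in zip(row, loading)) for loading in loadings]
        for row in centered
    ]
    return loadings, scores
```

The loadings correspond to the kind of word-weightings plotted in Figure 4, and the scores to the positions of the texts in Figure 3. The sign of a component is arbitrary, so a plot may come out mirrored without changing which texts lie close together.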

Figure 3: A PCA showing the plotting of the authors based on the weightings of 29 MFW
Kim-stylometry-figure-3web

Figure 4: A PCA showing the plotting of 29 word-variable weightings
Kim-stylometry-figure-4web

d.   Bootstrap Consensus Tree
In order to prevent cherry picking from a couple of cluster analyses (choosing the one(s) in which a certain number of MFWs and a certain percentage of culling best suit your hypothesis), you can average them. Figure 5 is a Bootstrap Consensus Tree (BCT) that is a mean of ten cluster analyses, in which the number of MFWs varies from 100 to 1000 with an increment of 100, and the culling varies from 0% to 100% with an increment of 20%. This means that the BCT iterates over many different selections of the data in order to filter out the stable elements. You could say this is a meta-dendrogram.
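The idea of iterating over settings instead of trusting a single one can be sketched as follows. Note that this Python snippet is a strong simplification and not a consensus tree: for every combination of MFW count and culling percentage it only records which text is the nearest neighbor of one target text, using a plain Manhattan distance on relative frequencies. The tokenizer, the function names and the order of the steps (culling first, then selecting the MFWs) are my assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase alphabetic strings only."""
    return re.findall(r"[a-z]+", text.lower())


def vocabulary(tokens_by_text, n_mfw, culling):
    """Top n_mfw corpus words among those found in at least culling% of the texts."""
    corpus_counts, document_counts = Counter(), Counter()
    for tokens in tokens_by_text.values():
        corpus_counts.update(tokens)
        document_counts.update(set(tokens))
    threshold = culling / 100 * len(tokens_by_text)
    kept = [w for w, _ in corpus_counts.most_common() if document_counts[w] >= threshold]
    return kept[:n_mfw]


def manhattan(tokens_a, tokens_b, vocab):
    """Sum of absolute differences in relative word frequency."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    return sum(
        abs(counts_a[w] / len(tokens_a) - counts_b[w] / len(tokens_b)) for w in vocab
    )


def nearest_neighbor_votes(texts, target, mfw_grid, culling_grid):
    """Count how often each text is the target's nearest neighbor over all settings."""
    tokens_by_text = {name: tokenize(text) for name, text in texts.items()}
    others = [name for name in tokens_by_text if name != target]
    votes = Counter()
    for n_mfw in mfw_grid:
        for culling in culling_grid:
            vocab = vocabulary(tokens_by_text, n_mfw, culling)
            nearest = min(
                others,
                key=lambda name: manhattan(tokens_by_text[target], tokens_by_text[name], vocab),
            )
            votes[nearest] += 1
    return votes
```

A grid such as range(100, 1001, 100) for the MFWs and range(0, 101, 20) for the culling mirrors the settings described above. A neighbor that collects most of the votes is stable across settings, whereas one that only wins at a single setting may well be a cherry.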

Figure 5: A Bootstrap Consensus Tree showing average similarity of texts based on the frequencies of 100 – 1000 MFW
Kim-stylometry-figure-5web

In this consensus tree almost all authors are now clustered together. Jautze_Measuring, Reinanda_Performance and Ashkpour_thesis are exceptions, as is Koolen’s Thesis, which clusters together with Sally Wyatt’s two publications.
To summarize, the different unsupervised methods each suggested different nearest neighbors of Sally’s. It is now time to try a supervised method in order to attribute Sally’s texts by training a classifier.

4.  Supervised stylometric analysis
For the classification task I apply the Nearest Shrunken Centroid (NSC) method. The training data set consists of at least one publication by each individual author; the test data set contains the remaining publications (single- or co-authored). Table 2 shows the classification results. The overall attribution success is 78.2% with an SD of 8.8%. The best performances reach as high as 88.2% accuracy; these are marked yellow.
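To give an impression of how such a classifier works, here is a simplified Python sketch. It is not the full NSC method and not the implementation used for Table 2: it computes one centroid (mean frequency vector) per author, shrinks each centroid toward the overall centroid with a soft threshold, and assigns a new text to the nearest centroid. The real method additionally standardizes the differences by the within-class variation, which I leave out here. All names and the threshold parameter delta are hypothetical.

```python
def mean_vector(vectors):
    """Column-wise mean of a list of equally long feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def soft_threshold(value, delta):
    """Move value toward zero by delta; anything within delta of zero becomes zero."""
    if value > delta:
        return value - delta
    if value < -delta:
        return value + delta
    return 0.0


def train(training, delta):
    """training maps each author to a list of word-frequency vectors."""
    overall = mean_vector([v for vectors in training.values() for v in vectors])
    centroids = {}
    for author, vectors in training.items():
        centroid = mean_vector(vectors)
        # words on which an author barely differs from the average stop counting
        centroids[author] = [
            o + soft_threshold(c - o, delta) for c, o in zip(centroid, overall)
        ]
    return centroids


def classify(vector, centroids):
    """Return the author whose shrunken centroid is closest (squared Euclidean)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda author: squared_distance(vector, centroids[author]))
```

The shrinkage step is what sets this family of classifiers apart from a plain nearest-centroid classifier: with a larger delta, fewer words contribute to the decision, which tends to make the classifier less sensitive to noisy features.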

Table 2: Classification results in accuracy

tabel-2-featuredweb

The final step is to let the classifier attribute the two publications written by Sally. At almost all the settings (and also at the four with the highest performances: 400 MFW, culling 0% and 20%; 500 MFW, culling at 20%; 700 MFW, culling 20%) the machine attributes Wyatt’s Ethics proofs to Gueret and her Oratie to Koolen.

5.  Conclusion
The different stylometric approaches suggested different nearest neighbors of Sally Wyatt, but there is one author who is suggested four times. To summarize, the first analysis based on the 100 MFW (culled at 0%) suggested Corina Koolen with her publication Conference. Once we threw out the words that are characteristic for a single text or a couple of texts (and only the 29 hard-core function words remained), Ashkan Ashkpour’s Thesis was suggested. The PCA, in which the authors were mapped according to the word-variable weightings of these 29 function words, showed that Jacky Hicks, the duo Christophe Gueret and Andrea Scharnhorst, Folgert Karsdorp, Peter van Kranenburg and again Corina Koolen were mapped very closely to Sally. The mean of ten cluster analyses tells us that again Corina Koolen (with her Thesis) has the most similar writing style. Once we had trained a classification model, both Christophe Gueret and Corina (AGAIN!!) were proposed at almost all the different feature settings. So, this means that we have a winner. Corina, congratulations on your future position!

I would like to conclude with a brief explanation of this result.
First of all, it must be noted that their topics are somewhat similar. To some extent they both explore in what way the hybrid digital environment and increasing digital communication technologies affect humans: as readers (Corina, Thesis and Conference), in everyday life (Sally, Oratie), or as academics (Sally, Ethics Proofs). Furthermore, Corina examines whether the responsibility for constructing textual coherence in humanities research shifts from the text’s author to its reader (Thesis, p. 5), and Sally reflects on the adjustments of traditional academic research in the digital age. So they both focus on topics like digital environments, information and communication technologies, and research. This could explain why their styles are most similar in most of the stylometric analyses, especially because 20% of the MFWs were content words (such as research, text, digital). However, their styles cannot be most similar based on these content words alone, because 80% of the MFWs were still function words, and because authorial fingerprints are mostly formed by the use of a set of function words that is typical for that author. So, apart from the content words that can be explained by the overlapping topics, they must also be most similar in their use of function words. An interesting remark is that Corina is not only Sally’s nearest neighbor stylistically, but also geographically. As far as I know, she is the only eHg member who has studied in Canada, Sally’s motherland. Could that have anything to do with their similarity in style as well? For now, that remains the question.

Acknowledgements
I would like to thank everyone who was willing to cooperate and who provided me with their publications on such short notice.