The study of literature has traditionally been a qualitative endeavour. Researchers have generally analysed a small number of canonised works, examining them in great detail, with “close reading” as the typical method of choice. Franco Moretti’s term “distant reading” and the rise of digital humanities and various text mining methods have, at least partly, changed this.
Moretti and his followers in a sense turned literary scholarship upside down by focusing on units much larger or much smaller than individual works: “devices, themes, tropes – or genres and systems”, as Moretti put it. Instead of reading the most well-known works of a given period or genre, it suddenly became possible to read almost all published literature and draw conclusions of a different kind (though this also meant handing over much of the reading process to computers).
Most literary scholars engaged in text mining have used very large corpora. The general rule seems to have been “the bigger the data, the better”. This certainly holds when it comes to demonstrating statistical patterns. However, the bigger the material, the greater the distance between the machine-generated results and the qualitative analysis of those results. If a corpus consists of thousands of books, it is simply not possible to know its contents very well. This is at once the strength and the weakness of the text mining research on literature conducted in recent years.
In my opinion, debates about the pros and cons of text mining methods in the study of literature have been far too black and white. Instead of either/or (big data or canonised works, distant reading or close reading), I argue for a position in between. My field of study is contemporary Swedish crime fiction. In an ongoing study I use a corpus of 116 Swedish crime novels published 1998–2015, written by the most well-known and commercially successful authors of the period. Hence, I analyse neither the entire genre nor only its most renowned novels, but roughly ten per cent of all Swedish crime fiction published in the period (the top decile). This choice makes it possible both to get the bigger picture and to be very familiar with the material.
Moreover, I approach this corpus through a combination of methods, some computer-aided and digital (word frequencies, topic modelling), others more traditional and analogue (shallow, thematically oriented readings of the entire corpus). Together these methods provide solid knowledge of the genre that is both quantitative and qualitative.
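To make the two computer-aided steps concrete, the sketch below shows one common way such an analysis can be set up in Python with scikit-learn. It is illustrative only, not the study’s actual pipeline: the corpus directory, the stop-word stub, the number of topics, and the frequency thresholds are all placeholder assumptions.

```python
# A minimal sketch of corpus-level word frequencies and topic modelling.
# Assumes a hypothetical layout of one plain-text file per novel in a
# directory called "corpus"; all parameters are illustrative choices.
from pathlib import Path

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = [p.read_text(encoding="utf-8") for p in sorted(Path("corpus").glob("*.txt"))]

# A real study would use a proper Swedish stop-word list; this short
# stub merely stands in for one.
swedish_stopwords = ["och", "att", "det", "som", "en", "på", "är", "av"]

# Document-term matrix; min_df=5 drops words appearing in fewer than
# five novels, a common way to reduce noise in a ~100-novel corpus.
vectorizer = CountVectorizer(stop_words=swedish_stopwords, min_df=5)
dtm = vectorizer.fit_transform(texts)
vocab = vectorizer.get_feature_names_out()

# Word frequencies summed across the whole corpus.
totals = dtm.sum(axis=0).A1
for i in totals.argsort()[::-1][:20]:
    print(vocab[i], totals[i])

# Topic modelling with LDA; 20 topics is an arbitrary illustrative choice.
lda = LatentDirichletAllocation(n_components=20, random_state=0)
lda.fit(dtm)
for t, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[::-1][:10]]
    print(f"Topic {t}: {', '.join(top_words)}")
```

With a semi-big corpus of this size, the printed topic keywords can then be checked directly against one’s own reading of the novels, which is precisely the qualitative control that thousands-of-books corpora make impossible.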
In my presentation I argue that such a combination of methods, applied to semi-big data or corpora, can be fruitful for many literary studies, across materials from different epochs and genres. Literary scholars should begin to make use of this “space between” the very distant and the very close, and let computer-aided methods serve as a helping hand rather than an end in themselves.