Supporting the Exploration of a Corpus of 17th-Century Scholarly Correspondences by Topic Modeling.
2011 (English). Conference paper (Refereed)
This paper deals with the application of topic modeling to a corpus of 17th-century scholarly correspondences built up
by the CKCC project. The topic modeling approaches considered are latent Dirichlet allocation (LDA), latent semantic
analysis (LSA), and random indexing (RI). After describing the corpus and the topic modeling approaches, we present an
experiment for the quantitative evaluation of the performance of the various topic modeling approaches in reproducing
human-labeled words in a subset of the corpus. In our experiments random indexing shows the best performance, with
scope for further improvement. Next we discuss the role of topic modeling in the CKCC Epistolarium, the virtual research
environment that is being developed for exploring and analysing the CKCC corpus. The key feature of topic modeling is
its ability to calculate similarities between words and texts. In an example we illustrate how such an approach may yield
results that transcend a regular text search.
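The similarity computation mentioned in the abstract can be illustrated with a minimal random indexing sketch. This is not the implementation used in the paper: the toy documents, the function names, the vector dimension, and the choice of whole documents as contexts are all assumptions made for illustration only.

```python
import math
import random
from collections import defaultdict


def index_vector(dim, nonzero, rng):
    """Sparse ternary random vector: a few +1/-1 entries, the rest 0."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1, 1))
    return vec


def random_indexing(docs, dim=300, nonzero=10, seed=0):
    """Word vector = sum of the index vectors of the documents the word occurs in.

    Assumption: each whole document is one context; other setups use
    sliding word windows instead.
    """
    rng = random.Random(seed)
    word_vecs = defaultdict(lambda: [0] * dim)
    for doc in docs:
        doc_index = index_vector(dim, nonzero, rng)
        for word in doc.lower().split():
            vec = word_vecs[word]
            for i, value in enumerate(doc_index):
                vec[i] += value
    return dict(word_vecs)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy documents invented for illustration; not taken from the CKCC corpus.
docs = [
    "telescope lens observation of saturn",
    "lens grinding for the telescope",
    "observation of saturn rings",
    "theology dispute in leiden",
    "dispute about theology and scripture",
]

vectors = random_indexing(docs)
sim_related = cosine(vectors["telescope"], vectors["lens"])
sim_unrelated = cosine(vectors["telescope"], vectors["theology"])
```

In this toy setup, "telescope" and "lens" occur in exactly the same documents, so their vectors coincide, while "telescope" and "theology" share no document and end up nearly orthogonal. A search built on such vectors can surface related letters even when they do not contain the literal query word.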
Place, publisher, year, edition, pages
University of Copenhagen, 2011.
Keywords: topic modeling, latent semantic indexing, random projection
Computer and Information Science; Language Technology (Computational Linguistics)
Research subject: Library and Information Science
Identifiers: URN: urn:nbn:se:hb:diva-6661; Local ID: 2320/9689; OAI: oai:DiVA.org:hb-6661; DiVA: diva2:887360
SDH 2011 Supporting Digital Humanities: Answering the unaskable