Document and Image Classification with Topic Ngram Model
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Latent Dirichlet Allocation (LDA) is a popular probabilistic model for information retrieval. Many extended models based on LDA have been introduced during the past 10 years. In LDA, a data point is represented as a bag (multiset) of words. In the text case, a word is a regular text word, but other types of data can also be represented as words (e.g. visual words). Due to the bag-of-words assumption, the original LDA neglects the structure of the data, i.e., all the relationships between words, which leads to information loss. Yet these relationships, including spatial ones, are important and useful. To explore their importance, we focus on an extension of LDA called the Topic Ngram Model, which models the relationships among adjacent words. In this thesis, we first implement the model and use it for text classification. Furthermore, we propose a 2D extension, which enables us to model spatial relationships of features in images.
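The information loss caused by the bag-of-words assumption can be illustrated with a minimal sketch. It is not the Topic Ngram Model itself, only a comparison of two representations; the function names `bag_of_words` and `bigrams` and the two example documents are assumptions made for illustration and do not come from the thesis.

```python
from collections import Counter

def bag_of_words(tokens):
    """Multiset of words: word order is discarded."""
    return Counter(tokens)

def bigrams(tokens):
    """Multiset of adjacent word pairs: local order is retained."""
    return Counter(zip(tokens, tokens[1:]))

# Two hypothetical documents with the same words in a different order.
doc_a = "dog bites man".split()
doc_b = "man bites dog".split()

# Under the bag-of-words view the two documents are indistinguishable.
same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)

# Counting adjacent pairs separates them again.
same_bigrams = bigrams(doc_a) == bigrams(doc_b)

print(same_bag, same_bigrams)
```

The two documents share one bag of words but have different adjacent-pair counts, which is the kind of structure a model of adjacent words can exploit and a pure bag-of-words model cannot.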
Identifiers
URN: urn:nbn:se:kth:diva-155771
OAI: oai:DiVA.org:kth-155771
DiVA: diva2:762932