Defining Tags by Linking to Knowledge Bases
Norwegian University of Science and Technology, Faculty of Information Technology, Mathematics and Electrical Engineering, Department of Computer and Information Science.
2014 (English). Master's thesis (student thesis)
Abstract [en]

This thesis examines how image searches based on tags can be expanded automatically using the definitions of terms found in public knowledge bases. To this end, we try to extract terms related to a query, a process known as feature extraction. In this thesis, the text collection on which feature extraction is performed consists of text retrieved from public knowledge bases using the original query. In other words, the program first retrieves related documents and then picks out related terms using either Chi-Squared or an approach I have coined "Neighbouring Terms" (NT). The latter is much quicker to process and may give good precision in term extraction despite not requiring such demanding processing beforehand.
The thesis also examines several variables in this kind of process to find the best configuration for both Chi-Squared and NT. Because the automatic term extraction works on a limited number of articles, one question is how many articles are needed to get the best results. There are also several similarity models to consider, so the thesis compares the results obtained with the Vector Space model, Okapi BM25 and the Language Model. Other variables examined are whether term pre-processing, such as stop word removal and stemming, is beneficial; whether searching for abstracts by their title or by their contents gives better results; and how many terms a query can be expanded with before it loses too much relatedness.
To evaluate the terms suggested by these methods, the thesis reports P@n values for 20 queries and uses metrics such as MAP (Mean Average Precision) to summarise the results of each approach. To avoid biased evaluation, we also conducted a user survey in which 32 people rated how related the terms suggested by the system are to a given query.
The main conclusion is that NT does run faster than Chi-Squared but that, although results varied, the average precision values favoured Chi-Squared. That said, Chi-Squared did not perform much better, and with future improvements NT could prove a viable way of automatically generating semantically related terms without heavy processing.
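As an illustration of the evaluation metrics named in the abstract, the following is a minimal sketch of P@n and MAP over ranked lists of suggested terms. It is not taken from the thesis: the function names and the sample relevance judgements are hypothetical, and average precision is normalised here by the number of relevant terms that actually appear in the list, which is one common convention and may differ from the one the thesis uses.

```python
from typing import Sequence


def precision_at_n(relevant: Sequence[bool], n: int) -> float:
    """Fraction of the top-n suggested terms that were judged relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sum(relevant[:n]) / n


def average_precision(relevant: Sequence[bool]) -> float:
    """Mean of P@k taken at each rank k that holds a relevant term.

    Assumption: normalised by the number of relevant terms in the list.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


def mean_average_precision(runs: Sequence[Sequence[bool]]) -> float:
    """Average of the per-query average precision values."""
    if not runs:
        return 0.0
    return sum(average_precision(run) for run in runs) / len(runs)


# Hypothetical judgements for two queries, in ranked order of suggestion.
query_a = [True, False, True, True, False]
query_b = [False, True]

p_at_3 = precision_at_n(query_a, 3)
map_score = mean_average_precision([query_a, query_b])
print(f"P@3 = {p_at_3:.3f}, MAP = {map_score:.3f}")
```

With one such list per query, the same functions could be applied to each of the 20 queries and to each extraction method, giving a single MAP figure per approach for comparison.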

Place, publisher, year, edition, pages
Institutt for datateknikk og informasjonsvitenskap, 2014. 136 p.
URN: urn:nbn:no:ntnu:diva-24813
Local ID: ntnudaim:8511
OAI: diva2:721271
Available from: 2014-06-03. Created: 2014-06-03. Last updated: 2014-06-03. Bibliographically approved.

Open Access in DiVA

fulltext (2060 kB), 351 downloads
File information
File name: FULLTEXT01.pdf
File size: 2060 kB
Checksum: SHA-512
Type: fulltext
Mimetype: application/pdf
cover (1210 kB), 11 downloads
File information
File name: COVER01.pdf
File size: 1210 kB
Checksum: SHA-512
Type: cover
Mimetype: application/pdf
attachment (23 kB), 5 downloads
File information
File name: ATTACHMENT01.zip
File size: 23 kB
Checksum: SHA-512
Type: attachment
Mimetype: application/zip


Total: 351 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.
