Change search
ReferencesLink to record
Permanent link

Direct link
Offline Approximate String Matching forInformation Retrieval: An experiment on technical documentation
Jönköping University, School of Engineering, JTH. Research area Information Engineering.
2013 (English)Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
Abstract [en]

Approximate string matching consists in identifying strings as similar even ifthere is a number of mismatch between them. This technique is one of thesolutions to reduce the exact matching strictness in data comparison. In manycases it is useful to identify stream variation (e.g. audio) or word declension (e.g.prefix, suffix, plural).

Approximate string matching can be used to score terms in InformationRetrieval (IR) systems. The benefit is to return results even if query terms doesnot exactly match indexed terms. However, as approximate string matchingalgorithms only consider characters (nor context neither meaning), there is noguarantee that additional matches are relevant matches.

This paper presents the effects of some approximate string matchingalgorithms on search results in IR systems. An experimental research design hasbeen conducting to evaluate such effects from two perspectives. First, resultrelevance is analysed with precision and recall. Second, performance is measuredthanks to the execution time required to compute matches.

Six approximate string matching algorithms are studied. Levenshtein andDamerau-Levenshtein computes edit distance between two terms. Soundex andMetaphone index terms based on their pronunciation. Jaccard similarity calculatesthe overlap coefficient between two strings.

Tests are performed through IR scenarios regarding to different context,information need and search query designed to query on a technicaldocumentation related to software development (man pages from Ubuntu). Apurposive sample is selected to assess document relevance to IR scenarios andcompute IR metrics (precision, recall, F-Measure).

Experiments reveal that all tested approximate matching methods increaserecall on average, but, except Metaphone, they also decrease precision. Soundexand Jaccard Similarity are not advised because they fail on too many IR scenarios.Highest recall is obtained by edit distance algorithms that are also the most timeconsuming. Because Levenshtein-Damerau has no significant improvementcompared to Levenshtein but costs much more time, the last one is recommendedfor use with a specialised documentation.

Finally some other related recommendations are given to practitioners toimplement IR systems on technical documentation.

Place, publisher, year, edition, pages
2013. , 58 p.
Keyword [en]
Algorithm comparison, Approximate string matching, Information retrieval, Offline string matching, Overlap coefficient, Phonetic indexation, String distance, String metric, String searching algorithm
National Category
Computer Systems
URN: urn:nbn:se:hj:diva-22566OAI: diva2:663931
Subject / course
JTH, Informatics
Available from: 2013-12-06 Created: 2013-11-13 Last updated: 2013-12-06Bibliographically approved

Open Access in DiVA

Final report(761 kB)265 downloads
File information
File name FULLTEXT01.pdfFile size 761 kBChecksum SHA-512
Type fulltextMimetype application/pdf

By organisation
JTH. Research area Information Engineering
Computer Systems

Search outside of DiVA

GoogleGoogle Scholar
Total: 265 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

Total: 591 hits
ReferencesLink to record
Permanent link

Direct link