Using Elasticsearch for full-text searches on unstructured data
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology.
2019 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

To search large amounts of data effectively, it is not viable to simply scan through all of the data. A well-established solution to this problem is to generate an index based on the data. This report compares different libraries for building such an index and describes a prototype that enables full-text searches on an existing database. The libraries considered include Elasticsearch, Solr, Sphinx and Xapian. The database in question consists of audit logs generated by software for online management of financial trade. The author implemented the prototype using the open-source search engine Elasticsearch. Besides performing searches in a reasonable time, the implementation allows documents within the index to be fully removed without causing notable disturbances to the overall structure. The author defined a pattern analyzer for Elasticsearch to support the Swedish alphabet and accented letters. The audit log database that this project concerns can contain personal information. For this reason, the General Data Protection Regulation, an EU law regarding personal information, was considered during the project. The implementation described in this report is meant to serve as a starting point for making the finding and retrieval of personal information run more smoothly. The author also made sure that the deletions performed can be made final, in order to comply with the General Data Protection Regulation. The implementation was tested on a database of 708 megabytes containing unstructured data. Searching the generated index for two search terms, a first name and a last name, resulted in an average return time of 11.5 ms when looking for exact matches and 59.3 ms when a small level of misspelling was allowed. The measurements suggest that a solution using Elasticsearch is suitable for the presented problem.
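The pattern analyzer and the two kinds of name search mentioned in the abstract could look roughly like the sketch below. It is an illustration under stated assumptions, not the thesis's actual configuration: the analyzer name `audit_text`, the field name `message`, the separator regex, the fuzziness value of 1, and the typeless mapping layout of Elasticsearch 7 and later are all hypothetical choices. The request bodies are built as plain Python dictionaries, so no running cluster or client library is needed to inspect them.

```python
import re

# Assumption: a token separator is any run of characters that are not ASCII
# letters, digits, or Latin-1 accented letters (which covers å, ä, ö, é, ü).
TOKEN_SEPARATOR = r"[^0-9A-Za-zÀ-ÖØ-öø-ÿ]+"


def index_settings():
    """Index body declaring a pattern analyzer (hypothetical name 'audit_text')."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "audit_text": {
                        "type": "pattern",           # built-in pattern analyzer
                        "pattern": TOKEN_SEPARATOR,  # regex matching separators
                        "lowercase": True,
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                # 'message' is a hypothetical field holding the audit log text.
                "message": {"type": "text", "analyzer": "audit_text"}
            }
        },
    }


def name_query(first, last, fuzzy=False):
    """Match query requiring both names; optionally tolerates one edit per term."""
    match = {"query": f"{first} {last}", "operator": "and"}
    if fuzzy:
        match["fuzziness"] = 1  # assumed reading of "a small level of misspelling"
    return {"query": {"match": {"message": match}}}


def tokenize(text):
    """Local approximation of the analyzer's output, for quick sanity checks."""
    return [t.lower() for t in re.split(TOKEN_SEPARATOR, text) if t]


if __name__ == "__main__":
    print(tokenize("Åsa Lindqvist-Öberg, order #42"))
    print(name_query("Åsa", "Öberg", fuzzy=True))
```

With the default separator `\W+` replaced by an explicit character class, Swedish and accented letters stay inside tokens instead of splitting them, which is the behaviour the abstract attributes to the custom analyzer.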

Place, publisher, year, edition, pages
2019. , p. 32
Series
IT ; 19009
National Category
Engineering and Technology
Identifiers
URN: urn:nbn:se:uu:diva-395654
OAI: oai:DiVA.org:uu-395654
DiVA, id: diva2:1363672
Educational program
Bachelor Programme in Computer Science
Available from: 2019-10-22. Created: 2019-10-22. Last updated: 2019-10-22. Bibliographically approved.

Open Access in DiVA

Fulltext (736 kB), 30 downloads
File information
File name: FULLTEXT01.pdf
File size: 736 kB
Checksum (SHA-512): 9583db8b300717fc3ae2e078b14178ce38b9f38060f09f0d7980cbde573d99fe085051034ac08c691cfb98c63737df6b1da883c52fa0a113e171e4eeafdfdb89
Type: fulltext
Mimetype: application/pdf
