Ändra sökning
RefereraExporteraLänk till posten
Permanent länk

Direktlänk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf
DRESS: dimensionality reduction for efficient sequence search
Visa övriga samt affilieringar
Antal upphovsmän: 5
2015 (Engelska)Ingår i: Data mining and knowledge discovery, ISSN 1384-5810, E-ISSN 1573-756X, Vol. 29, nr 5, 1280-1311 s.Artikel i tidskrift (Refereegranskat) Published
Abstract [en]

Similarity search in large sequence databases is a problem ubiquitous in a wide range of application domains, including searching biological sequences. In this paper we focus on protein and DNA data, and we propose a novel approximate method method for speeding up range queries under the edit distance. Our method works in a filter-and-refine manner, and its key novelty is a query-sensitive mapping that transforms the original string space to a new string space of reduced dimensionality. Specifically, it first identifies the most frequent codewords in the query, and then uses these codewords to convert both the query and the database to a more compact representation. This is achieved by replacing every occurrence of each codeword with a new letter and by removing the remaining parts of the strings. Using this new representation, our method identifies a set of candidate matches that are likely to satisfy the range query, and finally refines these candidates in the original space. The main advantage of our method, compared to alternative methods for whole sequence matching under the edit distance, is that it does not require any training to create the mapping, and it can handle large query lengths with negligible losses in accuracy. Our experimental evaluation demonstrates that, for higher range values and large query sizes, our method produces significantly lower costs and runtimes compared to two state-of-the-art competitor methods.

Ort, förlag, år, upplaga, sidor
2015. Vol. 29, nr 5, 1280-1311 s.
Nyckelord [en]
Similarity search, Alphabet reduction, Biological sequences
Nationell ämneskategori
Data- och informationsvetenskap
Identifikatorer
URN: urn:nbn:se:su:diva-120892DOI: 10.1007/s10618-015-0413-2ISI: 000359996900008OAI: oai:DiVA.org:su-120892DiVA: diva2:860541
Tillgänglig från: 2015-10-12 Skapad: 2015-09-18 Senast uppdaterad: 2018-01-11Bibliografiskt granskad

Open Access i DiVA

Fulltext saknas

Övriga länkar

Förlagets fulltext

Sök vidare i DiVA

Av författaren/redaktören
Papapetrou, Panagiotis
Av organisationen
Institutionen för data- och systemvetenskap
I samma tidskrift
Data mining and knowledge discovery
Data- och informationsvetenskap

Sök vidare utanför DiVA

GoogleGoogle Scholar

doi
urn-nbn

Altmetricpoäng

doi
urn-nbn
Totalt: 18 träffar
RefereraExporteraLänk till posten
Permanent länk

Direktlänk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf