Compression of finite-state automata through failure transitions
Umeå University, Faculty of Science and Technology, Department of Computing Science.
2014 (English). In: Theoretical Computer Science, ISSN 0304-3975, E-ISSN 1879-2294, Vol. 557, p. 87-100. Article in journal (Refereed). Published.
Abstract [en]

Several linear-time algorithms for automata-based pattern matching rely on failure transitions for efficient back-tracking. Like epsilon transitions, failure transitions do not consume input symbols, but unlike them, they may only be taken when no other transition is applicable. At a semantic level, this conveniently models catch-all clauses and allows for compact language representation.
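
The behaviour of failure transitions described above can be illustrated with a minimal sketch. The automaton, the names `DELTA`, `FAILURE`, `step` and `accepts`, and the example language (words over {a, b} that end in "ab") are all assumptions made for illustration; none of them is taken from the article.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical failure automaton over {a, b} accepting words ending in "ab".
# Regular transitions consume a symbol; failure transitions consume nothing.
DELTA: Dict[Tuple[int, str], int] = {(0, "a"): 1, (0, "b"): 0, (1, "b"): 2}
FAILURE: Dict[int, int] = {1: 0, 2: 0}
START, FINALS = 0, {2}


def step(state: int, symbol: str) -> Optional[int]:
    """Return the state reached on `symbol`, or None if the run gets stuck.

    A failure transition is followed only when the current state has no
    regular transition on `symbol`, and it does not consume the symbol.
    """
    visited: Set[int] = set()
    current: Optional[int] = state
    while current is not None and current not in visited:
        visited.add(current)
        target = DELTA.get((current, symbol))
        if target is not None:
            return target          # a regular transition applies: take it
        current = FAILURE.get(current)  # otherwise fall back, symbol unread
    return None


def accepts(word: Iterable[str]) -> bool:
    """Run the automaton on `word` and report whether it ends in a final state."""
    state: Optional[int] = START
    for symbol in word:
        state = step(state, symbol)
        if state is None:
            return False
    return state in FINALS
```

Here state 2 has no regular transitions at all: it defers to state 0 on every symbol, which is the "catch-all" reading of a failure transition.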

This work investigates the transition-reduction problem for deterministic finite-state automata (DFA). The input is a DFA A and an integer k. The question is whether k or more transitions can be saved by replacing regular transitions with failure transitions. We show that while the problem is NP-complete, there are approximation techniques and heuristics that mitigate the computational complexity. We conclude by demonstrating the computational difficulty of two related minimisation problems, thereby cancelling the ongoing search for efficient algorithms.
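
To make the decision problem concrete, the sketch below counts how many transitions a given candidate assignment of failure transitions would save on a complete DFA. It only checks a proposed solution; it does not search for a good one. The function names, the example DFA, and the convention that each failure transition itself counts as one transition are assumptions for illustration and may differ from the definitions used in the article.

```python
from typing import Dict, Iterable, Tuple

Delta = Dict[Tuple[int, str], int]

# Hypothetical complete DFA over {a, b} accepting words ending in "ab".
FULL_DELTA: Delta = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}


def is_acyclic(failure: Dict[int, int]) -> bool:
    """Check that following failure transitions never loops."""
    for start in failure:
        seen = {start}
        state = failure[start]
        while state in failure:
            if state in seen:
                return False
            seen.add(state)
            state = failure[state]
    return True


def transitions_saved(delta: Delta, failure: Dict[int, int],
                      alphabet: Iterable[str]) -> int:
    """Net number of transitions saved by the candidate `failure` map.

    With an acyclic failure map on a complete DFA, the regular transition
    (q, a) can be dropped exactly when q's failure target p already moves
    to the same state on a. Each failure transition costs one transition
    (an assumed counting convention).
    """
    if not is_acyclic(failure):
        raise ValueError("failure transitions must not form a cycle")
    symbols = list(alphabet)
    dropped = sum(
        1
        for q, p in failure.items()
        for a in symbols
        if delta[(q, a)] == delta[(p, a)]
    )
    return dropped - len(failure)


def saves_at_least(delta: Delta, failure: Dict[int, int],
                   alphabet: Iterable[str], k: int) -> bool:
    """Answer the 'k or more transitions saved' question for one candidate."""
    return transitions_saved(delta, failure, alphabet) >= k
```

For the example DFA, the candidate `{1: 0, 2: 0}` drops three regular transitions and adds two failure transitions, for a net saving of one. Checking a candidate is cheap; the hardness result in the abstract concerns finding a candidate that reaches a given k.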

Place, publisher, year, edition, pages
Elsevier, 2014. Vol. 557, 87-100 p.
Keyword [en]
failure automata, pattern matching, automata minimisation
National Category
Computer Science
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-93329
DOI: 10.1016/j.tcs.2014.09.007
ISI: 000343784800008
OAI: oai:DiVA.org:umu-93329
DiVA: diva2:747628
Funder
Swedish Research Council, 621-2011-6080
Available from: 2014-09-17. Created: 2014-09-17. Last updated: 2017-12-05. Bibliographically approved.
In thesis
1. A novel approach to text classification
2017 (English). Doctoral thesis, comprehensive summary (Other academic).
Abstract [en]

This thesis explores the foundations of text classification, using both empirical and deductive methods, with a focus on author identification and syntactic methods. We strive for a thorough theoretical understanding of what affects the effectiveness of classification in general. 

To begin with, we systematically investigate the effects of some parameters on the accuracy of author identification. How is the accuracy affected by the number of candidate authors, and the amount of data per candidate? Are there differences in how methods react to the changes in parameters? Using the same techniques, we see indications that methods previously thought to be topic-independent might not be so, but that syntactic methods may be the best option for avoiding topic dependence. This means that previous studies may have overestimated the power of lexical methods. We also briefly look for ways of spotting which particular features might be the most effective for classification. Apart from author identification, we apply similar methods to identifying properties of the author, including age and gender, and attempt to estimate the number of distinct authors in a text sample. In all cases, the techniques are proven viable if not overwhelmingly accurate, and we see that lexical and syntactic methods give very similar results. 

In the final parts, we see some results of automata theory that can be of use for syntactic analysis and classification. First, we generalise a known algorithm for finding a list of the best-ranked strings according to a weighted automaton, to doing the same with trees and a tree automaton. This result can be of use for speeding up parsing, which often runs in several steps, where each step needs several trees from the previous as input. Second, we use a compressed version of deterministic finite automata, known as failure automata, and prove that finding the optimal compression is NP-complete, but that there are efficient algorithms for finding good approximations. Third, we find and prove the derivatives of regular expressions with cuts. Derivatives are an operation on expressions to calculate the remaining expression after reading a given symbol, and cuts are an extension to regular expressions found in many programming languages. Together, these findings may be able to improve on the syntactic analysis which we have seen is a valuable tool for text classification.

Place, publisher, year, edition, pages
Umeå: Umeå universitet, 2017. 176 p.
Series
Report / UMINF, ISSN 0348-0542 ; 17.16
Keyword
Text classification, natural language processing, automata
National Category
Language Technology (Computational Linguistics); Computer Science
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-138917
ISBN: 978-91-7601-740-1
Public defence
2017-09-29, N430, Naturvetarhuset, Umeå, 13:00 (English)
Available from: 2017-09-04. Created: 2017-09-03. Last updated: 2017-09-04. Bibliographically approved.

Open Access in DiVA

fulltext (755 kB), 171 downloads
File information
File name: FULLTEXT01.pdf
File size: 755 kB
Checksum (SHA-512): 9923c373f1457fa327c2dc9353989d88aef920c2a57068a966d593280bfbec35d68b05ae75faf023669079d4d6893bbf361dfd0959834bd5c2405e3b88e1aef0
Type: fulltext
Mimetype: application/pdf

Authors
Björklund, Henrik; Björklund, Johanna; Zechner, Niklas