A difference analysis method for detecting differences between similar documents
KTH, School of Computer Science and Communication (CSC).
2017 (English)
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
En differens-analysmetod för att upptäcka skillnader mellan liknande dokument (Swedish)
Abstract [en]

Similarity analysis of documents is a well-studied field. This project focuses instead on the opposite concept: how can the differences between documents be defined and distinguished? It investigates whether such differences can be detected and quantified based on their semantic qualities. We propose a method for quantifying differences that applies tf-idf-based models with analysis methods for lemmatization and synonym extraction, together with utility ranking algorithms. The method is implemented and tested. The results show that the method has potential, but further studies are required to fully evaluate to what extent it could be of practical use. Such a method could nevertheless bring significant benefits in several fields where automatic difference detection could replace error-prone manual labor in document management, and it could serve other purposes, such as providing automatically generated difference summaries.
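To make the tf-idf component of the abstract concrete, here is a minimal sketch of how term-level differences between two similar documents could be scored. It is an illustration under stated assumptions, not the thesis's implementation: the function names (`tokenize`, `tfidf`, `rank_differences`) are hypothetical, the idf uses a common smoothed form, and the lemmatization, synonym extraction, and utility ranking steps mentioned in the abstract are left out.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-letters. A fuller pipeline would
    # lemmatize and merge synonyms here (omitted in this sketch).
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()


def tfidf(docs):
    """Return one {term: weight} dict per document, using a smoothed idf."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks) or 1
        weights.append({
            t: (c / total) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return weights


def rank_differences(doc_a, doc_b, top=5):
    """Rank terms by the absolute gap between their tf-idf weights
    in the two documents (assumed scoring rule, for illustration)."""
    wa, wb = tfidf([doc_a, doc_b])
    gaps = {t: abs(wa.get(t, 0.0) - wb.get(t, 0.0)) for t in set(wa) | set(wb)}
    # Largest gap first; ties are broken alphabetically for determinism.
    return sorted(gaps.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


if __name__ == "__main__":
    a = "The pump must be serviced every month"
    b = "The pump must be serviced every year"
    for term, gap in rank_differences(a, b, top=3):
        print(f"{term}: {gap:.3f}")
```

In this toy pair, terms shared by both documents receive identical weights and cancel out, so only the terms unique to one document get a non-zero gap. A corpus-level idf computed over more than two documents would change the weights but not the overall idea.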

Abstract [sv]

Likhetsanalys mellan dokument är ett välutforskat område. Med fokus istället på motsatsen, hur kan vi försöka definiera och särskilja skillnaderna mellan dokument? Detta projekt försöker undersöka om skillnader mellan dokument kan detekteras samt kvantifieras baserat på deras semantiska kvalitéer. Vi föreslår en metod för kvantifiering av skillnader genom att applicera tf-idf baserade modeller tillsammans med analysmetoder för lemmatisering och synonymextrahering, i kombination med utilitetsrankningsalgoritmer. Metoden implementeras och testas. Resultaten visar att metoden har potential men att det krävs ytterligare studier för att fullt ut avgöra till vilken grad den skulle kunna vara praktiskt användbar. En sådan metod skulle dock kunna erbjuda stora fördelar för ett flertal olika discipliner, där automatisk skillnadsdetektering skulle kunna ersätta felbenägen manuellt arbete gällande dokumentationshantering, samt också fylla andra förmånliga syften som t.ex. att kunna erbjuda automatgenererade skillnadssammanfattningar.

Place, publisher, year, edition, pages
2017.
Keyword [en]
difference, analysis, similar, document, relative, utility
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-219566
OAI: oai:DiVA.org:kth-219566
DiVA, id: diva2:1163624
Educational program
Master of Science in Engineering - Computer Science and Technology
Supervisors
Examiners
Available from: 2017-12-13
Created: 2017-12-07
Last updated: 2018-01-13
Bibliographically approved

Open Access in DiVA

fulltext (923 kB), 12 downloads
File information
File name: FULLTEXT01.pdf
File size: 923 kB
Checksum (SHA-512): 916f2b544e322e021f1734a8d40d52f74914cf25a7b01168576f2758aa48961003e263eba288ceb3397c66be980547f592480a686c09dd41d5d74832e1ad0fb0
Type: fulltext
Mimetype: application/pdf

Total: 12 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 22 hits