An Object-Oriented Data Analysis approach for text population
KTH, School of Engineering Sciences (SCI), Mathematics (Dept.), Mathematical Statistics.
2018 (English)
Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Objektorienterad dataanalys av textpopulationer (Swedish)
Abstract [en]

As more and more digital text-valued data become available, the need arises to cluster, classify and study them. In this thesis we develop statistical tools for null hypothesis testing, clustering and classification of text-valued data within the framework of Object-Oriented Data Analysis.

The project covers semantic methods for representing texts, comparisons between representations, distances for such representations, and the performance of permutation tests. The main methods compared are the Vector Space Model and topic models. More precisely, this thesis provides an algorithm for computing permutation tests at the document or sentence level, in order to test whether two texts are equal in distribution under different representations and distances. Lastly, we describe the study of texts from a syntactic point of view, representing their structure with trees.
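
To make the sentence-level permutation test concrete, here is a minimal sketch in Python. It is not the algorithm from the thesis: the bag-of-words (Vector Space Model) representation, the cosine distance between group mean vectors as test statistic, and all function names are illustrative assumptions.

```python
import math
import random
from collections import Counter


def bag_of_words(sentence, vocabulary):
    """Term-frequency vector of one sentence over a fixed vocabulary."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def permutation_test(sentences_a, sentences_b, n_permutations=999, seed=0):
    """Approximate p-value for H0: the sentences of both texts share one distribution.

    Assumed statistic (hypothetical choice): cosine distance between the
    mean bag-of-words vectors of the two groups of sentences.
    """
    pooled = sentences_a + sentences_b
    vocabulary = sorted({w for s in pooled for w in s.lower().split()})
    vectors = [bag_of_words(s, vocabulary) for s in pooled]
    n_a = len(sentences_a)

    def statistic(vs):
        return cosine_distance(mean_vector(vs[:n_a]), mean_vector(vs[n_a:]))

    observed = statistic(vectors)
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_permutations):
        shuffled = vectors[:]
        rng.shuffle(shuffled)  # random relabelling of sentences under H0
        if statistic(shuffled) >= observed:
            exceed += 1
    # Add-one correction: the observed labelling counts as one permutation.
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value
```

Calling `permutation_test(sentences_a, sentences_b)` with two lists of sentence strings returns the observed distance and an approximate p-value. Swapping in another representation (for example topic proportions) or another distance only requires changing `bag_of_words` and `cosine_distance`.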

Abstract [sv] (English translation)

With the ever-increasing availability of text-valued data, the need to cluster and classify such data grows. In this work we develop statistical tools for hypothesis testing, clustering and classification of text-valued data within the framework of object-oriented data analysis.

The project includes research on semantic methods for representing texts, comparisons between representations, distances for such representations, and the performance of permutation tests. The main methods compared are vector space models and topic models. More specifically, this work provides an algorithm for permutation tests, at the document or sentence level, for testing the hypothesis that two texts have the same distribution with respect to different representations and distances. Finally, a tree representation is used to describe the study of texts from a syntactic point of view.

Place, publisher, year, edition, pages
2018.
Series
TRITA-SCI-GRU ; 2018:014
National Category
Mathematical Analysis
Identifiers
URN: urn:nbn:se:kth:diva-223244
OAI: oai:DiVA.org:kth-223244
DiVA, id: diva2:1183207
External cooperation
Politecnico di Milano
Subject / course
Mathematical Statistics
Educational program
Master of Science in Engineering - Engineering Physics
Supervisors
Examiners
Available from: 2018-02-16
Created: 2018-02-16
Last updated: 2018-02-16
Bibliographically approved

Open Access in DiVA

fulltext (1104 kB), 37 downloads
File information
File name: FULLTEXT01.pdf
File size: 1104 kB
Checksum (SHA-512): ac8ab08b5b3915f4d9568af43b8614b21fbfcae944fa2b0f03a11fdd67a8373c12c843ed391c85cb536fe5d2dd943727f1aaf4f9b693b6550a3eff88cd56997c
Type: fulltext
Mimetype: application/pdf

Total: 132 hits