Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
A Web Scraper For Forums: Navigation and text extraction methods
KTH, School of Information and Communication Technology (ICT).
KTH, School of Information and Communication Technology (ICT).
2017 (English)Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
Abstract [en]

Web forums are a popular way of exchanging information and discussing various topics. These websites usually have a special structure, divided into boards, threads and posts. Although the structure might be consistent across forums, the layout of each forum is different. The way a web forum presents the user posts is also very different from how a news website presents a single piece of information. All of this makes the navigation and extraction of text a hard task for web scrapers.

The focus of this thesis is the development of a web scraper specialized in forums. Three different methods for text extraction are implemented and tested before choosing the most appropriate method for the task. The methods are Word Count, Text-Detection Framework and Text-to-Tag Ratio. The handling of link duplicates is also considered and solved by implementing a multi-layer bloom filter. The thesis is conducted applying a qualitative methodology. The results indicate that the

Text-to-Tag Ratio has the best overall performance and gives the most desirable result in web forums. Thus, this was the selected methods to keep on the final version of the web scraper.

Abstract [sv]

Webforum är ett populärt sätt att utbyta information och diskutera olika ämnen. Dessa webbplatser har vanligtvis en särskild struktur, uppdelad i startsida, trådar och inlägg. Även om strukturen kan vara konsekvent bland olika forum är layouten av varje forum annorlunda. Det sätt på vilket ett webbforum presenterar användarinläggen är också väldigt annorlunda än hur en nyhet webbplats presenterar en enda informationsinlägg. Allt detta gör navigering och extrahering av text en svår uppgift för webbskrapor.

Fokuset av detta examensarbete är utvecklingen av en webbskrapa specialiserad på forum. Tre olika metoder för textutvinning implementeras och testas innan man väljer den lämpligaste metoden för uppgiften. Metoderna är Word Count, Text Detection Framework och Text-to-Tag Ratio. Hanteringen av länk dubbleringar noga övervägd och löses genom att implementera ett flerlagers bloom filter. Examensarbetet genomförs med tillämpning av en kvalitativ metodik. Resultaten indikerar att Text-to-Tag Ratio har den bästa övergripande prestandan och ger det mest önskvärda resultatet i webbforum. Således var detta den valda metoden att behålla i den slutliga versionen av webbskrapan.

Place, publisher, year, edition, pages
2017. , p. 50
Series
TRITA-ICT-EX ; 2017:98
Keyword [en]
Data mining, Web Scraper, Java, Web forums, Text-extraction, Link Duplicates
Keyword [sv]
Data mining, Web Scraper, Java, Web forums, Text-extraction, Link Duplicates
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-219903OAI: oai:DiVA.org:kth-219903DiVA, id: diva2:1165892
Subject / course
Information and Communication Technology
Educational program
Bachelor of Science - Information and Communication Technology
Supervisors
Examiners
Available from: 2017-12-14 Created: 2017-12-14 Last updated: 2018-01-13Bibliographically approved

Open Access in DiVA

fulltext(2624 kB)74 downloads
File information
File name FULLTEXT01.pdfFile size 2624 kBChecksum SHA-512
e115f1af0cf0f343defdbbee00250018f3b682bc0d8d5aa8c323c3b072dfd00a3b39962da18cedd45ca0025782d186cf85be1afb846a6e5ea2e80ef20d61084e
Type fulltextMimetype application/pdf

By organisation
School of Information and Communication Technology (ICT)
Computer and Information Sciences

Search outside of DiVA

GoogleGoogle Scholar
Total: 74 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 157 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf