Extractive Multi-document Summarization of News Articles
Linköping University, Department of Computer and Information Science.
2019 (English). Independent thesis, Advanced level (degree of Master, Two Years), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

Publicly available data grows exponentially through web services and technological advancements. Multi-document summarization (MDS) can be used to comprehend such large data streams. This thesis investigates the area of multi-document summarization. Multiple systems for extractive multi-document summarization are implemented using modern techniques, in the form of the pre-trained BERT language model for word embeddings and sentence classification, combined with well-proven techniques: the TextRank ranking algorithm, the Waterfall architecture, and anti-redundancy filtering. The systems are evaluated on the DUC-2002, 2006, and 2007 datasets using the ROUGE metric. The results show that the BM25 sentence representation, implemented in the TextRank model together with the Waterfall architecture and an anti-redundancy technique, outperforms the other implementations and is competitive with other state-of-the-art systems. A cohesive model is derived from the leading system and evaluated in a user study using a real-world application: a real-time news detection system with users from the news domain. The study shows a clear preference for cohesive summaries in extractive multi-document summarization, with the cohesive summary preferred in the majority of cases.
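
As an illustration of the approach the abstract describes, here is a minimal, self-contained sketch of TextRank over sentences with BM25 as the pairwise similarity function and a naive token-overlap anti-redundancy filter. This is not the thesis's implementation: all parameter values (k1, b, damping factor, redundancy cutoff) and the graph-construction details are assumptions for illustration, and the BERT and Waterfall components are omitted.

```python
import math
import re
from collections import Counter


def bm25(query, doc, idf, avg_len, k1=1.2, b=0.75):
    """BM25 score of one sentence (treated as a query) against another (the document)."""
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        f = tf[term]
        if f:
            norm = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avg_len))
            score += idf.get(term, 0.0) * norm
    return score


def textrank_bm25_summary(sentences, top_k=3, damping=0.85, iters=50, cutoff=0.5):
    toks = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(toks)
    avg_len = sum(map(len, toks)) / n
    # IDF computed over the sentences themselves, each treated as a tiny document.
    df = Counter(t for sent in toks for t in set(sent))
    idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}
    # Pairwise BM25 similarities define a weighted, directed sentence graph.
    sim = [[bm25(toks[i], toks[j], idf, avg_len) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out = [sum(row) or 1.0 for row in sim]  # row sums, used to normalize edges
    # Power iteration over the row-normalized graph, as in PageRank/TextRank.
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [(1 - damping) / n
                + damping * sum(sim[j][i] / out[j] * rank[j] for j in range(n))
                for i in range(n)]

    def overlap(a, b):
        """Naive anti-redundancy measure: token-set overlap ratio."""
        sa, sb = set(a), set(b)
        return len(sa & sb) / max(1, min(len(sa), len(sb)))

    # Greedy selection by rank, skipping sentences too similar to ones already chosen.
    chosen = []
    for i in sorted(range(n), key=lambda k: -rank[k]):
        if all(overlap(toks[i], toks[j]) < cutoff for j in chosen):
            chosen.append(i)
        if len(chosen) == top_k:
            break
    return [sentences[i] for i in sorted(chosen)]  # restore document order


sents = [
    "The storm caused widespread flooding across the region on Monday.",
    "Widespread flooding across the region left thousands of homes without power.",
    "Rescue crews worked through the night to reach stranded residents.",
    "Officials estimate that storm damage will exceed ten million dollars.",
]
print(textrank_bm25_summary(sents, top_k=2))
```

Because BM25 is asymmetric, the resulting sentence graph is directed; row-normalizing the similarity matrix turns it into a transition matrix for the PageRank-style power iteration, and the greedy overlap filter then keeps near-duplicate sentences (such as the first two above) from both entering the summary.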

Place, publisher, year, edition, pages
2019, p. 32.
Keywords [en]
NLP, extractive summarization, multi-document, neural embeddings, information extraction, text-to-text generation, textrank, BERT, attention, Transformer, transfer learning, fine-tuning, ROUGE
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:liu:diva-158275
ISRN: LIU-IDA/LITH-EX-A--19/025--SE
OAI: oai:DiVA.org:liu-158275
DiVA, id: diva2:1332278
External cooperation
Svenska Dagbladet
Subject / course
Computer science
Available from: 2019-06-28. Created: 2019-06-28. Last updated: 2019-06-28. Bibliographically approved.

Open Access in DiVA

fulltext (451 kB)
File information
File name: FULLTEXT01.pdf
File size: 451 kB
Checksum (SHA-512): 8c1b36d22c6438ad0bf062c0d6422a560d0fbce45a376be4cbae488e6f9532ab65af35558d53be00da7664538f723f382d17a8d0e21d8ff032b8f4f3d03e6e59
Type: fulltext
Mimetype: application/pdf

By author/editor
Grant, Harald
By organisation
Department of Computer and Information Science
Language Technology (Computational Linguistics)
