Web Scraping Scientific Repositories for Augmented Relevant Literature Search Using CRISP-DM
Hassanien, Hossam. Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Digital Services and Systems. ORCID iD: 0000-0002-1095-8437
2019 (English). In: Applied System Innovation, ISSN 2571-5577, Vol. 2, no 4, article id 37. Article in journal (Refereed), Published
Abstract [en]

Scientific web repositories are central online locations where academic papers are stored and maintained. Because the information and metadata within these repositories are unstructured or semi-structured, literature analysis for scholarly writing becomes a challenge. Applying CRISP-DM offers a way to address this challenge by formulating an augmented process for relevant literature search. However, almost no repository offers a straightforward method for extracting metadata for the preliminary data processing that the CRISP-DM process requires. Additionally, most repositories do not follow open access standards. Up to the time this paper was published, the topic of augmented, relevant literature search had seen only methodological progress; the underlying methods could not be applied on a larger scale, given constraints on data access to open access repositories. The aim of this paper is to propose CRISP-DM as an augmented research methodology with a focus on web scraping as part of the data processing step. To substantiate the proposed methodology, a role-play case study is conducted. The work aims to alleviate these restrictions and to encourage wider adoption of the augmented analysis process for relevant literature search within the research community.

Place, publisher, year, edition, pages
MDPI, 2019. Vol. 2, no 4, article id 37
Keywords [en]
web scraping, web crawling, CRISP-DM, text mining, relevant literature search, research methodology
National Category
Information Systems, Social aspects
Research subject
Information systems
Identifiers
URN: urn:nbn:se:ltu:diva-77090
DOI: 10.3390/asi2040037
OAI: oai:DiVA.org:ltu-77090
DiVA, id: diva2:1376063
Available from: 2019-12-06 Created: 2019-12-06 Last updated: 2019-12-09 Bibliographically approved

Open Access in DiVA

fulltext (2232 kB), 4 downloads
File information
File name: FULLTEXT01.pdf
File size: 2232 kB
Checksum SHA-512: 4d33abbbf1eccafc4656d1ba614b9b54a05971a17246e66b4871f50786db552f2bda62b11cd57b2e0264dcc41d5d8b57a10428ff41ed776dabd94a3e7007bf93
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Hassanien, Hossam
By organisation
Digital Services and Systems
Information Systems, Social aspects
