Digitala Vetenskapliga Arkivet

Vertical scaling for big data analytics and processing - A case study
Mälardalen University, School of Innovation, Design and Engineering.
2025 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

Big data is often used as a buzzword, yet despite its common use the term remains poorly understood. This paper explores new methods of handling big data that build on advances in big data tools, cloud computing capabilities and file storage. It focuses on newer tools such as Polars, DuckDB and Daft to determine how mature they are and whether they deliver on their promises. In a busy environment consisting of a 4-core machine with 16 GB of RAM, DuckDB and Daft were able to analyze and process datasets ranging from a few million rows to over a billion rows. While Polars fell behind as the data size grew, it proved formidable for data engineering tasks consisting of data ingestion, wrangling and I/O. Because these tools can perform out-of-core computations on datasets many times larger than RAM, and cloud vendors offer single machines with thousands of cores and tens of terabytes of RAM, analyzing and processing big data has never been simpler. Reduced code complexity, reduced environment complexity, lower compute costs and a lower skill barrier for data analytics and engineering problems are a few of the advantages over distributed big data tools such as Apache Hadoop and Spark.
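
To make the term "out-of-core computation" concrete, the following is a minimal sketch using only the Python standard library. It is not how DuckDB, Daft or Polars implement it, and it is not taken from the thesis. The file layout (columns `key` and `value`), the helper names and the chunk size are all hypothetical, chosen for illustration. The idea shown is that a group-by sum can be computed while holding only a bounded chunk of rows in memory, so the file may be far larger than RAM.

```python
import csv
import os
import tempfile
from collections import defaultdict
from itertools import islice


def write_sample_csv(path, n_rows):
    """Write a synthetic dataset (hypothetical schema: key,value)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["key", "value"])
        for i in range(n_rows):
            writer.writerow([f"k{i % 3}", i])


def chunked_group_sum(path, chunk_rows=1000):
    """Group-by sum that keeps at most `chunk_rows` rows in memory at once."""
    totals = defaultdict(int)
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            # Pull the next bounded chunk from the file; stop when exhausted.
            chunk = list(islice(reader, chunk_rows))
            if not chunk:
                break
            # Fold the chunk into the running aggregates, then discard it.
            for row in chunk:
                totals[row["key"]] += int(row["value"])
    return dict(totals)


def demo(n_rows=10_000):
    """Build a temporary sample file and aggregate it chunk by chunk."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.csv")
        write_sample_csv(path, n_rows)
        return chunked_group_sum(path)
```

Only the running totals persist between chunks, so memory use depends on the chunk size and the number of distinct keys rather than on the number of rows. Engines of the kind named in the abstract apply the same principle to columnar formats, with far more sophistication.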

Place, publisher, year, edition, pages
2025. p. 100
Keywords [en]
Vertical scaling, Big data, Analytics, Pandas, Polars, Duckdb, Dask, Daft, Parquet, Arrow
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:mdh:diva-70150
OAI: oai:DiVA.org:mdh-70150
DiVA, id: diva2:1937826
Subject / course
Computer Science
Available from: 2025-02-17
Created: 2025-02-14
Last updated: 2025-02-17
Bibliographically approved

Open Access in DiVA

VerticalScalingForBigData (3857 kB), 224 downloads
File information
File name: FULLTEXT01.pdf
File size: 3857 kB
Checksum SHA-512:
e8358719697b524bb130fbf343337638b31540106d219381829dc52cf46ae803502b9d5071b410f840144abb58e920d4b0fc016076a63db0502dc49e1e878e6b
Type: fulltext
Mimetype: application/pdf

Author
Asplin, Mathias
