Dataset versioning for Hops File System: Snapshotting solution for reliable and reproducible data science experiments
KTH, School of Information and Communication Technology (ICT).
2017 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

As awareness of the potential of Big Data grows, more and more companies are creating their own Data Science divisions, and their projects are becoming large and complex, handled by large multidisciplinary teams. Furthermore, with the expansion of fields such as Deep Learning, Data Science is becoming a very popular research field in both companies and universities.

In this context, it becomes crucial for Data Scientists to be able to reproduce their experiments and test them against earlier models developed on previous versions of a dataset. This Master's thesis project presents the design and implementation of a snapshotting system for the distributed file system HopsFS, which is based on Apache HDFS and developed at the Swedish Institute of Computer Science (SICS).

This project improves on previous solutions designed for both HopsFS and HDFS by solving problems such as the handling of incomplete blocks in snapshots, while also adding new features such as automatic snapshots, which allow users to undo the last few changes made to a file.

Finally, the implementation was evaluated against the previous state of HopsFS to measure the solution's impact on the different operations performed by the system. The analysis showed an increase of around 40% in the time needed to perform operations such as read and write under different workloads, due mostly to the new database queries used in this solution.

Abstract [sv] (translated to English)

As awareness of the potential of Big Data grows, more and more companies are starting to create their own data science departments, and their projects are becoming large and complex, handled by large multidisciplinary teams. Furthermore, with the expansion of fields such as Deep Learning, data science is becoming a very popular research area in both companies and universities.

In this context, it becomes crucial for data scientists to be able to reproduce their experiments and test them against earlier models developed on previous versions of a dataset. This Master's project presents the design and implementation of a snapshot system for the distributed file system HopsFS, which is based on Apache HDFS and developed at SICS.

This project improves on previous solutions designed for both HopsFS and HDFS by solving problems such as the handling of incomplete blocks in snapshots, while also adding new features such as automatic snapshots so that users can undo the most recent changes to a file.

Finally, an analysis of the implementation was carried out to compare it with the previous state of HopsFS and to calculate the solution's impact on the different operations performed by the system. This analysis showed an increase of around 40% in the time needed to perform operations such as read and write under different workloads, mostly attributable to this solution.

Place, publisher, year, edition, pages
2017. p. 57
Series
TRITA-ICT-EX ; 2017:159
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-215722
OAI: oai:DiVA.org:kth-215722
DiVA, id: diva2:1149125
Subject / course
Computer Science
Educational program
Master of Science - Computer Science
Supervisors
Examiners
Available from: 2017-10-13. Created: 2017-10-13. Last updated: 2018-01-13. Bibliographically approved.

Open Access in DiVA

fulltext (1786 kB), 13 downloads
File information
File name: FULLTEXT01.pdf
File size: 1786 kB
Checksum (SHA-512): dc59a150b5b272cb724da39b7f08b97f4f27c24afed9b3cc3dae3a286a44d357fb79ac2f7e737f712f257b3dfb656b0d7487b133c8ae3258d15d8b1a9c2b5dad
Type: fulltext
Mimetype: application/pdf
