Implementation of the HadoopMapReduce algorithm on virtualizedshared storage systems
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
Context Hadoop is an open-source software framework developed for distributed storage and distributed processing of large sets of data. The implementation of the Hadoop MapReduce algorithm on virtualized shared storage by eliminating the concept of Hadoop Distributed File System (HDFS) is a challenging task. In this study, the Hadoop MapReduce algorithm is implemented on the Compuverde software that deals with virtualized shared storage of data.
Objectives In this study, the effect of using virtualized shared storage with Hadoop framework is identified. The main objective of this study is to design a method to implement the Hadoop MapReduce algorithm on Compuverde software that deals with virtualized shared storage of big data. Finally, the performance of the MapReduce algorithm on Compuverde shared storage (Compuverde File System - CVFS) is evaluated and compared to the performance of the MapReduce algorithm on HDFS.
Methods Initially a literature study is conducted to identify the effect of Hadoop implementation on virtualized shared storage. The Compuverde software is analyzed in detail during this literature study. The concepts of the MapReduce algorithms and the functioning of HDFS are scrutinized in detail. The next main research method that is adapted for this study is the implementation of a method where the Hadoop MapReduce algorithm is applied on the Compuverde software that deals with the virtualized shared storage by eliminating the HDFS. The next step is experimentation in which the performance of the implementation of the MapReduce algorithm on Compuverde shared storage (CVFS) in comparison with implementation of the MapReduce algorithm on Hadoop Distributed File System.
Results The experiment is conducted in two different scenarios namely the CPU bound scenario and I/O bound scenario. In CPU bound scenario, the average execution time of WordCount program has a linear growth with respect to size of data set. This linear growth is observed for both the file systems, HDFS and CVFS. The same is the case with I/O bound scenario. There is linear growth for both the file systems. When the averages of execution time are plotted on the graph, both the file systems perform similarly in CPU bound scenario(multi-node environment). In the I/O bound scenario (multi-node environment), HDFS slightly out performs CVFS when the size of 1.0GB and both the file systems performs without much difference when the size of data set is 0.5GB and 1.5GB.
Conclusions The MapReduce algorithm can be implemented on live data present in the virtualized shared storage systems without copying data into HDFS. In single node environment, distributed storage systems perform better than shared storage systems. In multi-node environment, when the CPU bound scenario is considered, both HDFS and CVFS file systems perform similarly. On the other hand, HDFS performs slightly better than CVFS for 1.0GB of data set in the I/O bound scenario. Hence we can conclude that distributed storage systems perform similar to the shared storage systems in both CPU bound and I/O bound scenarios in multi-node environment.
Place, publisher, year, edition, pages
2016. , 35 p.
Blekinge Tekniska Högskola Forskningsrapport, ISSN 1103-1581
Hadoop, virtualized systems, shared storage, MapReduce, Hadoop Distributed File System
IdentifiersURN: urn:nbn:se:bth-11876OAI: oai:DiVA.org:bth-11876DiVA: diva2:926050
Subject / course
DV2566 Master's Thesis (120 credits) in Computer Science
DVACS Master of Science Programme in Computer Science
2016-01-25, J1640, Blekinge Institute of Technology, Karlskrona, 09:00 (English)
Lundberg, Lars, Professor and Head of Computer Science Department