A Cloud Based Platform for Big Data Science
Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology. (PELAB)
2014 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 30 credits / 45 HE credits. Student thesis
Abstract [en]

With the advent of cloud computing, resizable, scalable infrastructures for data processing are now available to everyone. Software platforms and frameworks that support data-intensive distributed applications, such as Amazon Web Services and Apache Hadoop, give users the tools and infrastructure needed to work with thousands of scalable computers and process terabytes of data. However, writing scalable applications that run on top of these distributed frameworks is still a demanding and challenging task. The thesis aimed to advance the core scientific and technological means of managing, analyzing, visualizing, and extracting useful information from large data sets, collectively known as “big data”. The term “big data” in this thesis refers to large, diverse, complex, longitudinal and/or distributed data sets generated from instruments, sensors, internet transactions, email, social networks, Twitter streams, and/or all digital sources available today and in the future. We introduced architectures and concepts for implementing a cloud-based infrastructure for analyzing large volumes of semi-structured and unstructured data. We built and evaluated an application prototype for collecting, organizing, processing, visualizing and analyzing data from the retail industry gathered from indoor navigation systems and social networks (Twitter, Facebook, etc.). Our finding was that developing a large-scale data analysis platform is often quite complex when the processed data is expected to grow continuously in the future. The appropriate architecture depends on the requirements. If the goal is to build a data warehouse and analyze the data afterwards (batch processing), the best choice is Hadoop clusters with Pig or Hive. This architecture has been proven at Facebook and Yahoo for years. On the other hand, if the application involves real-time data analytics, the recommendation is Hadoop clusters with Storm, which has been used successfully at Twitter.
After evaluating the developed prototype, we introduced a new architecture that is able to handle large-scale batch and real-time data. We also proposed an upgrade of the existing prototype to handle real-time indoor navigation data.

Place, publisher, year, edition, pages
2014. 57 p.
Keyword [en]
Big Data, Data Analysis, Hadoop, Hive, Sentiment Analysis, Predictive Analysis, Fraud Detection, Big data concepts, NoSQL Databases, Amazon AWS, Windows Azure, Data Visualization, Lambda architecture
National Category
Software Engineering
URN: urn:nbn:se:liu:diva-103700
ISRN: LIU-IDA/LITH-EX-A--14/006--SE
OAI: diva2:690525
Subject / course
Computer and information science at the Institute of Technology
2013-08-29, Allen Newell, Building E, one floor up (level 3), Linköping, 10:15 (English)
Available from: 2014-01-27. Created: 2014-01-23. Last updated: 2014-01-27. Bibliographically approved

Open Access in DiVA

A Cloud Based Platform for Big Data Science (1633 kB), 1539 downloads
File information
File name: FULLTEXT01.pdf
File size: 1633 kB
Checksum: SHA-512
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Islam, Md. Zahidul
By organisation
Software and Systems
The Institute of Technology
Software Engineering

Total: 1539 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 1366 hits