Parallel Bayesian Additive Regression Trees, using Apache Spark
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology.
2017 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

New methods have been developed to find patterns and trends in order to gain knowledge from large datasets in various disciplines, such as bioinformatics, consumer behavior in advertising, and weather forecasting. The goal of many of these methods is to construct prediction models from the data. Linear regression, which is widely used for analyzing data, is very powerful for detecting simple patterns, but higher complexity requires a more sophisticated solution. Regression trees split the problem into numerous parts, but they do not generalize well as they tend to have high variance. Ensemble methods, which combine collections of regression trees, solve that problem by spreading the model over numerous trees. Ensemble methods such as Random Forest, Gradient Boosted Trees and Bayesian Additive Regression Trees all construct prediction models from data in different ways. Using these models on large datasets is computationally demanding. The aim of this work is to explore a parallel implementation of Bayesian Additive Regression Trees (BART) using the Apache Spark framework. Spark is well suited to this task, as it is designed for iterative and data-intensive jobs. We show that our parallel implementation is about 35 times faster for a dataset of pig genomes. Most of the speed improvement is due to serial code modifications that minimize scanning of the data. The gain from parallelization is a speedup of 2.2x, obtained by using four cores on a quad-core system. Measurements on a computer cluster consisting of four computers resulted in a maximum speedup of 2.1x for eight cores. We should emphasize that these gains depend heavily on the size of the dataset.
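The abstract gives only the high-level design. As a rough, hypothetical illustration of the kind of data-parallel scan Spark makes convenient (all names, the StumpTree stand-in and the toy data below are assumptions for this sketch, not the thesis code), the Scala program below broadcasts a small tree ensemble and computes sum-of-trees residuals over a partitioned dataset in a single pass, the sort of full-data scan a BART sampler must repeat in every iteration.

// Hypothetical sketch (not the thesis implementation): one parallel
// residual scan over Spark-partitioned data for a tiny "ensemble".
import org.apache.spark.sql.SparkSession

object BartScanSketch {
  // Minimal stand-in for a BART regression tree: one split, two leaves.
  final case class StumpTree(featureIndex: Int, threshold: Double,
                             leftValue: Double, rightValue: Double) {
    def predict(x: Array[Double]): Double =
      if (x(featureIndex) <= threshold) leftValue else rightValue
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bart-scan-sketch")
      .master("local[4]")          // four local cores, mirroring a quad-core run
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy (features, response) pairs; a real run would load the genome data.
    val data = sc.parallelize(Seq(
      (Array(0.1, 2.0), 1.0),
      (Array(0.7, 1.5), 2.0),
      (Array(0.4, 3.2), 1.5)
    ), numSlices = 4)

    // The current ensemble is small and read-only, so broadcast it once
    // rather than shipping it with every task.
    val trees = Seq(StumpTree(0, 0.5, 0.2, 0.8), StumpTree(1, 2.5, -0.1, 0.3))
    val bcTrees = sc.broadcast(trees)

    // One parallel pass: sum-of-trees prediction and residual per record.
    val residuals = data.map { case (x, y) =>
      val fit = bcTrees.value.map(_.predict(x)).sum
      y - fit
    }

    println(s"Mean residual: ${residuals.mean()}")
    spark.stop()
  }
}

Whether distributing such a scan pays off depends on how much data each partition holds, which is consistent with the abstract's closing remark that the observed speedups are heavily dependent on dataset size.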

Place, publisher, year, edition, pages
2017, 35 p.
Series
IT, 17005
National Category
Engineering and Technology
Identifiers
URN: urn:nbn:se:uu:diva-322247
OAI: oai:DiVA.org:uu-322247
DiVA: diva2:1096369
Educational program
Master Programme in Computer Science
Supervisors
Examiners
Available from: 2017-05-17 Created: 2017-05-17 Last updated: 2017-05-17
Bibliographically approved

Open Access in DiVA

fulltext (537 kB), 38 downloads
File information
File name: FULLTEXT01.pdf
File size: 537 kB
Checksum: SHA-512
533be3c5c31454c0ded80ad9cec99b0ef03d4617241b11fd7638bb2577cea94275f33a58350e180ecafe3f45f41a6fc4fff0fd2602ab5a50d125b8ef198e8a13
Type: fulltext
Mimetype: application/pdf

By organisation
Department of Information Technology
Engineering and Technology

Total: 38 downloads
The number of downloads is the sum of all downloads of full texts. It may include, for example, previous versions that are no longer available.
