Random forest och glesa datarepresentationer
University of Borås, School of Business and IT.
2012 (English) Independent thesis Basic level (degree of Bachelor), Student thesis
Alternative title
Random forest using sparse data structures (English)
Abstract [en]

In silico experimentation is the process of using computational and statistical models to predict medicinal properties in chemicals; as a means of reducing lab work and increasing success rates, this process has become an important part of modern drug development. There are various ways of representing molecules; the problem that motivated this paper derives from collecting substructures of a chemical into what are known as fractional representations. Assembling large sets of molecules represented in this way results in sparse data, where a large portion of the set consists of null values. This consumes an excessive amount of computer memory, which limits the size of the data sets that can be used when constructing predictive models. In this study, we suggest a set of criteria for evaluating random forest implementations to be used for in silico predictive modeling on sparse data sets, with regard to computer memory usage, model construction time and predictive accuracy. A novel random forest system was implemented to meet the suggested criteria, and experiments were made to compare our implementation to existing machine learning algorithms to establish our implementation's correctness. Experimental results show that our random forest implementation can create accurate prediction models on sparse data sets, with lower memory overhead than implementations using a common matrix representation, and in less time than the existing random forest implementations it was evaluated against. We highlight design choices made to accommodate sparse data structures and data sets in the random forest ensemble technique, and therein present potential improvements to feature selection in sparse data sets.
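The memory argument in the abstract can be illustrated with a short sketch. The code below is not the thesis's own implementation; it uses scipy.sparse and scikit-learn (tools assumed for illustration, not mentioned in the record) to show how a mostly-zero substructure matrix shrinks under a compressed sparse row (CSR) representation while still being usable for training an off-the-shelf random forest. The molecule count, feature count and density are made-up illustrative numbers.

    # Minimal sketch (assumptions labeled): sparse "fractional" substructure
    # features vs. a dense matrix, and a random forest trained on the sparse form.
    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    # Hypothetical data set: 5,000 molecules, 20,000 substructure features,
    # with roughly 1% of the entries non-zero.
    n_molecules, n_features, density = 5_000, 20_000, 0.01
    X_sparse = sparse_random(n_molecules, n_features, density=density,
                             format="csr", random_state=0)
    y = rng.integers(0, 2, size=n_molecules)      # made-up binary activity labels

    # Memory comparison: a dense float64 matrix vs. the CSR arrays actually stored.
    dense_bytes = n_molecules * n_features * 8
    sparse_bytes = (X_sparse.data.nbytes
                    + X_sparse.indices.nbytes
                    + X_sparse.indptr.nbytes)
    print(f"dense:  {dense_bytes / 1e6:.0f} MB")
    print(f"sparse: {sparse_bytes / 1e6:.0f} MB")

    # scikit-learn's random forest accepts the sparse matrix directly.
    forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    forest.fit(X_sparse, y)
    print("training accuracy:", forest.score(X_sparse, y))

For the numbers above, the dense matrix would occupy roughly 800 MB while the CSR arrays stay around 12 MB, which is the kind of gap that motivates the sparse-representation design choices discussed in the thesis.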

Place, publisher, year, edition, pages
University of Borås/School of Business and IT, 2012.
Series
Kandidatuppsats ; 2012KSAI01
Keywords [en]
data mining, machine learning, regression, classification, in silico modeling, random forest, sparse data, feature selection
National Category
Engineering and Technology
Identifiers
URN: urn:nbn:se:hb:diva-16672
Local ID: 2320/11145
OAI: oai:DiVA.org:hb-16672
DiVA, id: diva2:1308560
Note
Program: Systemarkitekturutbildningen
Available from: 2019-04-30 Created: 2019-04-30

Open Access in DiVA

fulltext (1457 kB), 102 downloads
File information
File name: FULLTEXT01.pdf
File size: 1457 kB
Checksum SHA-512: 479aabf06476f09e5bc1e44427d8b1fb00c850abf8e4c81baf72128b36f72afe1d36fb953adc93508024b0713eb3656361ca441f75407c67747e5360fb1fb4ba
Type: fulltext
Mimetype: application/pdf

