Digitala Vetenskapliga Arkivet

Order in the random forest
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2017 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In many domains, repeated measurements are systematically collected to obtain the characteristics of objects or situations that evolve over time or other logical orderings. Although the classification of such data series shares many similarities with traditional multidimensional classification, inducing accurate machine learning models using traditional algorithms is typically infeasible since the order of the values must be considered.

In this thesis, the challenges related to inducing predictive models from data series using a class of algorithms known as random forests are studied for the purpose of efficiently and effectively classifying (i) univariate, (ii) multivariate and (iii) heterogeneous data series either directly in their sequential form or indirectly as transformed to sparse and high-dimensional representations. In the thesis, methods are developed to address the challenges of (a) handling sparse and high-dimensional data, (b) data series classification and (c) early time series classification using random forests. The proposed algorithms are empirically evaluated in large-scale experiments and practically evaluated in the context of detecting adverse drug events.

In the first part of the thesis, it is demonstrated that minor modifications to the random forest algorithm and the use of a random projection technique can improve the effectiveness of random forests when faced with discrete data series projected to sparse and high-dimensional representations. In the second part of the thesis, an algorithm for inducing random forests directly from univariate, multivariate and heterogeneous data series using phase-independent patterns is introduced and shown to be highly effective in terms of both computational and predictive performance. Then, leveraging the notion of phase-independent patterns, the random forest is extended to allow for early classification of time series and is shown to perform favorably when compared to alternatives. The conclusions of the thesis not only reaffirm the empirical effectiveness of random forests for traditional multidimensional data but also indicate that the random forest framework can, with success, be extended to sequential data representations.

Place, publisher, year, edition, pages
Stockholm: Department of Computer and Systems Sciences, Stockholm University, 2017, p. 76
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 17-004
Keywords [en]
Machine learning, random forest, ensemble, time series, data series, sequential data, sparse data, high-dimensional data
National Category
Computer and Information Sciences
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-142052
ISBN: 978-91-7649-827-9 (print)
ISBN: 978-91-7649-828-6 (electronic)
OAI: oai:DiVA.org:su-142052
DiVA, id: diva2:1090364
Public defence
2017-06-08, L30, NOD-huset, Borgarfjordsgatan 12, Stockholm, 13:00 (English)
Opponent
Supervisors
Funder
Swedish Foundation for Strategic Research, IIS11-0053
Available from: 2017-05-16 Created: 2017-04-24 Last updated: 2022-02-28 Bibliographically approved
List of papers
1. Predicting Adverse Drug Events by Analyzing Electronic Patient Records
2013 (English). In: Artificial Intelligence in Medicine: 14th Conference on Artificial Intelligence in Medicine, AIME 2013. Proceedings / [ed] Niels Peek, Roque Marín Morales, Mor Peleg, Springer Berlin/Heidelberg, 2013, Vol. 7885, p. 125-129. Conference paper, Published paper (Refereed)
Abstract [en]

Diagnosis codes for adverse drug events (ADEs) are sometimes missing from electronic patient records (EPRs). In the worst case, this may affect not only patient safety but also the number of reported ADEs, resulting in incorrect risk estimates of prescribed drugs. Large databases of EPRs are potentially valuable sources of information to support the identification of ADEs. This study investigates the use of machine learning for predicting one specific ADE based on information extracted from EPRs, including age, gender, diagnoses and drugs. Several predictive models are developed and evaluated using different learning algorithms and feature sets. The highest observed AUC is 0.87, obtained by the random forest algorithm. The resulting model can be used for screening EPRs that are not, but possibly should be, assigned a diagnosis code for the ADE under consideration. Preliminary results from using the model are presented.

Place, publisher, year, edition, pages
Springer Berlin/Heidelberg, 2013
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 7885
Keywords
machine learning, electronic patient records, adverse drug events
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-96366 (URN)
10.1007/978-3-642-38326-7_19 (DOI)
978-3-642-38325-0 (ISBN)
978-3-642-38326-7 (ISBN)
Conference
14th Conference on Artificial Intelligence in Medicine, AIME 2013, Murcia, Spain, May 29 – June 1, 2013
Available from: 2013-11-20 Created: 2013-11-20 Last updated: 2022-02-24 Bibliographically approved
2. Dimensionality Reduction with Random Indexing: An Application on Adverse Drug Event Detection using Electronic Health Records
2014 (English). In: IEEE 27th International Symposium on Computer-Based Medical Systems, New York: IEEE Computer Society, 2014, p. 304-307. Conference paper, Published paper (Refereed)
Abstract [en]

Although electronic health records (EHRs) have recently become an important data source for drug safety signal detection, which is usually evaluated in clinical trials, the use of such data is often made prohibitive by high dimensionality and limited computing resources. Several methods for reducing dimensionality have been developed, used and evaluated within the medical domain. While these methods perform well, their computational cost tends to increase with growing dimensionality. An alternative solution is random indexing, a technique commonly employed in text classification to reduce the dimensionality of large and sparse documents. This study explores how the predictive performance of random forest is affected by dimensionality reduction through random indexing when predicting adverse drug events (ADEs). Data are extracted from EHRs and the task is to predict whether or not a patient should be assigned an ADE-related diagnosis code. Four different dimensionality settings are investigated and their sensitivity, specificity and area under ROC curve are reported for 14 data sets. The results show that, for the investigated data sets, the predictive performance is not negatively affected by dimensionality reduction, while the computational cost is significantly reduced. Therefore, this study concludes that applying random indexing on EHR data reduces the computational cost while retaining the predictive performance.
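
As a rough illustration of the idea behind random indexing (not the configuration used in the paper), the sketch below assigns every original feature a sparse ternary index vector and sums those vectors for the features present in each observation. The function names, the reduced dimensionality of 8 and the two non-zero entries per index vector are assumptions chosen only for this example.

```python
import random


def make_index_vector(dim, nonzeros, rng):
    """Sparse ternary index vector stored as {position: +1 or -1}."""
    positions = rng.sample(range(dim), nonzeros)
    return {pos: rng.choice((-1, 1)) for pos in positions}


def random_indexing(rows, n_features, dim=8, nonzeros=2, seed=0):
    """Project sparse rows ({feature: value}) onto `dim` dense columns."""
    rng = random.Random(seed)
    # One fixed random index vector per original feature (assumed setup).
    index_vectors = [make_index_vector(dim, nonzeros, rng)
                     for _ in range(n_features)]
    reduced = []
    for row in rows:
        out = [0.0] * dim
        for feature, value in row.items():
            # Add the feature's index vector, weighted by the feature value.
            for pos, sign in index_vectors[feature].items():
                out[pos] += sign * value
        reduced.append(out)
    return reduced


# Three sparse observations over 1000 hypothetical drug/diagnosis codes.
rows = [{0: 1.0, 999: 1.0}, {17: 2.0}, {}]
reduced = random_indexing(rows, n_features=1000, dim=8, nonzeros=2)
```

Each observation ends up with 8 columns regardless of how many of the 1000 original features exist, which is what makes the representation cheaper to hand to a learner such as a random forest.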

Place, publisher, year, edition, pages
New York: IEEE Computer Society, 2014
Series
IEEE International Symposium on Computer-Based Medical Systems, ISSN 1063-7125
Keywords
dimensionality reduction, random forest, random indexing, electronic health records, adverse drug events
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-110975 (URN)
10.1109/CBMS.2014.22 (DOI)
000345222200060 ()
978-1-4799-4435-4 (ISBN)
Conference
27th IEEE International Symposium on Computer-Based Medical Systems (CBMS), New York, USA, May 27-29, 2014
Available from: 2014-12-19 Created: 2014-12-19 Last updated: 2022-02-23 Bibliographically approved
3. Handling Sparsity with Random Forests when Predicting Adverse Drug Events from Electronic Health Records
2014 (English). In: IEEE International Conference on Healthcare Informatics (ICHI): Proceedings, IEEE Computer Society, 2014, p. 17-22. Conference paper, Published paper (Refereed)
Abstract [en]

When using electronic health record (EHR) data to build models for predicting adverse drug events (ADEs), one typically faces the problem of data sparsity, i.e., drugs and diagnosis codes that could be used for predicting a certain ADE are absent for most observations. For such tasks, the ability of the employed machine learning technique to handle sparsity effectively is crucial. The state-of-the-art random forest algorithm is frequently employed to handle this type of data. It has, however, recently been demonstrated that the algorithm is biased towards the majority class, which may result in a low predictive performance on EHR data with large numbers of sparse features. In this study, approaches to handle this problem are empirically evaluated using 14 ADE datasets and three performance metrics: F1-score, AUC and Brier score. Two resampling-based techniques are investigated and compared to two baseline approaches. The experimental results indicate that, for larger forests, the resampling methods outperform the baseline approaches when considering F1-score, which is consistent with the metric being affected by class bias. The approaches perform on a similar level with respect to AUC, which can be explained by the metric not being sensitive to class bias. Finally, when considering the squared error (Brier score) of individual predictions, one of the baseline approaches turns out to be ahead of the others. A bias-variance analysis shows that this is an effect of the individual trees being more correct on average for the baseline approach and that this outweighs the expected loss from a lower variance. The main conclusion is that the suggested choice of approach to handle sparsity is highly dependent on the performance metric, or the task, of interest. If the task is to accurately assign an ADE to a patient record, a sampling-based approach is recommended. If the task is to rank patients according to risk of a certain ADE, the choice of approach is of minor importance. Finally, if the task is to accurately assign probabilities for a certain ADE, then one of the baseline approaches is recommended.
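
The three metrics answer different questions: F1-score judges hard label assignments, AUC judges ranking, and the Brier score judges the probabilities themselves. The following minimal sketch computes all three on invented toy predictions; the labels, probabilities and the 0.5 decision threshold are assumptions for illustration and are unrelated to the 14 ADE datasets.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Probability that a random positive is ranked above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and outcomes."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)


# Toy data: two positives among six records (made up for this example).
y_true = [1, 0, 0, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.1, 0.45, 0.6]
y_pred = [1 if p >= 0.5 else 0 for p in probs]  # assumed 0.5 threshold

f1 = f1_score(y_true, y_pred)      # 0.5
ranking = auc(y_true, probs)       # 0.875
brier = brier_score(y_true, probs)
```

On this toy data the ranking is fairly good (AUC 0.875) while the thresholded labels are mediocre (F1 0.5), which shows how the same predictions can look different depending on the metric of interest.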

Place, publisher, year, edition, pages
IEEE Computer Society, 2014
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-110994 (URN)
10.1109/ICHI.2014.10 (DOI)
978-1-4799-5701-9 (ISBN)
Conference
IEEE International Conference on Healthcare Informatics, Verona, Italy, 15-17 September 2014
Available from: 2014-12-19 Created: 2014-12-19 Last updated: 2022-02-23 Bibliographically approved
4. Forests of Randomized Shapelet Trees
2015 (English). In: Statistical Learning and Data Sciences: Proceedings / [ed] Alexander Gammerman, Vladimir Vovk, Harris Papadopoulos, Springer, 2015, p. 126-136. Conference paper, Published paper (Refereed)
Abstract [en]

Shapelets have recently been proposed for data series classification, due to their ability to capture phase-independent and local information. Decision trees based on shapelets have been shown to provide not only interpretable models, but also, in many cases, state-of-the-art predictive performance. Shapelet discovery is, however, computationally costly, and although several techniques for speeding up discovery have been proposed, the computational cost is still in many cases prohibitive. In this work, an ensemble-based method, referred to as Random Shapelet Forest (RSF), is proposed, which builds on the success of the random forest algorithm, and which is shown to have a lower computational complexity than the original shapelet tree learning algorithm. An extensive empirical investigation shows that the algorithm provides competitive predictive performance and that a proposed way of calculating importance scores can be used to successfully identify influential regions.
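
To make "phase-independent and local" concrete, the sketch below uses a common definition of the distance between a shapelet and a data series: the smallest Euclidean distance between the shapelet and any equally long subsequence. The abstract does not specify the distance measure, so plain Euclidean distance without normalization is an assumption here, as are the toy series.

```python
import math


def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between the shapelet and any window."""
    m = len(shapelet)
    best = math.inf
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        dist = math.sqrt(sum((w - s) ** 2 for w, s in zip(window, shapelet)))
        best = min(best, dist)
    return best


bump = [0.0, 1.0, 0.0]                          # candidate shapelet
early = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]     # bump near the start
late = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]      # same bump near the end
flat = [0.0] * 7                                # no bump at all

d_early = shapelet_distance(early, bump)  # 0.0
d_late = shapelet_distance(late, bump)    # 0.0
d_flat = shapelet_distance(flat, bump)    # 1.0
```

The two series containing the bump get the same distance no matter where the bump occurs, while the flat series does not, so a threshold on this distance can serve as a decision-tree test.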

Place, publisher, year, edition, pages
Springer, 2015
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 9047
Keywords
data series classification, shapelets, decision trees, ensemble
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122820 (URN)
10.1007/978-3-319-17091-6_8 (DOI)
000361990900008 ()
978-3-319-17090-9 (ISBN)
978-3-319-17091-6 (ISBN)
Conference
Third International Symposium, SLDS 2015, Egham, UK, April 20-23, 2015
Available from: 2015-11-11 Created: 2015-11-10 Last updated: 2022-02-23 Bibliographically approved
5. Multi-channel ECG classification using forests of randomized shapelet trees
2015 (English). In: Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive Environments, Association for Computing Machinery (ACM), 2015, article id 43. Conference paper, Published paper (Refereed)
Abstract [en]

Data series of multiple channels occur at high rates and in massive quantities in several application domains, such as healthcare. In this paper, we study the problem of multi-channel ECG classification. We map this problem to multivariate data series classification and propose five methods for solving it, using a split-and-combine approach. The proposed framework is evaluated using three base-classifiers on real-world data for detecting Myocardial Infarction. Extensive experiments are performed on real ECG data extracted from the Physiobank data repository. Our findings emphasize the importance of selecting an appropriate base-classifier for multivariate data series classification, while demonstrating the superiority of the Random Shapelet Forest (0.825 accuracy) against competitor methods (0.664 accuracy for 1-NN under cDTW).

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2015
Keywords
Data series, classification, multi-channel
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122846 (URN)
10.1145/2769493.2769520 (DOI)
978-1-4503-3452-5 (ISBN)
Conference
PETRA '15, 8th Pervasive Technologies Related to Assistive Environments, Corfu, Greece, July 01 - 03, 2015
Available from: 2015-11-11 Created: 2015-11-10 Last updated: 2022-02-23 Bibliographically approved
6. Generalized random shapelet forests
2016 (English). In: Data mining and knowledge discovery, ISSN 1384-5810, E-ISSN 1573-756X, Vol. 30, no 5, p. 1053-1085. Article in journal (Refereed) Published
Abstract [en]

Shapelets are discriminative subsequences of time series, usually embedded in shapelet-based decision trees. The enumeration of time series shapelets is, however, computationally costly, which, in addition to the inherent difficulty of the decision tree learning algorithm in effectively handling high-dimensional data, severely limits the applicability of shapelet-based decision tree learning from large (multivariate) time series databases. This paper introduces a novel tree-based ensemble method for univariate and multivariate time series classification using shapelets, called the generalized random shapelet forest algorithm. The algorithm generates a set of shapelet-based decision trees, where both the choice of instances used for building a tree and the choice of shapelets are randomized. For univariate time series, it is demonstrated through an extensive empirical investigation that the proposed algorithm yields predictive performance comparable to the current state-of-the-art and significantly outperforms several alternative algorithms, while being at least an order of magnitude faster. Similarly, for multivariate time series, it is shown that the algorithm is significantly less computationally costly and more accurate than the current state-of-the-art.
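
The two sources of randomness named in the abstract, random instances and random shapelets, can be sketched for a single tree node as follows. This is a simplified illustration for univariate series under stated assumptions, not the published algorithm: the helper names, the Gini impurity criterion, the shapelet length bounds and the number of candidates are all choices made for this example.

```python
import math
import random


def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between the shapelet and any window."""
    m = len(shapelet)
    return min(
        math.sqrt(sum((series[i + j] - shapelet[j]) ** 2 for j in range(m)))
        for i in range(len(series) - m + 1)
    )


def bootstrap(X, y, rng):
    """Randomize the instances: sample with replacement, as in bagging."""
    idx = [rng.randrange(len(X)) for _ in X]
    return [X[i] for i in idx], [y[i] for i in idx]


def sample_shapelet(X, min_len, max_len, rng):
    """Randomize the shapelets: random series, random length, random start."""
    series = rng.choice(X)
    length = rng.randint(min_len, min(max_len, len(series)))
    start = rng.randint(0, len(series) - length)
    return series[start:start + length]


def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_random_split(X, y, n_candidates, rng, min_len=2, max_len=4):
    """Evaluate only a few random shapelets instead of enumerating all."""
    best = None
    for _ in range(n_candidates):
        shapelet = sample_shapelet(X, min_len, max_len, rng)
        dists = [shapelet_distance(x, shapelet) for x in X]
        for threshold in sorted(set(dists)):
            left = [c for d, c in zip(dists, y) if d <= threshold]
            right = [c for d, c in zip(dists, y) if d > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, shapelet, threshold)
    return best


rng = random.Random(0)
X = [[0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
y = ["bump", "bump", "flat", "flat"]
X_boot, y_boot = bootstrap(X, y, rng)
score, shapelet, threshold = best_random_split(X, y, n_candidates=10, rng=rng)
```

Because only `n_candidates` shapelets are scored per node, the cost of a split no longer depends on enumerating every subsequence, which is the intuition behind the speed-up described above.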

Keywords
Multivariate time series, Time series classification, Time series shapelets, Decision trees, Ensemble methods
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-135052 (URN)
10.1007/s10618-016-0473-y (DOI)
000382010500004 ()
Conference
European Conference on Machine Learning and Principles and Practice of Knowledge Discovery, Riva del Garda, Italy, September 19-23, 2016
Available from: 2016-10-31 Created: 2016-10-31 Last updated: 2022-02-28 Bibliographically approved
7. Predicting Adverse Drug Events using Heterogeneous Event Sequences
2016 (English). In: 2016 IEEE International Conference on Healthcare Informatics (ICHI), IEEE Computer Society, 2016, p. 356-362. Conference paper, Published paper (Refereed)
Abstract [en]

Adverse drug events (ADEs) are known to be severely under-reported in electronic health record (EHR) systems. One approach to mitigate this problem is to employ machine learning methods to detect and signal for potentially missing ADEs, with the aim of increasing reporting rates. There are, however, many challenges involved in constructing prediction models for this task, since data present in health care records are heterogeneous, high-dimensional, sparse and temporal. Previous approaches typically employ bag-of-items representations of clinical events that are present in a record, ignoring the temporal aspects. In this paper, we study the problem of classifying heterogeneous and multivariate event sequences using a novel algorithm building on the well-known concept of ensemble learning. The proposed approach is empirically evaluated using 27 datasets extracted from a real EHR database with different ADEs present. The results indicate that the proposed approach, which explicitly models the temporal nature of clinical data, can be expected to outperform, in terms of the trade-off between precision and specificity, models that do not consider the temporal aspects.

Place, publisher, year, edition, pages
IEEE Computer Society, 2016
Keywords
Adverse drug events, temporal patterns, data series, ensemble methods, random forest
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-135439 (URN)
10.1109/ICHI.2016.64 (DOI)
000391422100065 ()
978-1-5090-6117-4 (ISBN)
Conference
IEEE International Conference on Health Care Informatics, Chicago, Illinois, USA, October 4-7, 2016
Available from: 2016-11-08 Created: 2016-11-08 Last updated: 2022-02-28 Bibliographically approved
8. Early Random Shapelet Forest
2016 (English). In: Discovery Science: 19th International Conference, DS 2016, Bari, Italy, October 19–21, 2016, Proceedings / [ed] Toon Calders, Michelangelo Ceci, Donato Malerba, Springer, 2016, p. 261-276. Conference paper, Published paper (Refereed)
Abstract [en]

Early classification of time series has emerged as an increasingly important and challenging problem within signal processing, especially in domains where timely decisions are critical, such as medical diagnosis in healthcare. Shapelets, i.e., discriminative subsequences, have been proposed for time series classification as a means to capture local and phase-independent information. Recently, forests of randomized shapelet trees have been shown to produce state-of-the-art predictive performance at a low computational cost. In this work, they are extended to allow for early classification of time series. An extensive empirical investigation is presented, showing that the proposed algorithm is superior to alternative state-of-the-art approaches when predictive performance is considered to be more important than earliness. The algorithm allows for tuning the trade-off between accuracy and earliness, thereby supporting the generation of early classifiers that can be dynamically adapted to specific needs at low computational cost.
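
One generic way to expose a trade-off between accuracy and earliness is to classify growing prefixes of a series and stop as soon as the predicted class probability reaches a confidence threshold. The sketch below illustrates that general idea only; it is not the mechanism of the paper, and `classify_early`, `toy_predict_proba` and the threshold values are hypothetical names and settings introduced for this example.

```python
def classify_early(series, predict_proba, threshold, min_length=1):
    """Return (label, fraction of the series observed before deciding)."""
    for t in range(min_length, len(series) + 1):
        probs = predict_proba(series[:t])
        label = max(probs, key=probs.get)
        if probs[label] >= threshold:
            break  # confident enough: stop reading the series
    # If the threshold is never reached, the full series has been used.
    return label, t / len(series)


def toy_predict_proba(prefix):
    """Hypothetical stand-in for a trained model; assumes values in [0, 1]."""
    p_spike = min(0.5 + 0.5 * max(prefix), 1.0)
    return {"spike": p_spike, "flat": 1.0 - p_spike}


series = [0.0, 0.1, 0.2, 0.9, 0.1, 0.0]

# A strict threshold waits for stronger evidence (4 of 6 points observed).
strict = classify_early(series, toy_predict_proba, threshold=0.9)

# A lenient threshold decides earlier (3 of 6 points observed).
lenient = classify_early(series, toy_predict_proba, threshold=0.58)
```

Raising the threshold delays the decision in exchange for more evidence, and lowering it does the opposite, so a single parameter moves the classifier along the accuracy versus earliness trade-off.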

Place, publisher, year, edition, pages
Springer, 2016
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 9956
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-135411 (URN)
10.1007/978-3-319-46307-0_17 (DOI)
000388252100017 ()
978-3-319-46306-3 (ISBN)
978-3-319-46307-0 (ISBN)
Conference
19th International Conference, DS 2016, Bari, Italy, October 19–21, 2016
Available from: 2016-11-08 Created: 2016-11-08 Last updated: 2022-02-28 Bibliographically approved

By author/editor
Karlsson, Isak