Order in the random forest
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2017 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In many domains, repeated measurements are systematically collected to obtain the characteristics of objects or situations that evolve over time or other logical orderings. Although the classification of such data series shares many similarities with traditional multidimensional classification, inducing accurate machine learning models using traditional algorithms is typically infeasible since the order of the values must be considered.

In this thesis, the challenges related to inducing predictive models from data series using a class of algorithms known as random forests are studied for the purpose of efficiently and effectively classifying (i) univariate, (ii) multivariate and (iii) heterogeneous data series either directly in their sequential form or indirectly as transformed to sparse and high-dimensional representations. In the thesis, methods are developed to address the challenges of (a) handling sparse and high-dimensional data, (b) data series classification and (c) early time series classification using random forests. The proposed algorithms are empirically evaluated in large-scale experiments and practically evaluated in the context of detecting adverse drug events.

In the first part of the thesis, it is demonstrated that minor modifications to the random forest algorithm and the use of a random projection technique can improve the effectiveness of random forests when faced with discrete data series projected to sparse and high-dimensional representations. In the second part of the thesis, an algorithm for inducing random forests directly from univariate, multivariate and heterogeneous data series using phase-independent patterns is introduced and shown to be highly effective in terms of both computational and predictive performance. Then, leveraging the notion of phase-independent patterns, the random forest is extended to allow for early classification of time series and is shown to perform favorably when compared to alternatives. The conclusions of the thesis not only reaffirm the empirical effectiveness of random forests for traditional multidimensional data but also indicate that the random forest framework can, with success, be extended to sequential data representations.

Place, publisher, year, edition, pages
Stockholm: Department of Computer and Systems Sciences, Stockholm University, 2017. 76 p.
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 17-004
Keywords [en]
Machine learning, random forest, ensemble, time series, data series, sequential data, sparse data, high-dimensional data
National subject category
Computer and Information Sciences
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-142052
ISBN: 978-91-7649-827-9 (print)
ISBN: 978-91-7649-828-6 (digital)
OAI: oai:DiVA.org:su-142052
DiVA: diva2:1090364
Public defence
2017-06-08, L30, NOD-huset, Borgarfjordsgatan 12, Stockholm, 13:00 (English)
Opponent
Supervisors
Funding agency
Stiftelsen för strategisk forskning (SSF), IIS11-0053
Available from: 2017-05-16. Created: 2017-04-24. Last updated: 2017-05-15. Bibliographically approved.
List of papers
1. Predicting Adverse Drug Events by Analyzing Electronic Patient Records
2013 (English). In: Artificial Intelligence in Medicine: 14th Conference on Artificial Intelligence in Medicine, AIME 2013. Proceedings / [ed] Niels Peek, Roque Marín Morales, Mor Peleg, Springer Berlin/Heidelberg, 2013, Vol. 7885, pp. 125-129. Conference paper, Published paper (Refereed)
Abstract [en]

Diagnosis codes for adverse drug events (ADEs) are sometimes missing from electronic patient records (EPRs). In the worst case, this may affect not only patient safety but also the number of reported ADEs, resulting in incorrect risk estimates of prescribed drugs. Large databases of EPRs are potentially valuable sources of information to support the identification of ADEs. This study investigates the use of machine learning for predicting one specific ADE based on information extracted from EPRs, including age, gender, diagnoses and drugs. Several predictive models are developed and evaluated using different learning algorithms and feature sets. The highest observed AUC is 0.87, obtained by the random forest algorithm. The resulting model can be used for screening EPRs that are not, but possibly should be, assigned a diagnosis code for the ADE under consideration. Preliminary results from using the model are presented.

Place, publisher, year, edition, pages
Springer Berlin/Heidelberg, 2013
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 7885
Keywords
machine learning, electronic patient records, adverse drug events
National subject category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-96366 (URN)
10.1007/978-3-642-38326-7_19 (DOI)
978-3-642-38325-0 (ISBN)
978-3-642-38326-7 (ISBN)
Conference
14th Conference on Artificial Intelligence in Medicine, AIME 2013, Murcia, Spain, May 29 – June 1, 2013
Available from: 2013-11-20. Created: 2013-11-20. Last updated: 2017-04-24. Bibliographically approved.
2. Dimensionality Reduction with Random Indexing: An Application on Adverse Drug Event Detection using Electronic Health Records
2014 (English). In: IEEE 27th International Symposium on Computer-Based Medical Systems, New York: IEEE Computer Society, 2014, pp. 304-307. Conference paper, Published paper (Refereed)
Abstract [en]

Although electronic health records (EHRs) have recently become an important data source for detecting drug safety signals, which are usually evaluated in clinical trials, the use of such data is often hindered by their dimensionality and by the available computing resources. Several methods for reducing dimensionality have been developed, used and evaluated within the medical domain. While these methods perform well, their computational cost tends to increase with growing dimensionality. An alternative solution is random indexing, a technique commonly employed in text classification to reduce the dimensionality of large and sparse documents. This study explores how the predictive performance of random forest is affected when random indexing is used for dimensionality reduction in predicting adverse drug events (ADEs). Data are extracted from EHRs and the task is to predict whether or not a patient should be assigned an ADE-related diagnosis code. Four different dimensionality settings are investigated, and their sensitivity, specificity and area under the ROC curve are reported for 14 data sets. The results show that, for the investigated data sets, the predictive performance is not negatively affected by dimensionality reduction, while the computational cost is significantly reduced. The study therefore concludes that applying random indexing to EHR data reduces the computational cost while retaining the predictive performance.
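
The random indexing step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the target dimensionality of 500, the four non-zero entries per index vector and the example record are all assumptions made for illustration. Each original feature is assigned a sparse random vector of +1 and -1 entries, and a record is reduced by summing the index vectors of the features it contains.

```python
import random

def make_index_vectors(n_features, dim, nonzeros, seed=0):
    """Assign each original feature a sparse ternary index vector.

    Each vector is stored as (position, sign) pairs: `nonzeros` distinct
    random positions, half set to +1 and the rest to -1.
    """
    rng = random.Random(seed)
    vectors = []
    for _ in range(n_features):
        positions = rng.sample(range(dim), nonzeros)
        signs = [1] * (nonzeros // 2) + [-1] * (nonzeros - nonzeros // 2)
        rng.shuffle(signs)
        vectors.append(list(zip(positions, signs)))
    return vectors

def project(sparse_row, index_vectors, dim):
    """Reduce one sparse record, given as {feature_id: count}, to `dim`
    values by summing the index vectors of its non-zero features."""
    reduced = [0.0] * dim
    for feature_id, count in sparse_row.items():
        for position, sign in index_vectors[feature_id]:
            reduced[position] += sign * count
    return reduced

# Hypothetical record: 3 of 10,000 possible codes are present.
index_vectors = make_index_vectors(n_features=10_000, dim=500, nonzeros=4)
record = {17: 1, 4203: 2, 9999: 1}
reduced = project(record, index_vectors, dim=500)
```

The cost of projecting a record depends only on its number of non-zero features, not on the original dimensionality, which is one reason such a projection is cheap for sparse data.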

Place, publisher, year, edition, pages
New York: IEEE Computer Society, 2014
Series
IEEE International Symposium on Computer-Based Medical Systems, ISSN 1063-7125
Keywords
dimensionality reduction, random forest, random indexing, electronic health records, adverse drug events
National subject category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-110975 (URN)
10.1109/CBMS.2014.22 (DOI)
000345222200060 ()
978-1-4799-4435-4 (ISBN)
Conference
27th IEEE International Symposium on Computer-Based Medical Systems (CBMS), New York, USA, May 27-29, 2014
Available from: 2014-12-19. Created: 2014-12-19. Last updated: 2017-04-24. Bibliographically approved.
3. Handling Sparsity with Random Forests when Predicting Adverse Drug Events from Electronic Health Records
2014 (English). In: IEEE International Conference on Healthcare Informatics (ICHI): Proceedings, IEEE Computer Society, 2014, pp. 17-22. Conference paper, Published paper (Refereed)
Abstract [en]

When using electronic health record (EHR) data to build models for predicting adverse drug events (ADEs), one typically faces the problem of data sparsity, i.e., drugs and diagnosis codes that could be used for predicting a certain ADE are absent for most observations. For such tasks, the ability of the employed machine learning technique to handle sparsity effectively is crucial. The state-of-the-art random forest algorithm is frequently employed to handle this type of data. It has, however, recently been demonstrated that the algorithm is biased towards the majority class, which may result in low predictive performance on EHR data with large numbers of sparse features. In this study, approaches to handle this problem are empirically evaluated using 14 ADE datasets and three performance metrics: F1-score, AUC and Brier score. Two resampling-based techniques are investigated and compared to two baseline approaches. The experimental results indicate that, for larger forests, the resampling methods outperform the baseline approaches when considering F1-score, which is consistent with the metric being affected by class bias. The approaches perform on a similar level with respect to AUC, which can be explained by the metric not being sensitive to class bias. Finally, when considering the squared error (Brier score) of individual predictions, one of the baseline approaches turns out to be ahead of the others. A bias-variance analysis shows that this is an effect of the individual trees being more correct on average for the baseline approach and that this outweighs the expected loss from a lower variance. The main conclusion is that the suggested choice of approach to handle sparsity is highly dependent on the performance metric, or the task, of interest. If the task is to accurately assign an ADE to a patient record, a sampling-based approach is recommended. If the task is to rank patients according to risk of a certain ADE, the choice of approach is of minor importance. Finally, if the task is to accurately assign probabilities for a certain ADE, then one of the baseline approaches is recommended.
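
To make the difference between the three metrics concrete, the following sketch computes them on a small made-up example in which the predicted probabilities rank every positive above every negative but are pulled towards the majority class. The data and function names are illustrative assumptions and are not taken from the paper.

```python
def f1_score(y_true, y_prob, threshold=0.5):
    """F1 for the positive class after thresholding the probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, y_prob):
    """Probability that a random positive is ranked above a random
    negative (ties count as one half)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, y_prob):
    """Mean squared error of the predicted probabilities."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

# Made-up example: 2 positives among 8 records, with probabilities that
# rank perfectly but never reach 0.5 for the minority class.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.45, 0.40, 0.30, 0.20, 0.10, 0.10, 0.05, 0.05]

print(f1_score(y_true, y_prob))   # 0.0: no record passes the 0.5 threshold
print(auc(y_true, y_prob))        # 1.0: the ranking itself is perfect
print(brier_score(y_true, y_prob))
```

In this toy case the ranking metric is unaffected by the bias, while the thresholded metric collapses, which mirrors the abstract's observation that the preferred approach depends on the metric of interest.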

Place, publisher, year, edition, pages
IEEE Computer Society, 2014
National subject category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-110994 (URN)
10.1109/ICHI.2014.10 (DOI)
978-1-4799-5701-9 (ISBN)
Conference
IEEE International Conference on Healthcare Informatics, Verona, Italy, 15-17 September 2014
Available from: 2014-12-19. Created: 2014-12-19. Last updated: 2017-04-28. Bibliographically approved.
4. Forests of Randomized Shapelet Trees
2015 (English). In: Statistical Learning and Data Sciences: Proceedings / [ed] Alexander Gammerman, Vladimir Vovk, Harris Papadopoulos, Springer, 2015, pp. 126-136. Conference paper, Published paper (Refereed)
Abstract [en]

Shapelets have recently been proposed for data series classification, due to their ability to capture phase-independent and local information. Decision trees based on shapelets have been shown to provide not only interpretable models, but also, in many cases, state-of-the-art predictive performance. Shapelet discovery is, however, computationally costly, and although several techniques for speeding it up have been proposed, the computational cost is still prohibitive in many cases. In this work, an ensemble-based method, referred to as Random Shapelet Forest (RSF), is proposed, which builds on the success of the random forest algorithm, and which is shown to have a lower computational complexity than the original shapelet tree learning algorithm. An extensive empirical investigation shows that the algorithm provides competitive predictive performance and that a proposed way of calculating importance scores can be used to successfully identify influential regions.
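
As an illustration of why shapelets capture phase-independent and local information, the sketch below computes the commonly used shapelet distance: the smallest Euclidean distance between a shapelet and any equally long window of a series. The function name and the example series are assumptions made for illustration, and details such as normalisation, which published methods may apply, are omitted.

```python
import math

def shapelet_distance(shapelet, series):
    """Smallest Euclidean distance between `shapelet` and any subsequence
    of `series` that has the same length."""
    m = len(shapelet)
    if m > len(series):
        raise ValueError("shapelet is longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        d = math.sqrt(sum((w - s) ** 2 for w, s in zip(window, shapelet)))
        best = min(best, d)
    return best

# Made-up series: the same bump occurs early, late, or not at all.
bump = [0.0, 1.0, 0.0]
early = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
late = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
flat = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(shapelet_distance(bump, early))  # 0.0
print(shapelet_distance(bump, late))   # 0.0, the position does not matter
print(shapelet_distance(bump, flat))   # 1.0
```

Because the distance depends only on the best-matching window, a tree node that thresholds it separates series by whether a pattern occurs, not by where it occurs.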

Place, publisher, year, edition, pages
Springer, 2015
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 9047
Keywords
data series classification, shapelets, decision trees, ensemble
National subject category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122820 (URN)
10.1007/978-3-319-17091-6_8 (DOI)
978-3-319-17091-6 (ISBN)
Conference
Third International Symposium, SLDS 2015, Egham, UK, April 20-23, 2015
Available from: 2015-11-11. Created: 2015-11-10. Last updated: 2017-04-28. Bibliographically approved.
5. Multi-channel ECG classification using forests of randomized shapelet trees
2015 (English). In: Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive Environments, Association for Computing Machinery (ACM), 2015, 43. Conference paper, Published paper (Refereed)
Abstract [en]

Data series of multiple channels occur at high rates and in massive quantities in several application domains, such as healthcare. In this paper, we study the problem of multi-channel ECG classification. We map this problem to multivariate data series classification and propose five methods for solving it, using a split-and-combine approach. The proposed framework is evaluated using three base-classifiers on real-world data for detecting myocardial infarction. Extensive experiments are performed on real ECG data extracted from the Physiobank data repository. Our findings emphasize the importance of selecting an appropriate base-classifier for multivariate data series classification, while demonstrating the superiority of the Random Shapelet Forest (0.825 accuracy) over competitor methods (0.664 accuracy for 1-NN under cDTW).

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2015
Keywords
Data series, classification, multi-channel
National subject category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122846 (URN)
10.1145/2769493.2769520 (DOI)
978-1-4503-3452-5 (ISBN)
Conference
PETRA '15, 8th Pervasive Technologies Related to Assistive Environments, Corfu, Greece, July 01 - 03, 2015
Available from: 2015-11-11. Created: 2015-11-10. Last updated: 2017-04-28. Bibliographically approved.
6. Generalized random shapelet forests
2016 (English). In: Data mining and knowledge discovery, ISSN 1384-5810, E-ISSN 1573-756X, Vol. 30, no. 5, pp. 1053-1085. Article in journal (Refereed). Published
Abstract [en]

Shapelets are discriminative subsequences of time series, usually embedded in shapelet-based decision trees. The enumeration of time series shapelets is, however, computationally costly, which in addition to the inherent difficulty of the decision tree learning algorithm to effectively handle high-dimensional data, severely limits the applicability of shapelet-based decision tree learning from large (multivariate) time series databases. This paper introduces a novel tree-based ensemble method for univariate and multivariate time series classification using shapelets, called the generalized random shapelet forest algorithm. The algorithm generates a set of shapelet-based decision trees, where both the choice of instances used for building a tree and the choice of shapelets are randomized. For univariate time series, it is demonstrated through an extensive empirical investigation that the proposed algorithm yields predictive performance comparable to the current state-of-the-art and significantly outperforms several alternative algorithms, while being at least an order of magnitude faster. Similarly for multivariate time series, it is shown that the algorithm is significantly less computationally costly and more accurate than the current state-of-the-art.
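
The two sources of randomness named in this abstract, random choice of instances and random choice of shapelets, can be sketched as follows. This is a simplified illustration for univariate series of equal length, not the published algorithm: the Gini-based split criterion, the parameter names (`n_trees`, `n_shapelets`) and the example data are assumptions.

```python
import math
import random
from collections import Counter

def shapelet_distance(shapelet, series):
    """Smallest Euclidean distance from `shapelet` to any equally long window."""
    m = len(shapelet)
    return min(
        math.sqrt(sum((series[s + i] - shapelet[i]) ** 2 for i in range(m)))
        for s in range(len(series) - m + 1)
    )

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_random_split(X, y, n_shapelets, rng):
    """Try a few randomly sampled shapelets and return the
    (impurity, shapelet, threshold) triple with the lowest weighted Gini."""
    best = None
    for _ in range(n_shapelets):
        source = X[rng.randrange(len(X))]
        length = rng.randint(2, len(source))
        start = rng.randrange(len(source) - length + 1)
        shapelet = source[start:start + length]
        dists = [shapelet_distance(shapelet, x) for x in X]
        values = sorted(set(dists))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [label for d, label in zip(dists, y) if d <= thr]
            right = [label for d, label in zip(dists, y) if d > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, shapelet, thr)
    return best

def build_tree(X, y, n_shapelets, rng):
    """Grow one randomized shapelet tree; leaves are class labels."""
    majority = Counter(y).most_common(1)[0][0]
    if len(set(y)) == 1:
        return majority
    split = best_random_split(X, y, n_shapelets, rng)
    if split is None:
        return majority
    _, shapelet, thr = split
    goes_left = [shapelet_distance(shapelet, x) <= thr for x in X]
    if all(goes_left) or not any(goes_left):
        return majority
    left = [(x, c) for x, c, g in zip(X, y, goes_left) if g]
    right = [(x, c) for x, c, g in zip(X, y, goes_left) if not g]
    return {"shapelet": shapelet, "threshold": thr,
            "left": build_tree(*map(list, zip(*left)), n_shapelets, rng),
            "right": build_tree(*map(list, zip(*right)), n_shapelets, rng)}

def fit_forest(X, y, n_trees=25, n_shapelets=10, seed=0):
    """Fit each tree on a bootstrap sample of the training series."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 n_shapelets, rng))
    return forest

def predict(forest, series):
    """Majority vote over the trees."""
    votes = Counter()
    for node in forest:
        while isinstance(node, dict):
            near = shapelet_distance(node["shapelet"], series) <= node["threshold"]
            node = node["left"] if near else node["right"]
        votes[node] += 1
    return votes.most_common(1)[0][0]

# Made-up data: "bump" series contain a short peak at varying positions.
X = [[0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0], [0.1, 0, 0, 0, 0.1, 0], [0, 0, 0.1, 0, 0, 0]]
y = ["bump", "bump", "bump", "flat", "flat", "flat"]
forest = fit_forest(X, y)
label = predict(forest, [0, 0, 0, 0, 1, 0])
```

Evaluating only a handful of sampled shapelets per node, instead of enumerating all candidates, is what keeps such a tree cheap to grow; the ensemble vote is then relied on to compensate for the weaker individual trees.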

Keywords
Multivariate time series, Time series classification, Time series shapelets, Decision trees, Ensemble methods
National subject category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-135052 (URN)
10.1007/s10618-016-0473-y (DOI)
000382010500004 ()
Conference
European Conference on Machine Learning and Principles and Practice of Knowledge Discovery, Riva del Garda, Italy, September 19-23, 2016
Available from: 2016-10-31. Created: 2016-10-31. Last updated: 2017-04-28. Bibliographically approved.
7. Predicting Adverse Drug Events using Heterogeneous Event Sequences
2016 (English). In: 2016 IEEE International Conference on Healthcare Informatics (ICHI), IEEE Computer Society, 2016, pp. 356-362. Conference paper, Published paper (Refereed)
Abstract [en]

Adverse drug events (ADEs) are known to be severely under-reported in electronic health record (EHR) systems. One approach to mitigate this problem is to employ machine learning methods to detect and signal for potentially missing ADEs, with the aim of increasing reporting rates. There are, however, many challenges involved in constructing prediction models for this task, since data present in health care records is heterogeneous, high-dimensional, sparse and temporal. Previous approaches typically employ bag-of-items representations of clinical events that are present in a record, ignoring the temporal aspects. In this paper, we study the problem of classifying heterogeneous and multivariate event sequences using a novel algorithm building on the well-known concept of ensemble learning. The proposed approach is empirically evaluated using 27 datasets extracted from a real EHR database with different ADEs present. The results indicate that the proposed approach, which explicitly models the temporal nature of clinical data, can be expected to outperform, in terms of the trade-off between precision and specificity, models that do not consider the temporal aspects.

Place, publisher, year, edition, pages
IEEE Computer Society, 2016
Keywords
Adverse drug events, temporal patterns, data series, ensemble methods, random forest
National subject category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-135439 (URN)
10.1109/ICHI.2016.64 (DOI)
978-1-5090-6117-4 (ISBN)
Conference
IEEE International Conference on Health Care Informatics, Chicago, Illinois, USA, October 4-7, 2016
Available from: 2016-11-08. Created: 2016-11-08. Last updated: 2017-04-28. Bibliographically approved.
8. Early Random Shapelet Forest
2016 (English). In: Discovery Science: 19th International Conference, DS 2016, Bari, Italy, October 19–21, 2016, Proceedings / [ed] Toon Calders, Michelangelo Ceci, Donato Malerba, Springer, 2016, pp. 261-276. Conference paper, Published paper (Refereed)
Abstract [en]

Early classification of time series has emerged as an increasingly important and challenging problem within signal processing, especially in domains where timely decisions are critical, such as medical diagnosis in healthcare. Shapelets, i.e., discriminative subsequences, have been proposed for time series classification as a means to capture local and phase-independent information. Recently, forests of randomized shapelet trees have been shown to produce state-of-the-art predictive performance at a low computational cost. In this work, they are extended to allow for early classification of time series. An extensive empirical investigation is presented, showing that the proposed algorithm is superior to alternative state-of-the-art approaches when predictive performance is considered to be more important than earliness. The algorithm allows for tuning the trade-off between accuracy and earliness, thereby supporting the generation of early classifiers that can be dynamically adapted to specific needs at low computational cost.
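
The trade-off between accuracy and earliness can be illustrated with a generic prefix-based scheme: a decision is emitted as soon as the model's confidence on the observed prefix reaches a threshold. This sketch is not the algorithm proposed in the paper; `classify_early`, `toy_proba`, the threshold values and the example series are hypothetical, and `toy_proba` merely stands in for a trained model.

```python
def classify_early(series, predict_proba, threshold, min_length=1):
    """Return (label, fraction of the series observed) for the first prefix
    on which the most probable label reaches `threshold`; fall back to the
    decision on the full series if that never happens."""
    if len(series) < min_length:
        raise ValueError("series is shorter than min_length")
    for t in range(min_length, len(series) + 1):
        probs = predict_proba(series[:t])
        label = max(probs, key=probs.get)
        if probs[label] >= threshold or t == len(series):
            return label, t / len(series)

def toy_proba(prefix):
    """Made-up stand-in for a trained model: confidence in 'spike' equals
    the largest value seen so far, clipped to [0, 1]."""
    p_spike = min(1.0, max(0.0, max(prefix)))
    return {"spike": p_spike, "flat": 1.0 - p_spike}

series = [0.1, 0.2, 0.9, 0.1, 0.0]

# A moderate threshold yields a decision after 3 of 5 observations.
print(classify_early(series, toy_proba, threshold=0.8, min_length=3))
# ('spike', 0.6)

# A stricter threshold is never reached, so the full series is used.
print(classify_early(series, toy_proba, threshold=0.95, min_length=3))
# ('spike', 1.0)
```

Raising the threshold delays the decision in exchange for more evidence, which is the kind of tuning knob the abstract refers to.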

Place, publisher, year, edition, pages
Springer, 2016
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 9956
National subject category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-135411 (URN)
10.1007/978-3-319-46307-0_17 (DOI)
978-3-319-46306-3 (ISBN)
978-3-319-46307-0 (ISBN)
Conference
19th International Conference, DS 2016, Bari, Italy, October 19–21, 2016
Available from: 2016-11-08. Created: 2016-11-08. Last updated: 2017-04-28. Bibliographically approved.

Open Access in DiVA

Order in the random forest (1451 kB), 124 downloads
File information
File name: FULLTEXT01.pdf. File size: 1451 kB. Checksum (SHA-512):
9c59f81f2e4f3cb7017329df888a87e99f29e310f281b8c48853ed11da047e07af95266d6ef872c9b60b5848c7e1cf462d4b0a2e13debe14676eca0e67417ce0
Type: fulltext. Mimetype: application/pdf

Search further in DiVA

By author/editor
Karlsson, Isak
By organisation
Department of Computer and Systems Sciences
Computer and Information Sciences
