Ändra sökning
RefereraExporteraLänk till posten
Permanent länk

Direktlänk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf
Handling Sparsity with Random Forests when Predicting Adverse Drug Events from Electronic Health Records
Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
2014 (Engelska)Ingår i: IEEE International Conference on Healthcare Informatics (ICHI): Proceedings, IEEE Computer Society, 2014, 17-22 s.Konferensbidrag, Publicerat paper (Refereegranskat)
Abstract [en]

When using electronic health record (EHR) data to build models for predicting adverse drug effects (ADEs), one is typically facing the problem of data sparsity, i.e., drugs and diagnosis codes that could be used for predicting a certain ADE are absent for most observations. For such tasks, the ability to effectively handle sparsity by the employed machine learning technique is crucial. The state-of-the-art random forest algorithm is frequently employed to handle this type of data. It has however recently been demonstrated that the algorithm is biased towards the majority class, which may result in a low predictive performance on EHR data with large numbers of sparse features. In this study, approaches to handle this problem are empirically evaluated using 14 ADE datasets and three performance metrics; F1-score, AUC and Brier score. Two resampling based techniques are investigated and compared to two baseline approaches. The experimental results indicate that, for larger forests, the resampling methods outperform the baseline approaches when considering F1-score, which is consistent with the metric being affected by class bias. The approaches perform on a similar level with respect to AUC, which can be explained by the metric not being sensitive to class bias. Finally, when considering the squared error (Brier score) of individual predictions, one of the baseline approaches turns out to be ahead of the others. A bias-variance analysis shows that this is an effect of the individual trees being more correct on average for the baseline approach and that this outweighs the expected loss from a lower variance. The main conclusion is that the suggested choice of approach to handle sparsity is highly dependent on the performance metric, or the task, of interest. If the task is to accurately assign an ADE to a patient record, a sampling based approach is recommended. If the task is to rank patients according to risk of a certain ADE, the choice of approach is of minor importance. Finally, if the task is to accurately assign probabilities for a certain ADE, then one of the baseline approaches is recommended.

Ort, förlag, år, upplaga, sidor
IEEE Computer Society, 2014. 17-22 s.
Nationell ämneskategori
Systemvetenskap, informationssystem och informatik
Forskningsämne
data- och systemvetenskap
Identifikatorer
URN: urn:nbn:se:su:diva-110994DOI: 10.1109/ICHI.2014.10ISBN: 978-1-4799-5701-9 (digital)OAI: oai:DiVA.org:su-110994DiVA: diva2:773768
Konferens
IEEE International Conference on Healthcare Informatics, Verona, Italy, 15-17 September 2014
Tillgänglig från: 2014-12-19 Skapad: 2014-12-19 Senast uppdaterad: 2017-04-28Bibliografiskt granskad
Ingår i avhandling
1. Order in the random forest
Öppna denna publikation i ny flik eller fönster >>Order in the random forest
2017 (Engelska)Doktorsavhandling, sammanläggning (Övrigt vetenskapligt)
Abstract [en]

In many domains, repeated measurements are systematically collected to obtain the characteristics of objects or situations that evolve over time or other logical orderings. Although the classification of such data series shares many similarities with traditional multidimensional classification, inducing accurate machine learning models using traditional algorithms are typically infeasible since the order of the values must be considered.

In this thesis, the challenges related to inducing predictive models from data series using a class of algorithms known as random forests are studied for the purpose of efficiently and effectively classifying (i) univariate, (ii) multivariate and (iii) heterogeneous data series either directly in their sequential form or indirectly as transformed to sparse and high-dimensional representations. In the thesis, methods are developed to address the challenges of (a) handling sparse and high-dimensional data, (b) data series classification and (c) early time series classification using random forests. The proposed algorithms are empirically evaluated in large-scale experiments and practically evaluated in the context of detecting adverse drug events.

In the first part of the thesis, it is demonstrated that minor modifications to the random forest algorithm and the use of a random projection technique can improve the effectiveness of random forests when faced with discrete data series projected to sparse and high-dimensional representations. In the second part of the thesis, an algorithm for inducing random forests directly from univariate, multivariate and heterogeneous data series using phase-independent patterns is introduced and shown to be highly effective in terms of both computational and predictive performance. Then, leveraging the notion of phase-independent patterns, the random forest is extended to allow for early classification of time series and is shown to perform favorably when compared to alternatives. The conclusions of the thesis not only reaffirm the empirical effectiveness of random forests for traditional multidimensional data but also indicate that the random forest framework can, with success, be extended to sequential data representations.

Ort, förlag, år, upplaga, sidor
Stockholm: Department of Computer and Systems Sciences, Stockholm University, 2017. 76 s.
Serie
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 17-004
Nyckelord
Machine learning, random forest, ensemble, time series, data series, sequential data, sparse data, high-dimensional data
Nationell ämneskategori
Data- och informationsvetenskap
Forskningsämne
data- och systemvetenskap
Identifikatorer
urn:nbn:se:su:diva-142052 (URN)978-91-7649-827-9 (ISBN)978-91-7649-828-6 (ISBN)
Disputation
2017-06-08, L30, NOD-huset, Borgarfjordsgatan 12, Stockholm, 13:00 (Engelska)
Opponent
Handledare
Forskningsfinansiär
Stiftelsen för strategisk forskning (SSF), IIS11-0053
Tillgänglig från: 2017-05-16 Skapad: 2017-04-24 Senast uppdaterad: 2017-05-15Bibliografiskt granskad

Open Access i DiVA

Fulltext saknas

Övriga länkar

Förlagets fulltext

Sök vidare i DiVA

Av författaren/redaktören
Karlsson, IsakBoström, Henrik
Av organisationen
Institutionen för data- och systemvetenskap
Systemvetenskap, informationssystem och informatik

Sök vidare utanför DiVA

GoogleGoogle Scholar

Altmetricpoäng

Totalt: 126 träffar
RefereraExporteraLänk till posten
Permanent länk

Direktlänk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf