Learning Predictive Models from Electronic Health Records
Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
2017 (English). Doctoral thesis, comprising articles (Other academic)
Abstract [en]

The ongoing digitization of healthcare, which has been much accelerated by the widespread adoption of electronic health records, generates unprecedented amounts of clinical data in a readily computable form. This, in turn, affords great opportunities for making meaningful secondary use of clinical data in the endeavor to improve healthcare, as well as to support epidemiology and medical research. To that end, there is a need for techniques capable of effectively and efficiently analyzing large amounts of clinical data. While machine learning provides the necessary tools, learning effective predictive models from electronic health records comes with many challenges due to the complexity of the data. Electronic health records contain heterogeneous and longitudinal data that jointly provide a rich perspective of patient trajectories in the healthcare process. The diverse characteristics of the data need to be properly accounted for when learning predictive models from clinical data. However, how best to represent healthcare data for predictive modeling has been insufficiently studied. This thesis addresses several of the technical challenges involved in learning effective predictive models from electronic health records.

Methods are developed to address the challenges of (i) representing heterogeneous types of data, (ii) leveraging the concept hierarchy of clinical codes, and (iii) modeling the temporality of clinical events. The proposed methods are evaluated empirically in the context of detecting adverse drug events in electronic health records. Various representations of each type of data that account for its unique characteristics are investigated and it is shown that combining multiple representations yields improved predictive performance. It is also demonstrated how the information embedded in the concept hierarchy of clinical codes can be exploited, both for creating enriched feature spaces and for decomposing the predictive task. Moreover, incorporating temporal information leads to more effective predictive models by distinguishing between event occurrences in the patient history. Both single-point representations, using pre-assigned or learned temporal weights, and multivariate time series representations are shown to be more informative than representations in which temporality is ignored. Effective methods for representing heterogeneous and longitudinal data are key for enhancing and truly enabling meaningful secondary use of electronic health records through large-scale analysis of clinical data.

Place, publisher, year, edition, pages
Stockholm: Department of Computer and Systems Sciences, Stockholm University, 2017. 82 p.
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 17-001
Keywords [en]
Data Science, Machine Learning, Predictive Modeling, Data Representation, Health Informatics, Electronic Health Records
National subject category
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-137936
ISBN: 978-91-7649-682-4 (print)
ISBN: 978-91-7649-683-1 (digital)
OAI: oai:DiVA.org:su-137936
DiVA: diva2:1067764
Public defence
2017-03-02, Lilla hörsalen, NOD-huset, Borgarfjordsgatan 12, Kista, 13:00 (English)
Opponent
Supervisor
Available from: 2017-02-07 Created: 2017-01-13 Last updated: 2017-02-08 Bibliographically approved
List of papers
1. Detecting Adverse Drug Events Using Concept Hierarchies of Clinical Codes
2014 (English). In: 2014 IEEE International Conference on Healthcare Informatics: Proceedings, IEEE Computer Society, 2014, p. 285-293. Conference paper, Published paper (Refereed)
Abstract [en]

Electronic health records (EHRs) provide a potentially valuable source of information for pharmacovigilance. However, adverse drug events (ADEs), which can be encoded in EHRs with specific diagnosis codes, are heavily under-reported. To provide more accurate estimates for drug safety surveillance, machine learning systems that are able to detect ADEs could be used to identify and suggest missing ADE-specific diagnosis codes. A fundamental consideration when building such systems is how to represent the EHR data to allow for accurate predictive modeling. In this study, two types of clinical code are used to represent drugs and diagnoses: the Anatomical Therapeutic Chemical Classification System (ATC) and the International Statistical Classification of Diseases and Related Health Problems (ICD). More specifically, it is investigated whether their hierarchical structure can be exploited to improve predictive performance. The use of random forests with feature sets that include only the original, low-level codes is compared to using random forests with feature sets that contain all levels in the hierarchies. An empirical investigation using thirty datasets with different ADE targets is presented, demonstrating that the predictive performance, in terms of accuracy and area under ROC curve, can be significantly improved by exploiting codes on all levels in the hierarchies, compared to using only the low-level encoding. A further analysis is presented in which two strategies are employed for adding features level-wise according to the concept hierarchies: top-down, starting with the highest abstraction levels, and bottom-up, starting with the most specific encoding. The main finding from this subsequent analysis is that predictive performance can be kept at a high level even without employing the more specific levels in the concept hierarchies.
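
The idea of adding every level of a code hierarchy to the feature set can be illustrated with a minimal sketch. It covers ATC codes only and assumes the standard ATC level lengths (1, 3, 4, 5 and 7 characters); the function names and the example patient are hypothetical and are not taken from the paper.

```python
from typing import Iterable, List, Set

# Character lengths of the five ATC levels, e.g. A, A10, A10B, A10BA, A10BA02.
ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)


def atc_ancestors(code: str) -> List[str]:
    """Return the code truncated to every ATC level it reaches, most general first."""
    code = code.strip().upper()
    return [code[:n] for n in ATC_LEVEL_LENGTHS if len(code) >= n]


def hierarchical_features(codes: Iterable[str]) -> Set[str]:
    """Build a binary feature set holding each code plus all of its ancestors."""
    features: Set[str] = set()
    for code in codes:
        features.update(atc_ancestors(code))
    return features


# Hypothetical patient: two fully specified drugs and one code given at level 4.
patient_drugs = ["A10BA02", "A10BB01", "C09AA"]
low_level_only = set(patient_drugs)
all_levels = hierarchical_features(patient_drugs)
print(sorted(all_levels))
```

With the low-level encoding the two diabetes drugs share no feature, whereas the expanded set gives them the common ancestors A, A10 and A10B, which is the kind of information a classifier can exploit.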

Place, publisher, year, edition, pages
IEEE Computer Society, 2014
Keywords
Clinical codes, concept hierarchy, electronic health records, adverse drug events, data mining
National subject category
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-110969 (URN); 10.1109/ICHI.2014.46 (DOI); 978-1-4799-5701-9 (ISBN)
Conference
IEEE International Conference on Healthcare Informatics, Verona, Italy, 15-17 September 2014
Available from: 2014-12-19 Created: 2014-12-19 Last updated: 2017-01-30 Bibliographically approved
2. Detecting Adverse Drug Events with Multiple Representations of Clinical Measurements
2014 (English). In: 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM): Proceedings, IEEE Computer Society, 2014, p. 536-543. Conference paper, Published paper (Refereed)
Abstract [en]

Adverse drug events (ADEs) are grossly under-reported in electronic health records (EHRs). This could be mitigated by methods that are able to detect ADEs in EHRs, thereby allowing for missing ADE-specific diagnosis codes to be identified and added. A crucial aspect of constructing such systems is to find proper representations of the data in order to allow the predictive modeling to be as accurate as possible. One category of EHR data that can be used as an indicator of ADEs is clinical measurements. However, using clinical measurements as features is not unproblematic, because they have a high rate of missing values and can be repeated a variable number of times in each patient health record. In this study, five basic representations of clinical measurements are proposed and evaluated to handle these two problems. An empirical investigation using random forest on 27 datasets from a real EHR database with different ADE targets is presented, demonstrating that the predictive performance, in terms of accuracy and area under ROC curve, is higher when representing clinical measurements crudely as whether they were taken or how many times they were taken by a patient. Furthermore, a sixth alternative, combining all five basic representations, significantly outperforms all but one of the basic representations. A subsequent analysis of variable importance is also conducted with this fused feature set, showing that when clinical measurements have a high missing rate, the number of times they were taken by one patient is ranked as more informative than looking at their actual values. The observation from random forest is also confirmed empirically using other commonly employed classifiers. This study demonstrates that the way in which clinical measurements from EHRs are presented has a high impact on ADE detection, and that using multiple representations outperforms using a basic representation.
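
A rough sketch of how repeated, often missing measurements might be turned into fixed-length features is given below. Only the "taken" and "count" representations are named in the abstract; the mean, first and last summaries, the function names and the example record are assumptions made purely for illustration.

```python
from statistics import mean
from typing import Dict, List, Optional


def represent_measurement(values: List[float]) -> Dict[str, Optional[float]]:
    """Summarise one patient's repeated readings of a single measurement.

    An empty list means the measurement was never taken: the value-based
    summaries are then None (missing), while 'taken' and 'count' stay defined.
    """
    taken = len(values) > 0
    return {
        "taken": 1.0 if taken else 0.0,          # was it measured at all?
        "count": float(len(values)),             # how many times?
        "mean": mean(values) if taken else None,  # assumed value-based summary
        "first": values[0] if taken else None,    # assumed value-based summary
        "last": values[-1] if taken else None,    # assumed value-based summary
    }


def fused_features(record: Dict[str, List[float]]) -> Dict[str, Optional[float]]:
    """Concatenate all representations of all measurements into one feature dict."""
    features: Dict[str, Optional[float]] = {}
    for name, values in record.items():
        for rep, value in represent_measurement(values).items():
            features[f"{name}_{rep}"] = value
    return features


# Hypothetical patient record: three creatinine readings, no HbA1c reading.
patient = {"creatinine": [88.0, 95.0, 110.0], "hba1c": []}
features = fused_features(patient)
print(features)
```

The "taken" and "count" features never contain missing values, which may be one reason such crude representations hold up well when the missing rate is high.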

Place, publisher, year, edition, pages
IEEE Computer Society, 2014
National subject category
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-110970 (URN); 10.1109/BIBM.2014.6999216 (DOI); 978-1-4799-5669-2 (ISBN)
Conference
Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference, Belfast, UK, 2-5 November, 2014
Available from: 2014-12-19 Created: 2014-12-19 Last updated: 2017-01-30 Bibliographically approved
3. Predictive modeling of structured electronic health records for adverse drug event detection
2015 (English). In: BMC Medical Informatics and Decision Making, ISSN 1472-6947, E-ISSN 1472-6947, Vol. 15, no. SI. Article in journal (Refereed) Published
Abstract [en]

Background: The digitization of healthcare data, resulting from the increasingly widespread adoption of electronic health records, has greatly facilitated its analysis by computational methods and thereby enabled large-scale secondary use thereof. This can be exploited to support public health activities such as pharmacovigilance, wherein the safety of drugs is monitored to inform regulatory decisions about sustained use. To that end, electronic health records have emerged as a potentially valuable data source, providing access to longitudinal observations of patient treatment and drug use. A nascent line of research concerns predictive modeling of healthcare data for the automatic detection of adverse drug events, which presents its own set of challenges: it is not yet clear how to represent the heterogeneous data types in a manner conducive to learning high-performing machine learning models.

Methods: Datasets from an electronic health record database are used for learning predictive models with the purpose of detecting adverse drug events. The use and representation of two data types, as well as their combination, are studied: clinical codes, describing prescribed drugs and assigned diagnoses, and measurements. Feature selection is conducted on the various types of data to reduce dimensionality and sparsity, while allowing for an in-depth feature analysis of the usefulness of each data type and representation.

Results: Within each data type, combining multiple representations yields better predictive performance compared to using any single representation. The use of clinical codes for adverse drug event detection significantly outperforms the use of measurements; however, there is no significant difference over datasets between using only clinical codes and their combination with measurements. For certain adverse drug events, the combination does, however, outperform using only clinical codes. Feature selection leads to increased predictive performance for both data types, in isolation and combined.

Conclusions: We have demonstrated how machine learning can be applied to electronic health records for the purpose of detecting adverse drug events and proposed solutions to some of the challenges this presents, including how to represent the various data types. Overall, clinical codes are more useful than measurements and, in specific cases, it is beneficial to combine the two.

Keywords
pharmacovigilance, adverse drug events, electronic health records, machine learning, random forest, feature selection
National subject category
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-123965 (URN); 10.1186/1472-6947-15-S4-S1 (DOI); 000367480700001 (ISI)
Conference
IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Belfast, Northern Ireland, November 02-05, 2014
Available from: 2015-12-09 Created: 2015-12-09 Last updated: 2017-01-30 Bibliographically approved
4. Cascading Adverse Drug Event Detection in Electronic Health Records
2015 (English). In: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA): Proceedings, IEEE Computer Society, 2015. Conference paper, Published paper (Refereed)
Abstract [en]

The ability to detect adverse drug events (ADEs) in electronic health records (EHRs) is useful in many medical applications, such as alerting systems that indicate when an ADE-specific diagnosis code should be assigned. Automating the detection of ADEs can be attempted by applying machine learning to existing, labeled EHR data. How to do this in an effective manner is, however, an open question. The issues addressed in this study concern the granularity of the classification task: (1) If we wish to predict the occurrence of an ADE, is it advantageous to conflate the various ADE class labels prior to learning, or should they be merged post prediction? (2) If we wish to predict a family of ADEs or even a specific ADE, can the predictive performance be enhanced by dividing the classification task into a cascading scheme: first predicting, on a coarse level, whether there is an ADE or not, then, if there is one, predicting which family the ADE belongs to, and finally predicting the specific ADE within that particular family? In this study, we conduct a series of experiments using a real, clinical dataset comprising healthcare episodes that have been assigned one of eight ADE-related diagnosis codes and a set of randomly extracted episodes that have not been assigned any ADE code. It is shown that, when distinguishing between ADEs and non-ADEs, merging the various ADE labels prior to learning leads to significantly higher predictive performance in terms of accuracy and area under ROC curve. A cascade of random forests is moreover constructed to determine either the family of ADEs or the specific class label; here, the performance is indeed enhanced compared to directly employing a one-step prediction. This study concludes that, if predictive performance is of primary importance, the cascading scheme should be the recommended approach over employing a one-step prediction for detecting ADEs in EHRs.
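
The control flow of such a cascading scheme can be sketched as follows. The three stages are plain callables here; in the study each stage is a random forest, whereas the threshold rules, label names and feature vectors below are hypothetical stand-ins chosen only to make the example runnable.

```python
from typing import Callable, Dict, Sequence

Features = Sequence[float]
Classifier = Callable[[Features], str]

NO_ADE = "no_ADE"


def cascade_predict(
    x: Features,
    is_ade: Callable[[Features], bool],
    family_of: Classifier,
    specific_of: Dict[str, Classifier],
) -> str:
    """Predict coarse to fine: ADE or not, then ADE family, then specific ADE."""
    if not is_ade(x):
        return NO_ADE
    family = family_of(x)
    specific = specific_of.get(family)
    # Fall back to the family label when no within-family model is available.
    return specific(x) if specific is not None else family


# Hypothetical stand-ins for trained models (simple threshold rules).
def toy_is_ade(x: Features) -> bool:
    return x[0] > 0.5


def toy_family(x: Features) -> str:
    return "family_A" if x[1] > 0.5 else "family_B"


def toy_specific_a(x: Features) -> str:
    return "ADE_A1" if x[2] > 0.5 else "ADE_A2"


models = {"family_A": toy_specific_a}
print(cascade_predict([0.9, 0.9, 0.1], toy_is_ade, toy_family, models))
```

Each later stage only ever sees cases that passed the earlier one, so it can be trained on a smaller, more homogeneous subset of the data.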

Place, publisher, year, edition, pages
IEEE Computer Society, 2015
Keywords
electronic health records, adverse drug events, predictive modeling, cascading
National subject category
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122795 (URN); 10.1109/DSAA.2015.7344869 (DOI); 978-1-4673-8272-4 (ISBN); 978-1-4673-8273-1 (ISBN)
Conference
2015 IEEE International Conference on Data Science and Advanced Analytics, Paris, France, 19-21 October, 2015
Available from: 2015-11-11 Created: 2015-11-10 Last updated: 2017-01-30 Bibliographically approved
5. Handling Temporality of Clinical Events for Drug Safety Surveillance
2015 (English). In: AMIA Annual Symposium Proceedings, ISSN 1559-4076, Vol. 2015, p. 1371-1380. Article in journal (Refereed) Published
Abstract [en]

Using longitudinal data in electronic health records (EHRs) for post-marketing adverse drug event (ADE) detection allows for monitoring patients throughout their medical history. Machine learning methods have been shown to be efficient and effective in screening health records and detecting ADEs. How best to exploit historical data, as encoded by clinical events in EHRs, is, however, not very well understood. In this study, three strategies for handling temporality of clinical events are proposed and evaluated using an EHR database from Stockholm, Sweden. The random forest learning algorithm is applied to predict fourteen ADEs using clinical events collected from different lengths of patient history. The results show that, in general, including longer patient history leads to improved predictive performance, and that assigning weights to events according to time distance from the ADE yields the biggest improvement.

Keywords
drug safety surveillance, pharmacovigilance, adverse drug events, electronic health records, temporality, predictive modeling
National subject category
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-123950 (URN)
Available from: 2015-12-09 Created: 2015-12-09 Last updated: 2017-01-23 Bibliographically approved
6. Temporal weighting of clinical events in electronic health records for pharmacovigilance
2015 (English). In: 2015 IEEE International Conference on Bioinformatics and Biomedicine: Proceedings / [ed] Jun (Luke) Huan et al., IEEE Computer Society, 2015, p. 375-381. Conference paper, Published paper (Refereed)
Abstract [en]

Electronic health records (EHRs) have recently been identified as a potentially valuable source for monitoring adverse drug events (ADEs). However, ADEs are heavily under-reported in EHRs. Using machine learning algorithms to automatically detect patients that should have had ADEs reported in their health records is an efficient and effective solution. One of the challenges to that end is how to take into account temporality when using clinical events, which are time stamped in EHRs, as features for machine learning algorithms to exploit. Previous research on this topic suggests that representing EHR data as a bag of temporally weighted clinical events is promising; however, how to assign weights in an optimal manner remains unexplored. In this study, nine different temporal weighting strategies are proposed and evaluated using data extracted from a Swedish EHR database, where the predictive performance of models constructed with the random forest learning algorithm is compared. Moreover, variable importance is analyzed to obtain a deeper understanding as to why a certain weighting strategy is favored over another, as well as which clinical events undergo the biggest changes in importance with the various weighting strategies. The results show that the choice of weighting strategy has a significant impact on the predictive performance for ADE detection, and that the best choice of weighting strategy depends on the target ADE and, specifically, on its dose-dependency.
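
To make the notion of a bag of temporally weighted clinical events concrete, here is a minimal sketch. The nine strategies evaluated in the paper are not listed in the abstract, so the two weighting functions below (inverse distance and exponential decay with a 30-day half-life) are illustrative assumptions, as are the function names and the example history.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# An event is (clinical code, days between the event and the prediction point).
Event = Tuple[str, int]


def inverse_weight(days: int) -> float:
    """Assumed strategy: weight falls off as 1 / (1 + distance in days)."""
    return 1.0 / (1.0 + days)


def exponential_weight(days: int, half_life: float = 30.0) -> float:
    """Assumed strategy: weight halves every `half_life` days."""
    return 0.5 ** (days / half_life)


def weighted_bag(
    events: Iterable[Event],
    weight: Callable[[int], float] = inverse_weight,
) -> Dict[str, float]:
    """Sum the temporal weights of all occurrences of each clinical code."""
    bag: Dict[str, float] = defaultdict(float)
    for code, days in events:
        bag[code] += weight(days)
    return dict(bag)


# Hypothetical patient history.
history = [("A10BA02", 0), ("A10BA02", 30), ("N02BE01", 90)]
print(weighted_bag(history, exponential_weight))
```

An unweighted bag would count the two codes as 2 and 1; with a decaying weight, the event 90 days back contributes far less than the recent ones, and swapping the weight function is all it takes to try another strategy.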

Place, publisher, year, edition, pages
IEEE Computer Society, 2015
National subject category
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-123971 (URN); 10.1109/BIBM.2015.7359710 (DOI); 978-1-4673-6798-1 (ISBN)
Conference
2015 IEEE International Conference on Bioinformatics and Biomedicine, Washington DC, USA, 9-12 November 2015
Available from: 2015-12-09 Created: 2015-12-09 Last updated: 2017-01-23 Bibliographically approved
7. Learning temporal weights of clinical events using variable importance
2016 (English). In: BMC Medical Informatics and Decision Making, ISSN 1472-6947, E-ISSN 1472-6947, Vol. 16, no. Suppl. 2, 71. Article in journal (Refereed) Published
Abstract [en]

Background: Longitudinal data sources, such as electronic health records (EHRs), are very valuable for monitoring adverse drug events (ADEs). However, ADEs are heavily under-reported in EHRs. Using machine learning algorithms to automatically detect patients that should have had ADEs reported in their health records is an efficient and effective solution. One of the challenges to that end is how to take into account the temporality of clinical events, which are time stamped in EHRs, and to provide these as features for machine learning algorithms to exploit. Previous research on this topic suggests that representing EHR data as a bag of temporally weighted clinical events is promising; however, the weights were in that case pre-assigned according to their time stamps, which is limited and potentially less accurate. This study therefore focuses on how to learn weights that effectively take into account the temporality and importance of clinical events for ADE detection.

Methods: Variable importance obtained from the random forest learning algorithm is used for extracting temporal weights. Two strategies are proposed for applying the learned weights: weighted aggregation and weighted sampling. The first strategy aggregates the weighted clinical events from different time windows to form new features; the second strategy retains the original features but samples them by using their weights as probabilities when building each tree in the forest. The predictive performance of random forest models using the learned weights with the two strategies is compared to using pre-assigned weights. In addition, to assess the sensitivity of the weight-learning procedure, weights from different granularity levels are evaluated and compared.

Results: In the weighted sampling strategy, using learned weights significantly improves the predictive performance, in comparison to using pre-assigned weights; however, there is no significant difference between them in the weighted aggregation strategy. Moreover, the granularity of the weight learning procedure has a significant impact on the former, but not on the latter.

Conclusions: Learning temporal weights is significantly beneficial in terms of predictive performance with the weighted sampling strategy. Moreover, weighted aggregation generally diminishes the impact of temporal weighting of the clinical events, irrespective of whether the weights are pre-assigned or learned.
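
The two ways of applying temporal weights described in the Methods can be sketched as below. This is an illustration under stated assumptions, not the paper's implementation: the weights are given as plain numbers rather than learned from random forest variable importance, sampling is done without replacement, all weights are assumed to be positive, and every name and value is hypothetical.

```python
import random
from typing import Dict, List


def weighted_aggregation(
    counts_per_window: List[Dict[str, float]],
    window_weights: List[float],
) -> Dict[str, float]:
    """Collapse per-window event counts into one weighted feature per code."""
    aggregated: Dict[str, float] = {}
    for counts, w in zip(counts_per_window, window_weights):
        for code, count in counts.items():
            aggregated[code] = aggregated.get(code, 0.0) + w * count
    return aggregated


def weighted_feature_sample(
    weights: Dict[str, float], k: int, rng: random.Random
) -> List[str]:
    """Draw k distinct features with probability proportional to their weight.

    Assumes every weight is positive; random.choices rejects an all-zero total.
    """
    remaining = dict(weights)
    chosen: List[str] = []
    for _ in range(min(k, len(remaining))):
        names = list(remaining)
        pick = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen


# Hypothetical data: two time windows (recent, older) and per-feature weights.
windows = [{"A10BA02": 2.0, "N02BE01": 1.0}, {"A10BA02": 1.0}]
print(weighted_aggregation(windows, [1.0, 0.5]))

feature_weights = {"A10BA02@recent": 0.6, "A10BA02@older": 0.1, "N02BE01@recent": 0.3}
print(weighted_feature_sample(feature_weights, 2, random.Random(0)))
```

Aggregation merges the time windows into a single value per code, which is consistent with the finding that it dampens the effect of the weights; sampling keeps the window-specific features and lets the weights decide how often each is offered to a tree.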

Keywords
Learning weights, Temporality, Adverse drug events, Electronic health records, Machine learning, Random forest, Pharmacovigilance
National subject category
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-136593 (URN); 10.1186/s12911-016-0311-6 (DOI)
Conference
IEEE International Conference on Bioinformatics and Biomedicine 2015, Washington, DC, USA, 9–12 November 2015
Available from: 2016-12-12 Created: 2016-12-12 Last updated: 2017-01-23 Bibliographically approved
8. Learning from heterogeneous temporal data from electronic health records
2017 (English). In: Journal of Biomedical Informatics, ISSN 1532-0464, E-ISSN 1532-0480, Vol. 65, p. 105-119. Article in journal (Refereed) Published
Abstract [en]

Electronic health records contain large amounts of longitudinal data that are valuable for biomedical informatics research. The application of machine learning is a promising alternative to manual analysis of such data. However, the complex structure of the data, which includes clinical events that are unevenly distributed over time, poses a challenge for standard learning algorithms. Some approaches to modeling temporal data rely on extracting single values from time series; however, this leads to the loss of potentially valuable sequential information. How to better account for the temporality of clinical data, hence, remains an important research question. In this study, novel representations of temporal data in electronic health records are explored. These representations retain the sequential information, and are directly compatible with standard machine learning algorithms. The explored methods are based on symbolic sequence representations of time series data, which are utilized in a number of different ways. An empirical investigation, using 19 datasets comprising clinical measurements observed over time from a real database of electronic health records, shows that using a distance measure to random subsequences leads to substantial improvements in predictive performance compared to using the original sequences or clustering the sequences. Evidence is moreover provided on the quality of the symbolic sequence representation by comparing it to sequences that are generated using domain knowledge by clinical experts. The proposed method creates representations that better account for the temporality of clinical events, which is often key to prediction tasks in the biomedical domain.
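
The general idea of turning a time series into a symbolic sequence and then describing it by its distance to random subsequences can be sketched as follows. The abstract does not specify the discretization or the distance measure, so equal-width binning and a sliding-window Hamming distance are assumptions chosen for brevity, and all names and values are hypothetical.

```python
import random
from typing import List, Sequence


def symbolize(series: Sequence[float], alphabet: str = "abc") -> str:
    """Map each value to a symbol by equal-width binning over the series range."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return alphabet[0] * len(series)
    width = (hi - lo) / len(alphabet)
    last = len(alphabet) - 1
    return "".join(alphabet[min(int((v - lo) / width), last)] for v in series)


def min_hamming(sequence: str, sub: str) -> int:
    """Smallest Hamming distance between `sub` and any equally long window."""
    if len(sub) > len(sequence):
        return len(sub)  # assumption: a too-short sequence is maximally distant
    n = len(sub)
    return min(
        sum(a != b for a, b in zip(sequence[i:i + n], sub))
        for i in range(len(sequence) - n + 1)
    )


def random_subsequences(
    sequences: List[str], n: int, length: int, rng: random.Random
) -> List[str]:
    """Draw n random windows of the given length (needs one long-enough sequence)."""
    candidates = [s for s in sequences if len(s) >= length]
    subs = []
    for _ in range(n):
        s = rng.choice(candidates)
        start = rng.randrange(len(s) - length + 1)
        subs.append(s[start:start + length])
    return subs


# Hypothetical measurement series for two patients.
seqs = [symbolize([1, 2, 3, 4, 5, 6]), symbolize([5.0, 5.5, 1.0, 1.2])]
subs = random_subsequences(seqs, n=3, length=2, rng=random.Random(0))
# One fixed-length feature vector per patient: distance to each subsequence.
vectors = [[min_hamming(s, sub) for sub in subs] for s in seqs]
print(seqs, subs, vectors)
```

Because every patient ends up with one distance per sampled subsequence, sequences of different lengths become fixed-length vectors that a standard learner can consume while still reflecting the order of the observations.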

Keywords
random subsequence, time series classification, electronic health records, data mining, machine learning
National subject category
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-137481 (URN); 10.1016/j.jbi.2016.11.006 (DOI); 000406235200008 (ISI)
Available from: 2017-01-08 Created: 2017-01-08 Last updated: 2017-08-21 Bibliographically approved

Open Access in DiVA

Learning Predictive Models from Electronic Health Records (972 kB), 359 downloads
File information
File: FULLTEXT01.pdf; File size: 972 kB; Checksum: SHA-512
5bc7fa4168368b12fa4b3e9f32ca2c9d70a4d6b75ab8eb9e257c474b067a0d82018584b23b908083ab93c514b1a56f77f72da8c98ed29b8f25056ede5ee866da
Type: fulltext; Mimetype: application/pdf

Author/editor
Zhao, Jing