Ändra sökning
Avgränsa sökresultatet
1 - 32 av 32
RefereraExporteraLänk till träfflistan
Permanent länk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf
Träffar per sida
  • 5
  • 10
  • 20
  • 50
  • 100
  • 250
Sortering
  • Standard (Relevans)
  • Författare A-Ö
  • Författare Ö-A
  • Titel A-Ö
  • Titel Ö-A
  • Publikationstyp A-Ö
  • Publikationstyp Ö-A
  • Äldst först
  • Nyast först
  • Skapad (Äldst först)
  • Skapad (Nyast först)
  • Senast uppdaterad (Äldst först)
  • Senast uppdaterad (Nyast först)
  • Disputationsdatum (tidigaste först)
  • Disputationsdatum (senaste först)
  • Standard (Relevans)
  • Författare A-Ö
  • Författare Ö-A
  • Titel A-Ö
  • Titel Ö-A
  • Publikationstyp A-Ö
  • Publikationstyp Ö-A
  • Äldst först
  • Nyast först
  • Skapad (Äldst först)
  • Skapad (Nyast först)
  • Senast uppdaterad (Äldst först)
  • Senast uppdaterad (Nyast först)
  • Disputationsdatum (tidigaste först)
  • Disputationsdatum (senaste först)
Markera
Maxantalet träffar du kan exportera från sökgränssnittet är 250. Vid större uttag använd dig av utsökningar.
  • 1. Ahltorp, Magnus
    et al.
    Skeppstedt, Maria
    Kitajima, Shiho
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Rzepka, Rafal
    Araki, Kenji
    Expansion of medical vocabularies using distributional semantics on Japanese patient blogs2016Ingår i: Journal of Biomedical Semantics, ISSN 2041-1480, E-ISSN 2041-1480, Vol. 7, artikel-id 58Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Background: Research on medical vocabulary expansion from large corpora has primarily been conducted using text written in English or similar languages, due to a limited availability of large biomedical corpora in most languages. Medical vocabularies are, however, essential also for text mining from corpora written in other languages than English and belonging to a variety of medical genres. The aim of this study was therefore to evaluate medical vocabulary expansion using a corpus very different from those previously used, in terms of grammar and orthographics, as well as in terms of text genre. This was carried out by applying a method based on distributional semantics to the task of extracting medical vocabulary terms from a large corpus of Japanese patient blogs. Methods: Distributional properties of terms were modelled with random indexing, followed by agglomerative hierarchical clustering of 3x100 seed terms from existing vocabularies, belonging to three semantic categories: Medical Finding, Pharmaceutical Drug and Body Part. By automatically extracting unknown terms close to the centroids of the created clusters, candidates for new terms to include in the vocabulary were suggested. The method was evaluated for its ability to retrieve the remaining n terms in existing medical vocabularies. Results: Removing case particles and using a context window size of 1 + 1 was a successful strategy for Medical Finding and Pharmaceutical Drug, while retaining case particles and using a window size of 8 + 8 was better for Body Part. For a 10n long candidate list, the use of different cluster sizes affected the result for Pharmaceutical Drug, while the effect was only marginal for the other two categories. For a list of top n candidates for Body Part, however, clusters with a size of up to two terms were slightly more useful than larger clusters. For Pharmaceutical Drug, the best settings resulted in a recall of 25 % for a candidate list of top n terms and a recall of 68 % for top 10n. For a candidate list of top 10n candidates, the second best results were obtained for Medical Finding: a recall of 58 %, compared to 46 % for Body Part. Only taking the top n candidates into account, however, resulted in a recall of 23 % for Body Part, compared to 16 % for Medical Finding. Conclusions: Different settings for corpus pre-processing, window sizes and cluster sizes were suitable for different semantic categories and for different lengths of candidate lists, showing the need to adapt parameters, not only to the language and text genre used, but also to the semantic category for which the vocabulary is to be expanded. The results show, however, that the investigated choices for pre-processing and parameter settings were successful, and that a Japanese blog corpus, which in many ways differs from those used in previous studies, can be a useful resource for medical vocabulary expansion.

  • 2. Berndorfer, Stefan
    et al.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Automated Diagnosis Coding with Combined Text Representations2017Ingår i: Studies in Health Technology and Informatics, ISSN 0926-9630, E-ISSN 1879-8365, Vol. 235, s. 201-2015Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Automated diagnosis coding can be provided efficiently by learning predictive models from historical data; however, discriminating between thousands of codes while allowing a variable number of codes to be assigned is extremely difficult. Here, we explore various text representations and classification models for assigning ICD-9 codes to discharge summaries in MIMIC-III. It is shown that the relative effectiveness of the investigated representations depends on the frequency of the diagnosis code under consideration and that the best performance is obtained by combining models built using different representations.

  • 3.
    Dalianis, Hercules
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Hassel, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Skeppstedt, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Stockholm EPR Corpus: A Clinical Database Used to Improve Health Care2012Ingår i: Proceedings of SLCT 2012: The Fourth Swedish Language Technology Conference, 2012, s. 17-18Konferensbidrag (Övrigt vetenskapligt)
    Abstract [en]

    The care of patients is well documented in health records. Despite being a valuable source of information that could be mined by computers and used to improve health care, health records are not readily available for research. Moreover, the narrative parts of the records are noisy and need to be interpreted by domain experts. In this abstract we describe our experiences of gaining access to a database of electronic health records for research. We also highlight some important issues in this domain and describe a number of possible applications, including comorbidity networks, detection of hospital-acquired infections and adverse drug reactions, as well as diagnosis coding support.

  • 4.
    Dalianis, Hercules
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Velupillai, Sumithra
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Weegar, Rebecka
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    HEALTH BANK - A Workbench for Data Science Applications in Healthcare2015Ingår i: Industry Track Workshop, CEUR Workshop Proceedings , 2015, Vol. 1381, s. 1-18Konferensbidrag (Refereegranskat)
    Abstract [en]

    The enormous amounts of data that are generated in the healthcare process and stored in electronic health record (EHR) systems are an underutilized resource that, with the use of data science applica- tions, can be exploited to improve healthcare. To foster the development and use of data science applications in healthcare, there is a fundamen- tal need for access to EHR data, which is typically not readily available to researchers and developers. A relatively rare exception is the large EHR database, the Stockholm EPR Corpus, comprising data from more than two million patients, that has been been made available to a lim- ited group of researchers at Stockholm University. Here, we describe a number of data science applications that have been developed using this database, demonstrating the potential reuse of EHR data to support healthcare and public health activities, as well as facilitate medical re- search. However, in order to realize the full potential of this resource, it needs to be made available to a larger community of researchers, as well as to industry actors. To that end, we envision the provision of an in- frastructure around this database called HEALTH BANK – the Swedish Health Record Research Bank. It will function both as a workbench for the development of data science applications and as a data explo- ration tool, allowing epidemiologists, pharmacologists and other medical researchers to generate and evaluate hypotheses. Aggregated data will be fed into a pipeline for open e-access, while non-aggregated data will be provided to researchers within an ethical permission framework. We believe that HEALTH BANK has the potential to promote a growing industry around the development of data science applications that will ultimately increase the efficiency and effectiveness of healthcare.

  • 5.
    Dziadek, Juliusz
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Duneld, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Improving Terminology Mapping in Clinical Text with Context-Sensitive Spelling Correction2017Ingår i: Studies in Health Technology and Informatics, ISSN 0926-9630, E-ISSN 1879-8365, Vol. 235, s. 241-245Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    The mapping of unstructured clinical text to an ontology facilitates meaningful secondary use of health records but is non-trivial due to lexical variation and the abundance of misspellings in hurriedly produced notes. Here, we apply several spelling correction methods to Swedish medical text and evaluate their impact on SNOMED CT mapping; first in a controlled evaluation using medical literature text with induced errors, followed by a partial evaluation on clinical notes. It is shown that the best-performing method is context-sensitive, taking into account trigram frequencies and utilizing a corpus-based dictionary.

  • 6.
    Grigonyte, Gintare
    et al.
    Stockholms universitet, Humanistiska fakulteten, Institutionen för lingvistik, Avdelningen för datorlingvistik.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Wirén, Mats
    Stockholms universitet, Humanistiska fakulteten, Institutionen för lingvistik, Avdelningen för datorlingvistik.
    Velupillai, Sumithra
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Swedification patterns of Latin and Greek affixes in clinical text2016Ingår i: Nordic Journal of Linguistics, ISSN 0332-5865, E-ISSN 1502-4717, Vol. 39, nr 1, s. 5-37Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Swedish medical language is rich with Latin and Greek terminology which has undergone a Swedification since the 1980s. However, many original expressions are still used by clinical professionals. The goal of this study is to obtain precise quantitative measures of how the foreign terminology is manifested in Swedish clinical text. To this end, we explore the use of Latin and Greek affixes in Swedish medical texts in three genres: clinical text, scientific medical text and online medical information for laypersons. More specifically, we use frequency lists derived from tokenised Swedish medical corpora in the three domains, and extract word pairs belonging to types that display both the original and Swedified spellings. We describe six distinct patterns explaining the variation in the usage of Latin and Greek affixes in clinical text. The results show that to a large extent affixes in clinical text are Swedified and that prefixes are used more conservatively than suffixes.

  • 7.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Ensembles of Semantic Spaces: On Combining Models of Distributional Semantics with Applications in Healthcare2015Doktorsavhandling, sammanläggning (Övrigt vetenskapligt)
    Abstract [en]

    Distributional semantics allows models of linguistic meaning to be derived from observations of language use in large amounts of text. By modeling the meaning of words in semantic (vector) space on the basis of co-occurrence information, distributional semantics permits a quantitative interpretation of (relative) word meaning in an unsupervised setting, i.e., human annotations are not required. The ability to obtain inexpensive word representations in this manner helps to alleviate the bottleneck of fully supervised approaches to natural language processing, especially since models of distributional semantics are data-driven and hence agnostic to both language and domain.

    All that is required to obtain distributed word representations is a sizeable corpus; however, the composition of the semantic space is not only affected by the underlying data but also by certain model hyperparameters. While these can be optimized for a specific downstream task, there are currently limitations to the extent the many aspects of semantics can be captured in a single model. This dissertation investigates the possibility of capturing multiple aspects of lexical semantics by adopting the ensemble methodology within a distributional semantic framework to create ensembles of semantic spaces. To that end, various strategies for creating the constituent semantic spaces, as well as for combining them, are explored in a number of studies.

    The notion of semantic space ensembles is generalizable across languages and domains; however, the use of unsupervised methods is particularly valuable in low-resource settings, in particular when annotated corpora are scarce, as in the domain of Swedish healthcare. The semantic space ensembles are here empirically evaluated for tasks that have promising applications in healthcare. It is shown that semantic space ensembles – created by exploiting various corpora and data types, as well as by adjusting model hyperparameters such as the size of the context window and the strategy for handling word order within the context window – are able to outperform the use of any single constituent model on a range of tasks. The semantic space ensembles are used both directly for k-nearest neighbors retrieval and for semi-supervised machine learning. Applying semantic space ensembles to important medical problems facilitates the secondary use of healthcare data, which, despite its abundance and transformative potential, is grossly underutilized.

  • 8.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Learning multiple distributed prototypes of semantic categories for named entity recognition2015Ingår i: International Journal of Data Mining and Bioinformatics, ISSN 1748-5681, Vol. 13, nr 4, s. 395-411Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    The scarcity of large labelled datasets comprising clinical text that can be exploited within the paradigm of supervised machine learning creates barriers for the secondary use of data from electronic health records. It is therefore important to develop capabilities to leverage the large amounts of unlabelled data that, indeed, tend to be readily available. One technique utilises distributional semantics to create word representations in a wholly unsupervised manner and uses existing training data to learn prototypical representations of predefined semantic categories. Features describing whether a given word belongs to a certain category are then provided to the learning algorithm. It has been shown that using multiple distributional semantic models, each employing a different word order strategy, can lead to enhanced predictive performance. Here, another hyperparameter is also varied – the size of the context window – and an experimental investigation shows that this leads to further performance gains.

  • 9.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Semantic Spaces of Clinical Text: Leveraging Distributional Semantics for Natural Language Processing of Electronic Health Records2013Licentiatavhandling, sammanläggning (Övrigt vetenskapligt)
    Abstract [sv]

    De stora mängder kliniska data som genereras i patientjournalsystem är en underutnyttjad resurs med en enorm potential att förbättra hälso- och sjukvården. Då merparten av kliniska data är i form av ostrukturerad text, vilken är utmanande för datorer att analysera, finns det ett behov av sofistikerade metoder som kan behandla kliniskt språk. Metoder som inte kräver märkta exempel utan istället utnyttjar statistiska egenskaper i datamängden är särskilt värdefulla, med tanke på den begränsade tillgången till annoterade korpusar i den kliniska domänen.

    System för informationsextraktion och språkbehandling behöver innehålla viss kunskap om semantik. En metod går ut på att utnyttja de distributionella egenskaperna hos språk – mer specifikt, statistisk över hur termer samförekommer – för att modellera den relativa betydelsen av termer i ett högdimensionellt vektorrum. Metoden har använts med framgång i en rad uppgifter för behandling av allmänna språk; dess tillämpning i den kliniska domänen har dock endast utforskats i mindre utsträckning. Genom att tillämpa modeller för distributionell semantik på klinisk text kan semantiska rum konstrueras utan någon tillgång till märkta exempel. Semantiska rum av klinisk text kan sedan användas i en rad medicinskt relevanta tillämpningar.

    Tillämpningen av distributionell semantik i den kliniska domänen illustreras här i tre användningsområden: (1) synonymextraktion av medicinska termer, (2) tilldelning av diagnoskoder och (3) identifiering av läkemedelsbiverkningar. Det krävs dock att vissa begränsningar eller utmaningar adresseras för att möjliggöra en effektiv tillämpning av distributionell semantik på ett brett spektrum av uppgifter som behandlar språk – både allmänt och, i synnerhet, kliniskt – såsom hur man kan modellera betydelsen av flerordstermer och redogöra för funktionen av negation: ett enkelt sätt att modellera parafrasering och negation i ett distributionellt semantiskt ramverk presenteras och utvärderas. Idén om ensembler av semantisk rum introduceras också; dessa överträffer användningen av ett enda semantiskt rum för synonymextraktion. Den här metoden möjliggör en kombination av olika modeller för distributionell semantik, med olika parameterkonfigurationer samt inducerade från olika korpusar. Detta är inte minst viktigt i den kliniska domänen, då det gör det möjligt att komplettera potentiellt begränsade mängder kliniska data med data från andra, mer lättillgängliga källor. Arbetet påvisar också vikten av att konfigurera dimensionaliteten av semantiska rum, i synnerhet när vokabulären är omfattande, vilket är vanligt i den kliniska domänen.

  • 10.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Conway, Mike
    Duneld, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Chapman, Wendy W.
    Identifying Synonymy between SNOMED Clinical Terms of Varying Length Using Distributional Analysis of Electronic Health Records2013Ingår i: AMIA Conference Proceedings Archive, American Medical Informatics Association , 2013, s. 600-609Konferensbidrag (Refereegranskat)
    Abstract [en]

    Medical terminologies and ontologies are important tools for natural language processing of health record narratives. To account for the variability of language use, synonyms need to be stored in a semantic resource as textual instantiations of a concept. Developing such resources manually is, however, prohibitively expensive and likely to result in low coverage. To facilitate and expedite the process of lexical resource development, distributional analysis of large corpora provides a powerful data-driven means of (semi-)automatically identifying semantic relations, including synonymy, between terms. In this paper, we demonstrate how distributional analysis of a large corpus of electronic health records – the MIMIC-II database – can be employed to extract synonyms of SNOMED CT preferred terms. A distinctive feature of our method is its ability to identify synonymous relations between terms of varying length.

  • 11.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kowalski, Stewart
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Generating Features for Named Entity Recognition by Learning Prototypes in Semantic Space: The Case of De-Identifying Health Records2014Ingår i: 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE conference proceedings, 2014, s. 450-457Konferensbidrag (Refereegranskat)
    Abstract [en]

    Creating sufficiently large annotated resources for supervised machine learning, and doing so for every problem and every domain, is prohibitively expensive. Techniques that leverage large amounts of unlabeled data, which are often readily available, may decrease the amount of data that needs to be annotated to obtain a certain level of performance, as well as improve performance when large annotated resources are indeed available. Here, the development of one such method is presented, where semantic features are generated by exploiting the available annotations to learn prototypical (vector) representations of each named entity class in semantic space, constructed by employing a model of distributional semantics (random indexing) over a large, unannotated, in-domain corpus. Binary features that describe whether a given word belongs to a specific named entity class are provided to the learning algorithm; the feature values are determined by calculating the (cosine) distance in semantic space to each of the learned prototype vectors and ascertaining whether they are below or above a given threshold, set to optimize Fβ-score. The proposed method is evaluated empirically in a series of experiments, where the case is health-record deidentification, a task that involves identifying protected health information (PHI) in text. It is shown that a conditional random fields model with access to the generated semantic features, in addition to a set of orthographic and syntactic features, significantly outperforms, in terms of F1-score, a baseline model without access to the semantic features. Moreover, the quality of the features is further improved by employing a number of slightly different models of distributional semantics in an ensemble. Finally, the way in which the features are generated allows one to optimize them for various Fβ -scores, giving some degree of control to trade off precision and recall. Methods that are able to improve performance on named entity recognition tasks by exploiting large amounts of unlabeled data may substantially reduce costs involved in creating annotated resources for every domain and every problem.

  • 12.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Hassel, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Optimizing the Dimensionality of Clinical Term Spaces for Improved Diagnosis Coding Support2013Ingår i: Proceedings of the 4th International Louhi Workshop on Health Document Text Mining and Information Analysis (Louhi 2013) / [ed] Hanna Suominen, NICTA , 2013Konferensbidrag (Refereegranskat)
    Abstract [en]

    In natural language processing, dimensionality reduction is a common technique to reduce complexity that simultaneously addresses the sparseness property of language. It is also used as a means to capture some latent structure in text, such as the underlying semantics. Dimensionality reduction is an important property of the word space model, not least in random indexing, where the dimensionality is a predefined model parameter. In this paper, we demonstrate the importance of dimensionality optimization and discuss correlations between dimensionality and the size of the vocabulary. This is of particular importance in the clinical domain, where the level of noise in the text leads to a large vocabulary; it may also mitigate the effect of exploding vocabulary sizes when modeling multiword terms as single tokens. A system that automatically assigns diagnosis codes to patient record entries is shown to improve by up to 18 percentage points by manually optimizing the dimensionality.

  • 13.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Detecting Protected Health Information in Heterogeneous Clinical Notes2017Ingår i: Studies in Health Technology and Informatics, ISSN 0926-9630, E-ISSN 1879-8365, Vol. 245, s. 393-397Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    To enable secondary use of healthcare data in a privacy-preserving manner, there is a need for methods capable of automatically identifying protected health information (PHI) in clinical text. To that end, learning predictive models from labeled examples has emerged as a promising alternative to rule-based systems. However, little is known about differences with respect to PHI prevalence in different types of clinical notes and how potential domain differences may affect the performance of predictive models trained on one particular type of note and applied to another. In this study, we analyze the performance of a predictive model trained on an existing PHI corpus of Swedish clinical notes and applied to a variety of clinical notes: written (i) in different clinical specialties, (ii) under different headings, and (iii) by persons in different professions. The results indicate that domain adaption is needed for effective detection of PHI in heterogeneous clinical notes.

  • 14.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap. Karolinska Institutet, Sweden.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Prevalence Estimation of Protected Health Information in Swedish Clinical Text2017Ingår i: Studies in Health Technology and Informatics, ISSN 0926-9630, E-ISSN 1879-8365, Vol. 235, s. 216-220Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Obscuring protected health information (PHI) in the clinical text of health records facilitates the secondary use of healthcare data in a privacy-preserving manner. Although automatic de-identification of clinical text using machine learning holds much promise, little is known about the relative prevalence of PHI in different types of clinical text and whether there is a need for domain adaptation when learning predictive models from one particular domain and applying it to another. In this study, we address these questions by training a predictive model and using it to estimate the prevalence of PHI in clinical text written (1) in different clinical specialties, (2) in different types of notes (i.e., under different headings), and (3) by persons in different professional roles. It is demonstrated that the overall PHI density is 1.57%; however, substantial differences exist across domains.

  • 15.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap. Karolinska Institutet, Sweden.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Duneld, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Identifying adverse drug event information in clinical notes with distributional semantic representations of context2015Ingår i: Journal of Biomedical Informatics, ISSN 1532-0464, E-ISSN 1532-0480, Vol. 57, s. 333-349Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    For the purpose of post-marketing drug safety surveillance, which has traditionally relied on the volun- tary reporting of individual cases of adverse drug events (ADEs), other sources of information are now being explored, including electronic health records (EHRs), which give us access to enormous amounts of longitudinal observations of the treatment of patients and their drug use. Adverse drug events, which can be encoded in EHRs with certain diagnosis codes, are, however, heavily underreported. It is therefore important to develop capabilities to process, by means of computational methods, the more unstructured EHR data in the form of clinical notes, where clinicians may describe and reason around suspected ADEs. In this study, we report on the creation of an annotated corpus of Swedish health records for the purpose of learning to identify information pertaining to ADEs present in clinical notes. To this end, three key tasks are tackled: recognizing relevant named entities (disorders, symptoms, drugs), labeling attributes of the recognized entities (negation, speculation, temporality), and relationships between them (indication, adverse drug event). For each of the three tasks, leveraging models of distributional semantics – i.e., unsupervised methods that exploit co-occurrence information to model, typically in vector space, the meaning of words – and, in particular, combinations of such models, is shown to improve the predictive performance. The ability to make use of such unsupervised methods is critical when faced with large amounts of sparse and high-dimensional data, especially in domains where annotated resources are scarce.

  • 16.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap. Karolinska University Hospital.
    Hassel, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Exploration of Adverse Drug Reactions in Semantic Vector Space Models of Clinical Text2012Ingår i:  , 2012Konferensbidrag (Refereegranskat)
    Abstract [en]

    A novel method for identifying potential side-effects to medications through large-scale analysis of clinical data is here introduced and evaluated. By calculating distributional similarities for medication-symptom pairs based on co-occurrence information in a large clinical corpus, many known adverse drug reactions are successfully identified. These preliminary results suggest that semantic vector space models of clinical text could also be used to generate hypotheses about potentially unknown adverse drug reactions. In the best model, 50% of the terms in a list of twenty are considered to be conceivable side-effects. Among the medication-symptom pairs, however, diagnostic indications and terms related to the medication in other ways also appear. These relations need to be distinguished in a more refined method for detecting adverse drug reactions.

  • 17.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Moen, Hans
    Skeppstedt, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Daudaravičius, Vidas
    Duneld, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Synonym extraction and abbreviation expansion with ensembles of semantic spaces2014Ingår i: Journal of Biomedical Semantics, ISSN 2041-1480, E-ISSN 2041-1480, Vol. 5, nr 6Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Background: Terminologies that account for variation in language use by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in the medical domain, manual construction of semantic resources that accurately reflect language use is both costly and challenging, often resulting in low coverage. Although models of distributional semantics applied to large corpora provide a potential means of supporting development of such resources, their ability to isolate synonymy from other semantic relations is limited. Their application in the clinical domain has also only recently begun to be explored. Combining distributional models and applying them to different types of corpora may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. Results: A combination of two distributional models – Random Indexing and Random Permutation – employed in conjunction with a single corpus outperforms using either of the models in isolation. Furthermore, combining semantic spaces induced from different types of corpora – a corpus of clinical text and a corpus of medical journal articles – further improves results, outperforming a combination of semantic spaces induced from a single source, as well as a single semantic space induced from the conjoint corpus. A combination strategy that simply sums the cosine similarity scores of candidate terms is generally the most profitable out of the ones explored. Finally, applying simple post-processing filtering rules yields substantial performance gains on the tasks of extracting abbreviation-expansion pairs, but not synonyms. The best results, measured as recall in a list of ten candidate terms, for the three tasks are: 0.39 for abbreviations to long forms, 0.33 for long forms to abbreviations, and 0.47 for synonyms. Conclusions: This study demonstrates that ensembles of semantic spaces can yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits further exploration, allows different distributional models – with different model parameters – and different types of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural language processing tasks.

  • 18.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Moen, Hans
    Skeppstedt, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Eklund, Ann-Marie
    Daudaravičius, Vidas
    Hassel, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Synonym Extraction of Medical Terms from Clinical Text Using Combinations of Word Space Models2012Ingår i: Proceedings of the 5th International Symposium on Semantic Mining in Biomedicine (SMBM 2012), 2012, s. 10-17Konferensbidrag (Refereegranskat)
    Abstract [en]

    In information extraction, it is useful to know if two signifiers have the same or very similar semantic content. Maintaining such information in a controlled vocabulary is, however, costly. Here it is demonstrated how synonyms of medical terms can be extracted automatically from a large corpus of clinical text using distributional semantics. By combining Random Indexing and Random Permutation, different lexical semantic aspects are captured, effectively increasing our ability to identify synonymic relations between terms. 44% of 340 synonym pairs from MeSH are successfully extracted in a list of ten suggestions. The models can also be used to map abbreviations to their full-length forms; simple pattern-based filtering of the suggestions yields substantial improvements.

  • 19.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Skeppstedt, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap. Karolinska Institute, Sweden.
    Duneld, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Conway, Mike
    Corpus-Driven Terminology Development: Populating Swedish SNOMED CT with Synonyms Extracted from Electronic Health Records2013Ingår i: Proceedings of the 2013 Workshop on Biomedical Natural Language Processing (BioNLP 2013), Association for Computational Linguistics, 2013, s. 36-44Konferensbidrag (Refereegranskat)
    Abstract [en]

    The various ways in which one can refer to the same clinical concept needs to be accounted for in a semantic resource such as SNOMED CT. Developing terminological resources manually is, however, prohibitively expensive and likely to result in low coverage, especially given the high variability of language use in clinical text. To support this process, distributional methods can be employed in conjunction with a large corpus of electronic health records to extract synonym candidates for clinical terms. In this paper, we exemplify the potential of our proposed method using the Swedish version of SNOMED CT, which currently lacks synonyms. A medical expert inspects two thousand term pairs generated by two semantic spaces -- one of which models multiword terms in addition to single words -- for one hundred preferred terms of the semantic types disorder and finding.

  • 20.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Zhao, Jing
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Modeling Electronic Health Records in Ensembles of Semantic Spaces for Adverse Drug Event Detection2015Ingår i: 2015 IEEE International Conference on Bioinformatics and Biomedicine: Proceedings / [ed] Jun (Luke) Huan et al., IEEE Computer Society, 2015, s. 343-350Konferensbidrag (Refereegranskat)
    Abstract [en]

    Electronic health records (EHRs) are emerging as a potentially valuable source for pharmacovigilance; however, adverse drug events (ADEs), which can be encoded in EHRs by a set of diagnosis codes, are heavily underreported. Alerting systems, able to detect potential ADEs on the basis of patient- specific EHR data, would help to mitigate this problem. To that end, the use of machine learning has proven to be both efficient and effective; however, challenges remain in representing the heterogeneous EHR data, which moreover tends to be high- dimensional and exceedingly sparse, in a manner conducive to learning high-performing predictive models. Prior work has shown that distributional semantics – that is, natural language processing methods that, traditionally, model the meaning of words in semantic (vector) space on the basis of co-occurrence information – can be exploited to create effective representations of sequential EHR data, not only free-text in clinical notes but also various clinical events such as diagnoses, drugs and measurements. When modeling data in semantic space, an im- portant design decision concerns the size of the context window around an object of interest, which governs the scope of co- occurrence information that is taken into account and affects the composition of the resulting semantic space. Here, we report on experiments conducted on 27 clinical datasets, demonstrating that performance can be significantly improved by modeling EHR data in ensembles of semantic spaces, consisting of multiple semantic spaces built with different context window sizes. A follow-up investigation is conducted to study the impact on predictive performance as increasingly more semantic spaces are included in the ensemble, demonstrating that accuracy tends to improve with the number of semantic spaces, albeit not monotonically so. Finally, a number of different strategies for combining the semantic spaces are explored, demonstrating the advantage of early (feature) fusion over late (classifier) fusion. Ensembles of semantic spaces allow multiple views of (sparse) data to be captured (densely) and thereby enable improved performance to be obtained on the task of detecting ADEs in EHRs.

  • 21.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Zhao, Jing
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Modeling Heterogeneous Clinical Sequence Data in Semantic Space for Adverse Drug Event Detection2015Ingår i: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA) / [ed] Eric Gaussier, Longbing Cao, Patrick Gallinari, James Kwok, Gabriella Pasi, Osmar Zaiane, Institute of Electrical and Electronics Engineers (IEEE), 2015Konferensbidrag (Refereegranskat)
    Abstract [en]

    The enormous amounts of data that are continuously recorded in electronic health record systems offer ample opportunities for data science applications to improve healthcare. There are, however, challenges involved in using such data for machine learning, such as high dimensionality and sparsity, as well as an inherent heterogeneity that does not allow the distinct types of clinical data to be treated in an identical manner. On the other hand, there are also similarities across data types that may be exploited, e.g., the possibility of representing some of them as sequences. Here, we apply the notions underlying distributional semantics, i.e., methods that model the meaning of words in semantic (vector) space on the basis of co-occurrence information, to four distinct types of clinical data: free-text notes, on the one hand, and clinical events, in the form of diagnosis codes, drug codes and measurements, on the other hand. Each semantic space contains continuous vector representations for every unique word and event, which can then be used to create representations of, e.g., care episodes that, in turn, can be exploited by the learning algorithm. This approach does not only reduce sparsity, but also takes into account, and explicitly models, similarities between various items, and it does so in an entirely data-driven fashion. Here, we report on a series of experiments using the random forest learning algorithm that demonstrate the effectiveness, in terms of accuracy and area under ROC curve, of the proposed representation form over the commonly used bag-of-items counterpart. The experiments are conducted on 27 real datasets that each involves the (binary) classification task of detecting a particular adverse drug event. It is also shown that combining structured and unstructured data leads to significant improvements over using only one of them.

  • 22.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Zhao, Jing
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Ensembles of randomized trees using diverse distributed representations of clinical events2016Ingår i: BMC Medical Informatics and Decision Making, ISSN 1472-6947, E-ISSN 1472-6947, Vol. 16, artikel-id 69Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Background: Learning deep representations of clinical events based on their distributions in electronic health records has been shown to allow for subsequent training of higher-performing predictive models compared to the use of shallow, count-based representations. The predictive performance may be further improved by utilizing multiple representations of the same events, which can be obtained by, for instance, manipulating the representation learning procedure. The question, however, remains how to make best use of a set of diverse representations of clinical events – modeled in an ensemble of semantic spaces – for the purpose of predictive modeling. Methods: Three different ways of exploiting a set of (ten) distributed representations of four types of clinical events – diagnosis codes, drug codes, measurements, and words in clinical notes – are investigated in a series of experiments using ensembles of randomized trees. Here, the semantic space ensembles are obtained by varying the context window size in the representation learning procedure. The proposed method trains a forest wherein each tree is built from a bootstrap replicate of the training set whose entire original feature set is represented in a randomly selected set of semantic spaces – corresponding to the considered data types – of a given context window size. Results: The proposed method significantly outperforms concatenating the multiple representations of the bagged dataset; it also significantly outperforms representing, for each decision tree, only a subset of the features in a randomly selected set of semantic spaces. A follow-up analysis indicates that the proposed method exhibits less diversity while significantly improving average tree performance. It is also shown that the size of the semantic space ensemble has a significant impact on predictive performance and that performance tends to improve as the size increases. Conclusions: The strategy for utilizing a set of diverse distributed representations of clinical events when constructing ensembles of randomized trees has a significant impact on predictive performance. The most successful strategy – significantly outperforming the considered alternatives – involves randomly sampling distributed representations of the clinical events when building each decision tree in the forest.

  • 23.
    Skeppstedt, Maria
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Ahltorp, Magnus
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Vocabulary Expansion by Semantic Extraction of Medical Terms2013Ingår i: Proceedings of the 5th International Symposiumon Languages in Biology and Medicine, 2013, s. 63-68Konferensbidrag (Refereegranskat)
    Abstract [en]

    Automatic methods for vocabulary expansion are valuable in supporting the development of terminological resources. Here, we evaluate two methods based on distributional semantics for extracting terms that belong to a certain semantic category. In a list of 1000 terms extracted from a corpus of Swedish medical text, the best method obtains a recall of 0.53 and 0.88, respectively, for identifying 90 terms that are known to belong to the semantic categories Medical Finding and Pharmaceutical Drug.

  • 24. Tengstrand, Lisa
    et al.
    Megyesi, Beata
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Duneld, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    EACL - Expansion of Abbreviations in CLinical text2014Ingår i: Proceedings of the 3rdWorkshop on Predicting and Improving Text Readability for Target Reader Population, Association for Computational Linguistics , 2014Konferensbidrag (Refereegranskat)
    Abstract [en]

    In the medical domain, especially in clinical texts, non-standard abbreviations are prevalent, which impairs readability for patients. To ease the understanding of the physicians’ notes, abbreviations need to be identified and expanded to their original forms. We present a distributional semantic approach to find candidates of the original form of the abbreviation, and combine this with Levenshtein distance to choose the correct candidate among the semantically related words. We apply the method to radiology reports and medical journal texts, and compare the results to general Swedish. The results show that the correct expansion of the abbreviation can be found in 40% of the cases, an improvement by 24 percentage points compared to the baseline (0.16), and an increase by 22 percentage points compared to using word space models alone (0.18).

  • 25.
    Velupillai, Sumithra
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Duneld, MartinStockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.Henriksson, AronStockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.Kvist, MariaStockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.Skeppstedt, MariaStockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.Dalianis, HerculesStockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Louhi 2014: Special issue on health text mining and information analysis2015Proceedings (redaktörskap) (Refereegranskat)
  • 26.
    Velupillai, Sumithra
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Duneld, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Skeppstedt, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Louhi 2014: Special issue on health text mining and information analysis: introduction2015Ingår i: BMC Medical Informatics and Decision Making, ISSN 1472-6947, E-ISSN 1472-6947, Vol. 2, nr SI, s. 1-3Artikel i tidskrift (Refereegranskat)
  • 27.
    Zhao, Jing
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Learning temporal weights of clinical events using variable importance2016Ingår i: BMC Medical Informatics and Decision Making, ISSN 1472-6947, E-ISSN 1472-6947, Vol. 16, nr Suppl. 2, artikel-id 71Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Background: Longitudinal data sources, such as electronic health records (EHRs), are very valuable for monitoring adverse drug events (ADEs). However, ADEs are heavily under-reported in EHRs. Using machine learning algorithms to automatically detect patients that should have had ADEs reported in their health records is an efficient and effective solution. One of the challenges to that end is how to take into account the temporality of clinical events, which are time stamped in EHRs, and providing these as features for machine learning algorithms to exploit. Previous research on this topic suggests that representing EHR data as a bag of temporally weighted clinical events is promising; however, the weights were in that case pre-assigned according to their time stamps, which is limited and potentially less accurate. This study therefore focuses on how to learn weights that effectively take into account the temporality and importance of clinical events for ADE detection. Methods: Variable importance obtained from the random forest learning algorithm is used for extracting temporal weights. Two strategies are proposed for applying the learned weights: weighted aggregation and weighted sampling. The first strategy aggregates the weighted clinical events from different time windows to form new features; the second strategy retains the original features but samples them by using their weights as probabilities when building each tree in the forest. The predictive performance of random forest models using the learned weights with the two strategies is compared to using pre-assigned weights. In addition, to assess the sensitivity of the weight-learning procedure, weights from different granularity levels are evaluated and compared. Results: In the weighted sampling strategy, using learned weights significantly improves the predictive performance, in comparison to using pre-assigned weights; however, there is no significant difference between them in the weighted aggregation strategy. Moreover, the granularity of the weight learning procedure has a significant impact on the former, but not on the latter. Conclusions: Learning temporal weights is significantly beneficial in terms of predictive performance with the weighted sampling strategy. Moreover, weighted aggregation generally diminishes the impact of temporal weighting of the clinical events, irrespective of whether the weights are pre-assigned or learned.

  • 28.
    Zhao, Jing
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Asker, Lars
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Detecting Adverse Drug Events with Multiple Representations of Clinical Measurements2014Ingår i: 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM): Proceedings, IEEE Computer Society, 2014, s. 536-543Konferensbidrag (Refereegranskat)
    Abstract [en]

    Adverse drug events (ADEs) are grossly under-reported in electronic health records (EHRs). This could be mitigated by methods that are able to detect ADEs in EHRs, thereby allowing for missing ADE-specific diagnosis codes to be identified and added. A crucial aspect of constructing such systems is to find proper representations of the data in order to allow the predictive modeling to be as accurate as possible. One category of EHR data that can be used as indicators of ADEs are clinical measurements. However, using clinical measurements as features is not unproblematic due to the high rate of missing values and they can be repeated a variable number of times in each patient health record. In this study, five basic representations of clinical measurements are proposed and evaluated to handle these two problems. An empirical investigation using random forest on 27 datasets from a real EHR database with different ADE targets is presented, demonstrating that the predictive performance, in terms of accuracy and area under ROC curve, is higher when representing clinical measurements crudely as whether they were taken or how many times they were taken by a patient. Furthermore, a sixth alternative, combining all five basic representations, significantly outperforms using any of the basic representation except for one. A subsequent analysis of variable importance is also conducted with this fused feature set, showing that when clinical measurements have a high missing rate, the number of times they were taken by one patient is ranked as more informative than looking at their actual values. The observation from random forest is also confirmed empirically using other commonly employed classifiers. This study demonstrates that the way in which clinical measurements from EHRs are presented has a high impact for ADE detection, and that using multiple representations outperforms using a basic representation.

  • 29.
    Zhao, Jing
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Asker, Lars
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Predictive modeling of structured electronic health records for adverse drug event detection2015Ingår i: BMC Medical Informatics and Decision Making, ISSN 1472-6947, E-ISSN 1472-6947, Vol. 15, nr SIArtikel i tidskrift (Refereegranskat)
    Abstract [en]

    Background: The digitization of healthcare data, resulting from the increasingly widespread adoption of electronic health records, has greatly facilitated its analysis by computational methods and thereby enabled large-scale secondary use thereof. This can be exploited to support public health activities such as pharmacovigilance, wherein the safety of drugs is monitored to inform regulatory decisions about sustained use. To that end, electronic health records have emerged as a potentially valuable data source, providing access to longitudinal observations of patient treatment and drug use. A nascent line of research concerns predictive modeling of healthcare data for the automatic detection of adverse drug events, which presents its own set of challenges: it is not yet clear how to represent the heterogeneous data types in a manner conducive to learning high-performing machine learning models. Methods: Datasets from an electronic health record database are used for learning predictive models with the purpose of detecting adverse drug events. The use and representation of two data types, as well as their combination, are studied: clinical codes, describing prescribed drugs and assigned diagnoses, and measurements. Feature selection is conducted on the various types of data to reduce dimensionality and sparsity, while allowing for an in-depth feature analysis of the usefulness of each data type and representation. Results: Within each data type, combining multiple representations yields better predictive performance compared to using any single representation. The use of clinical codes for adverse drug event detection significantly outperforms the use of measurements; however, there is no significant difference over datasets between using only clinical codes and their combination with measurements. For certain adverse drug events, the combination does, however, outperform using only clinical codes. Feature selection leads to increased predictive performance for both data types, in isolation and combined. Conclusions: We have demonstrated how machine learning can be applied to electronic health records for the purpose of detecting adverse drug events and proposed solutions to some of the challenges this presents, including how to represent the various data types. Overall, clinical codes are more useful than measurements and, in specific cases, it is beneficial to combine the two.

  • 30.
    Zhao, Jing
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Cascading Adverse Drug Event Detection in Electronic Health Records2015Ingår i: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA): Proceedings, IEEE Computer Society, 2015Konferensbidrag (Refereegranskat)
    Abstract [en]

    The ability to detect adverse drug events (ADEs) in electronic health records (EHRs) is useful in many medical applications, such as alerting systems that indicate when an ADE-specific diagnosis code should be assigned. Automating the detection of ADEs can be attempted by applying machine learning to existing, labeled EHR data. How to do this in an effective manner is, however, an open question. The issues addressed in this study concern the granularity of the classification task: (1) If we wish to predict the occurrence of ADE, is it advantageous to conflate the various ADE class labels prior to learning, or should they be merged post prediction? (2) If we wish to predict a family of ADEs or even a specific ADE, can the predictive performance be enhanced by dividing the classification task into a cascading scheme: predicting first, on a coarse level, whether there is an ADE or not, and, in the former case, followed by a more specific prediction on which family the ADE belongs to, and then finally a prediction on the specific ADE within that particular family? In this study, we conduct a series of experiments using a real, clinical dataset comprising healthcare episodes that have been assigned one of eight ADE-related diagnosis codes and a set of randomly extracted episodes that have not been assigned any ADE code. It is shown that, when distinguishing between ADEs and non-ADEs, merging the various ADE labels prior to learning leads to significantly higher predictive performance in terms of accuracy and area under ROC curve. A cascade of random forests is moreover constructed to determine either the family of ADEs or the specific class label; here, the performance is indeed enhanced compared to directly employing a one-step prediction. This study concludes that, if predictive performance is of primary importance, the cascading scheme should be the recommended approach over employing a one-step prediction for detecting ADEs in EHRs.

  • 31.
    Zhao, Jing
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Detecting Adverse Drug Events Using Concept Hierarchies of Clinical Codes2014Ingår i: 2014 IEEE International Conference on Healthcare Informatics: Proceedings, IEEE Computer Society, 2014, s. 285-293Konferensbidrag (Refereegranskat)
    Abstract [en]

    Electronic health records (EHRs) provide a potentially valuable source of information for pharmacovigilance. However, adverse drug events (ADEs), which can be encoded in EHRs with specific diagnosis codes, are heavily under-reported. To provide more accurate estimates for drug safety surveillance, machine learning systems that are able to detect ADEs could be used to identify and suggest missing ADE-specific diagnosis codes. A fundamental consideration when building such systems is how to represent the EHR data to allow for accurate predictive modeling. In this study, two types of clinical code are used to represent drugs and diagnoses: the Anatomical Therapeutic Chemical Classification System (ATC) and the International Statistical Classification of Diseases and Health Problems (ICD). More specifically, it is investigated whether their hierarchical structure can be exploited to improve predictive performance. The use of random forests with feature sets that include only the original, low-level, codes is compared to using random forests with feature sets that contain all levels in the hierarchies. An empirical investigation using thirty datasets with different ADE targets is presented, demonstrating that the predictive performance, in terms of accuracy and area under ROC curve, can be significantly improved by exploiting codes on all levels in the hierarchies, compared to using only the low-level encoding. A further analysis is presented in which two strategies are employed for adding features level-wise according to the concept hierarchies: top-down, starting with the highest abstraction levels, and bottom-up, starting with the most specific encoding. The main finding from this subsequent analysis is that predictive performance can be kept at a high level even without employing the more specific levels in the concept hierarchies.

  • 32.
    Zhao, Jing
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap. Karolinska Institute, Sweden.
    Asker, Lars
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Handling Temporality of Clinical Events for Drug Safety Surveillance2015Ingår i: AMIA Annual Symposium Proceedings, ISSN 1559-4076, Vol. 2015, s. 1371-1380Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Using longitudinal data in electronic health records (EHRs) for post-marketing adverse drug event (ADE) detection allows for monitoring patients throughout their medical history. Machine learning methods have been shown to be efficient and effective in screening health records and detecting ADEs. How best to exploit historical data, as encoded by clinical events in EHRs is, however, not very well understood. In this study, three strategies for handling temporality of clinical events are proposed and evaluated using an EHR database from Stockholm, Sweden. The random forest learning algorithm is applied to predict fourteen ADEs using clinical events collected from different lengths of patient history. The results show that, in general, including longer patient history leads to improved predictive performance, and that assigning weights to events according to time distance from the ADE yields the biggest improvement.

1 - 32 av 32
RefereraExporteraLänk till träfflistan
Permanent länk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf