  • 1. Ahltorp, Magnus
    et al.
    Skeppstedt, Maria
    Kitajima, Shiho
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Rzepka, Rafal
    Araki, Kenji
    Expansion of medical vocabularies using distributional semantics on Japanese patient blogs. 2016. In: Journal of Biomedical Semantics, ISSN 2041-1480, E-ISSN 2041-1480, Vol. 7, 58. Article in journal (Refereed)
    Abstract [en]

    Background: Research on medical vocabulary expansion from large corpora has primarily been conducted using text written in English or similar languages, due to a limited availability of large biomedical corpora in most languages. Medical vocabularies are, however, essential also for text mining from corpora written in other languages than English and belonging to a variety of medical genres. The aim of this study was therefore to evaluate medical vocabulary expansion using a corpus very different from those previously used, in terms of grammar and orthographics, as well as in terms of text genre. This was carried out by applying a method based on distributional semantics to the task of extracting medical vocabulary terms from a large corpus of Japanese patient blogs.

    Methods: Distributional properties of terms were modelled with random indexing, followed by agglomerative hierarchical clustering of 3x100 seed terms from existing vocabularies, belonging to three semantic categories: Medical Finding, Pharmaceutical Drug and Body Part. By automatically extracting unknown terms close to the centroids of the created clusters, candidates for new terms to include in the vocabulary were suggested. The method was evaluated for its ability to retrieve the remaining n terms in existing medical vocabularies.

    Results: Removing case particles and using a context window size of 1 + 1 was a successful strategy for Medical Finding and Pharmaceutical Drug, while retaining case particles and using a window size of 8 + 8 was better for Body Part. For a 10n long candidate list, the use of different cluster sizes affected the result for Pharmaceutical Drug, while the effect was only marginal for the other two categories. For a list of top n candidates for Body Part, however, clusters with a size of up to two terms were slightly more useful than larger clusters. For Pharmaceutical Drug, the best settings resulted in a recall of 25 % for a candidate list of top n terms and a recall of 68 % for top 10n. For a candidate list of top 10n candidates, the second best results were obtained for Medical Finding: a recall of 58 %, compared to 46 % for Body Part. Only taking the top n candidates into account, however, resulted in a recall of 23 % for Body Part, compared to 16 % for Medical Finding.

    Conclusions: Different settings for corpus pre-processing, window sizes and cluster sizes were suitable for different semantic categories and for different lengths of candidate lists, showing the need to adapt parameters, not only to the language and text genre used, but also to the semantic category for which the vocabulary is to be expanded. The results show, however, that the investigated choices for pre-processing and parameter settings were successful, and that a Japanese blog corpus, which in many ways differs from those used in previous studies, can be a useful resource for medical vocabulary expansion.
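
    The evaluation reported in this abstract, recall of withheld vocabulary terms in candidate lists of length n and 10n, can be illustrated with a minimal sketch. The function name `recall_at_k` and the toy terms are hypothetical and are not taken from the paper; the ranking itself (closeness to cluster centroids) is assumed to have been computed already.

```python
def recall_at_k(ranked_candidates, held_out_terms, k):
    """Fraction of held-out terms that appear among the top-k candidates."""
    held_out = set(held_out_terms)
    if not held_out:
        raise ValueError("held_out_terms must not be empty")
    top_k = set(ranked_candidates[:k])
    return len(top_k & held_out) / len(held_out)


# Hypothetical toy data: candidates already ranked by closeness to a centroid.
ranked = ["aspirin", "tea", "ibuprofen", "walk", "codeine", "rain"]
held_out = ["aspirin", "codeine"]  # the n terms withheld from the vocabulary
n = len(held_out)

recall_top_n = recall_at_k(ranked, held_out, n)         # candidate list of length n
recall_top_10n = recall_at_k(ranked, held_out, 10 * n)  # candidate list of length 10n
```

    A longer candidate list can only keep or raise recall, which is consistent with the higher figures reported for top 10n than for top n.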

  • 2. Berndorfer, Stefan
    et al.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Automated Diagnosis Coding with Combined Text Representations. 2017. In: Studies in Health Technology and Informatics, ISSN 0926-9630, E-ISSN 1879-8365, Vol. 235, 201-2015 p. Article in journal (Refereed)
    Abstract [en]

    Automated diagnosis coding can be provided efficiently by learning predictive models from historical data; however, discriminating between thousands of codes while allowing a variable number of codes to be assigned is extremely difficult. Here, we explore various text representations and classification models for assigning ICD-9 codes to discharge summaries in MIMIC-III. It is shown that the relative effectiveness of the investigated representations depends on the frequency of the diagnosis code under consideration and that the best performance is obtained by combining models built using different representations.

  • 3.
    Dalianis, Hercules
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Hassel, Martin
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Skeppstedt, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Stockholm EPR Corpus: A Clinical Database Used to Improve Health Care. 2012. In: Proceedings of SLCT 2012: The Fourth Swedish Language Technology Conference, 2012, 17-18 p. Conference paper (Other academic)
    Abstract [en]

    The care of patients is well documented in health records. Despite being a valuable source of information that could be mined by computers and used to improve health care, health records are not readily available for research. Moreover, the narrative parts of the records are noisy and need to be interpreted by domain experts. In this abstract we describe our experiences of gaining access to a database of electronic health records for research. We also highlight some important issues in this domain and describe a number of possible applications, including comorbidity networks, detection of hospital-acquired infections and adverse drug reactions, as well as diagnosis coding support.

  • 4.
    Dalianis, Hercules
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Velupillai, Sumithra
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Weegar, Rebecka
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    HEALTH BANK - A Workbench for Data Science Applications in Healthcare. 2015. In: Industry Track Workshop, CEUR Workshop Proceedings, 2015, Vol. 1381, 1-18 p. Conference paper (Refereed)
    Abstract [en]

    The enormous amounts of data that are generated in the healthcare process and stored in electronic health record (EHR) systems are an underutilized resource that, with the use of data science applications, can be exploited to improve healthcare. To foster the development and use of data science applications in healthcare, there is a fundamental need for access to EHR data, which is typically not readily available to researchers and developers. A relatively rare exception is the large EHR database, the Stockholm EPR Corpus, comprising data from more than two million patients, that has been made available to a limited group of researchers at Stockholm University. Here, we describe a number of data science applications that have been developed using this database, demonstrating the potential reuse of EHR data to support healthcare and public health activities, as well as facilitate medical research. However, in order to realize the full potential of this resource, it needs to be made available to a larger community of researchers, as well as to industry actors. To that end, we envision the provision of an infrastructure around this database called HEALTH BANK – the Swedish Health Record Research Bank. It will function both as a workbench for the development of data science applications and as a data exploration tool, allowing epidemiologists, pharmacologists and other medical researchers to generate and evaluate hypotheses. Aggregated data will be fed into a pipeline for open e-access, while non-aggregated data will be provided to researchers within an ethical permission framework. We believe that HEALTH BANK has the potential to promote a growing industry around the development of data science applications that will ultimately increase the efficiency and effectiveness of healthcare.

  • 5.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Velupillai, Sumithra
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Swedification patterns of Latin and Greek affixes in clinical text. 2016. In: Nordic Journal of Linguistics, ISSN 0332-5865, E-ISSN 1502-4717, Vol. 39, no 1, 5-37 p. Article in journal (Refereed)
    Abstract [en]

    Swedish medical language is rich with Latin and Greek terminology which has undergone a Swedification since the 1980s. However, many original expressions are still used by clinical professionals. The goal of this study is to obtain precise quantitative measures of how the foreign terminology is manifested in Swedish clinical text. To this end, we explore the use of Latin and Greek affixes in Swedish medical texts in three genres: clinical text, scientific medical text and online medical information for laypersons. More specifically, we use frequency lists derived from tokenised Swedish medical corpora in the three domains, and extract word pairs belonging to types that display both the original and Swedified spellings. We describe six distinct patterns explaining the variation in the usage of Latin and Greek affixes in clinical text. The results show that to a large extent affixes in clinical text are Swedified and that prefixes are used more conservatively than suffixes.

  • 6.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Ensembles of Semantic Spaces: On Combining Models of Distributional Semantics with Applications in Healthcare. 2015. Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Distributional semantics allows models of linguistic meaning to be derived from observations of language use in large amounts of text. By modeling the meaning of words in semantic (vector) space on the basis of co-occurrence information, distributional semantics permits a quantitative interpretation of (relative) word meaning in an unsupervised setting, i.e., human annotations are not required. The ability to obtain inexpensive word representations in this manner helps to alleviate the bottleneck of fully supervised approaches to natural language processing, especially since models of distributional semantics are data-driven and hence agnostic to both language and domain.

    All that is required to obtain distributed word representations is a sizeable corpus; however, the composition of the semantic space is not only affected by the underlying data but also by certain model hyperparameters. While these can be optimized for a specific downstream task, there are currently limitations to the extent the many aspects of semantics can be captured in a single model. This dissertation investigates the possibility of capturing multiple aspects of lexical semantics by adopting the ensemble methodology within a distributional semantic framework to create ensembles of semantic spaces. To that end, various strategies for creating the constituent semantic spaces, as well as for combining them, are explored in a number of studies.

    The notion of semantic space ensembles is generalizable across languages and domains; however, the use of unsupervised methods is particularly valuable in low-resource settings, in particular when annotated corpora are scarce, as in the domain of Swedish healthcare. The semantic space ensembles are here empirically evaluated for tasks that have promising applications in healthcare. It is shown that semantic space ensembles – created by exploiting various corpora and data types, as well as by adjusting model hyperparameters such as the size of the context window and the strategy for handling word order within the context window – are able to outperform the use of any single constituent model on a range of tasks. The semantic space ensembles are used both directly for k-nearest neighbors retrieval and for semi-supervised machine learning. Applying semantic space ensembles to important medical problems facilitates the secondary use of healthcare data, which, despite its abundance and transformative potential, is grossly underutilized.

  • 7.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Learning multiple distributed prototypes of semantic categories for named entity recognition. 2015. In: International Journal of Data Mining and Bioinformatics, ISSN 1748-5681, Vol. 13, no 4, 395-411 p. Article in journal (Refereed)
    Abstract [en]

    The scarcity of large labelled datasets comprising clinical text that can be exploited within the paradigm of supervised machine learning creates barriers for the secondary use of data from electronic health records. It is therefore important to develop capabilities to leverage the large amounts of unlabelled data that, indeed, tend to be readily available. One technique utilises distributional semantics to create word representations in a wholly unsupervised manner and uses existing training data to learn prototypical representations of predefined semantic categories. Features describing whether a given word belongs to a certain category are then provided to the learning algorithm. It has been shown that using multiple distributional semantic models, each employing a different word order strategy, can lead to enhanced predictive performance. Here, another hyperparameter is also varied – the size of the context window – and an experimental investigation shows that this leads to further performance gains.

  • 8.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Semantic Spaces of Clinical Text: Leveraging Distributional Semantics for Natural Language Processing of Electronic Health Records. 2013. Licentiate thesis, comprehensive summary (Other academic)
    Abstract [en]

    The large amounts of clinical data generated by electronic health record systems are an underutilized resource, which, if tapped, has enormous potential to improve health care. Since the majority of this data is in the form of unstructured text, which is challenging to analyze computationally, there is a need for sophisticated clinical language processing methods. Unsupervised methods that exploit statistical properties of the data are particularly valuable due to the limited availability of annotated corpora in the clinical domain.

    Information extraction and natural language processing systems need to incorporate some knowledge of semantics. One approach exploits the distributional properties of language – more specifically, term co-occurrence information – to model the relative meaning of terms in high-dimensional vector space. Such methods have been used with success in a number of general language processing tasks; however, their application in the clinical domain has previously only been explored to a limited extent. By applying models of distributional semantics to clinical text, semantic spaces can be constructed in a completely unsupervised fashion. Semantic spaces of clinical text can then be utilized in a number of medically relevant applications.

    The application of distributional semantics in the clinical domain is here demonstrated in three use cases: (1) synonym extraction of medical terms, (2) assignment of diagnosis codes and (3) identification of adverse drug reactions. To apply distributional semantics effectively to a wide range of both general and, in particular, clinical language processing tasks, certain limitations or challenges need to be addressed, such as how to model the meaning of multiword terms and account for the function of negation: a simple means of incorporating paraphrasing and negation in a distributional semantic framework is here proposed and evaluated. The notion of ensembles of semantic spaces is also introduced; these are shown to outperform the use of a single semantic space on the synonym extraction task. This idea allows different models of distributional semantics, with different parameter configurations and induced from different corpora, to be combined. This is not least important in the clinical domain, as it allows potentially limited amounts of clinical data to be supplemented with data from other, more readily available sources. The importance of configuring the dimensionality of semantic spaces, particularly when – as is typically the case in the clinical domain – the vocabulary grows large, is also demonstrated.

  • 9.
    Henriksson, Aron
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Conway, Mike
    Duneld, Martin
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Chapman, Wendy W.
    Identifying Synonymy between SNOMED Clinical Terms of Varying Length Using Distributional Analysis of Electronic Health Records. 2013. In: AMIA Conference Proceedings Archive, American Medical Informatics Association, 2013, 600-609 p. Conference paper (Refereed)
    Abstract [en]

    Medical terminologies and ontologies are important tools for natural language processing of health record narratives. To account for the variability of language use, synonyms need to be stored in a semantic resource as textual instantiations of a concept. Developing such resources manually is, however, prohibitively expensive and likely to result in low coverage. To facilitate and expedite the process of lexical resource development, distributional analysis of large corpora provides a powerful data-driven means of (semi-)automatically identifying semantic relations, including synonymy, between terms. In this paper, we demonstrate how distributional analysis of a large corpus of electronic health records – the MIMIC-II database – can be employed to extract synonyms of SNOMED CT preferred terms. A distinctive feature of our method is its ability to identify synonymous relations between terms of varying length.

  • 10.
    Henriksson, Aron
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Dalianis, Hercules
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Kowalski, Stewart
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Generating Features for Named Entity Recognition by Learning Prototypes in Semantic Space: The Case of De-Identifying Health Records. 2014. In: 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE conference proceedings, 2014, 450-457 p. Conference paper (Refereed)
    Abstract [en]

    Creating sufficiently large annotated resources for supervised machine learning, and doing so for every problem and every domain, is prohibitively expensive. Techniques that leverage large amounts of unlabeled data, which are often readily available, may decrease the amount of data that needs to be annotated to obtain a certain level of performance, as well as improve performance when large annotated resources are indeed available. Here, the development of one such method is presented, where semantic features are generated by exploiting the available annotations to learn prototypical (vector) representations of each named entity class in semantic space, constructed by employing a model of distributional semantics (random indexing) over a large, unannotated, in-domain corpus. Binary features that describe whether a given word belongs to a specific named entity class are provided to the learning algorithm; the feature values are determined by calculating the (cosine) distance in semantic space to each of the learned prototype vectors and ascertaining whether they are below or above a given threshold, set to optimize Fβ-score. The proposed method is evaluated empirically in a series of experiments, where the case is health-record deidentification, a task that involves identifying protected health information (PHI) in text. It is shown that a conditional random fields model with access to the generated semantic features, in addition to a set of orthographic and syntactic features, significantly outperforms, in terms of F1-score, a baseline model without access to the semantic features. Moreover, the quality of the features is further improved by employing a number of slightly different models of distributional semantics in an ensemble. Finally, the way in which the features are generated allows one to optimize them for various Fβ-scores, giving some degree of control to trade off precision and recall.

    Methods that are able to improve performance on named entity recognition tasks by exploiting large amounts of unlabeled data may substantially reduce costs involved in creating annotated resources for every domain and every problem.
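
    The feature generation described in this abstract, binary features obtained by thresholding the cosine distance between a word vector and each class prototype, might look roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how prototypes are learned, so a simple centroid of the annotated word vectors is assumed here, and the threshold is a fixed hypothetical value rather than one tuned for Fβ-score. All names and vectors are illustrative.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are treated as maximally distant (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def centroid(vectors):
    """Assumed prototype: the element-wise mean of a class's word vectors."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def prototype_features(word_vector, prototypes, threshold):
    """One binary feature per class: is the word within `threshold` of the prototype?"""
    return {
        label: cosine_distance(word_vector, proto) <= threshold
        for label, proto in prototypes.items()
    }


# Hypothetical 2-dimensional word vectors for annotated training words.
annotated = {
    "NAME": [[1.0, 0.0], [0.9, 0.1]],
    "LOCATION": [[0.0, 1.0], [0.1, 0.9]],
}
prototypes = {label: centroid(vecs) for label, vecs in annotated.items()}

features = prototype_features([0.8, 0.2], prototypes, threshold=0.3)
```

    Raising or lowering `threshold` makes each feature fire more or less often, which is the kind of precision/recall control the abstract mentions.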

  • 11.
    Henriksson, Aron
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Hassel, Martin
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Optimizing the Dimensionality of Clinical Term Spaces for Improved Diagnosis Coding Support. 2013. In: Proceedings of the 4th International Louhi Workshop on Health Document Text Mining and Information Analysis (Louhi 2013) / [ed] Hanna Suominen, NICTA, 2013. Conference paper (Refereed)
    Abstract [en]

    In natural language processing, dimensionality reduction is a common technique to reduce complexity that simultaneously addresses the sparseness property of language. It is also used as a means to capture some latent structure in text, such as the underlying semantics. Dimensionality reduction is an important property of the word space model, not least in random indexing, where the dimensionality is a predefined model parameter. In this paper, we demonstrate the importance of dimensionality optimization and discuss correlations between dimensionality and the size of the vocabulary. This is of particular importance in the clinical domain, where the level of noise in the text leads to a large vocabulary; it may also mitigate the effect of exploding vocabulary sizes when modeling multiword terms as single tokens. A system that automatically assigns diagnosis codes to patient record entries is shown to improve by up to 18 percentage points by manually optimizing the dimensionality.
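
    Random indexing, as referred to in this abstract, assigns each term a sparse random index vector of a predefined dimensionality and builds a term's context vector by summing the index vectors of its neighbours within a window. The following is a minimal sketch under stated assumptions, not the system evaluated in the paper: the function names, the toy sentences, and the default values for `dim`, `nonzeros` and `window` are all hypothetical.

```python
import random


def make_index_vector(dim, nonzeros, rng):
    """Sparse ternary vector: `nonzeros` random positions, half +1 and half -1."""
    vec = [0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzeros)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def random_indexing(sentences, dim=20, nonzeros=4, window=1, seed=0):
    """Return a context vector of length `dim` for every term in `sentences`."""
    rng = random.Random(seed)
    index, context = {}, {}
    for tokens in sentences:
        for tok in tokens:
            if tok not in index:
                index[tok] = make_index_vector(dim, nonzeros, rng)
                context[tok] = [0] * dim
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                # Add the neighbour's index vector to this term's context vector.
                for d, val in enumerate(index[tokens[j]]):
                    context[tok][d] += val
    return context


# Hypothetical toy corpus of tokenised record entries.
sentences = [["patient", "has", "fever"], ["patient", "has", "cough"]]
space = random_indexing(sentences, dim=20)
```

    Because `dim` is fixed in advance, the vectors do not grow with the vocabulary; the abstract's point is that this value still needs to be chosen with the vocabulary size in mind.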

  • 12.
    Henriksson, Aron
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. Karolinska Institutet, Sweden.
    Dalianis, Hercules
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Duneld, Martin
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Identifying adverse drug event information in clinical notes with distributional semantic representations of context. 2015. In: Journal of Biomedical Informatics, ISSN 1532-0464, E-ISSN 1532-0480, Vol. 57, 333-349 p. Article in journal (Refereed)
    Abstract [en]

    For the purpose of post-marketing drug safety surveillance, which has traditionally relied on the voluntary reporting of individual cases of adverse drug events (ADEs), other sources of information are now being explored, including electronic health records (EHRs), which give us access to enormous amounts of longitudinal observations of the treatment of patients and their drug use. Adverse drug events, which can be encoded in EHRs with certain diagnosis codes, are, however, heavily underreported. It is therefore important to develop capabilities to process, by means of computational methods, the more unstructured EHR data in the form of clinical notes, where clinicians may describe and reason around suspected ADEs. In this study, we report on the creation of an annotated corpus of Swedish health records for the purpose of learning to identify information pertaining to ADEs present in clinical notes. To this end, three key tasks are tackled: recognizing relevant named entities (disorders, symptoms, drugs), labeling attributes of the recognized entities (negation, speculation, temporality), and relationships between them (indication, adverse drug event). For each of the three tasks, leveraging models of distributional semantics – i.e., unsupervised methods that exploit co-occurrence information to model, typically in vector space, the meaning of words – and, in particular, combinations of such models, is shown to improve the predictive performance. The ability to make use of such unsupervised methods is critical when faced with large amounts of sparse and high-dimensional data, especially in domains where annotated resources are scarce.

  • 13.
    Henriksson, Aron
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. Karolinska University Hospital.
    Hassel, Martin
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Dalianis, Hercules
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Exploration of Adverse Drug Reactions in Semantic Vector Space Models of Clinical Text. 2012. In:  , 2012. Conference paper (Refereed)
    Abstract [en]

    A novel method for identifying potential side-effects to medications through large-scale analysis of clinical data is here introduced and evaluated. By calculating distributional similarities for medication-symptom pairs based on co-occurrence information in a large clinical corpus, many known adverse drug reactions are successfully identified. These preliminary results suggest that semantic vector space models of clinical text could also be used to generate hypotheses about potentially unknown adverse drug reactions. In the best model, 50% of the terms in a list of twenty are considered to be conceivable side-effects. Among the medication-symptom pairs, however, diagnostic indications and terms related to the medication in other ways also appear. These relations need to be distinguished in a more refined method for detecting adverse drug reactions.

  • 14.
    Henriksson, Aron
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Moen, Hans
    Skeppstedt, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Daudaravičius, Vidas
    Duneld, Martin
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Synonym extraction and abbreviation expansion with ensembles of semantic spaces. 2014. In: Journal of Biomedical Semantics, ISSN 2041-1480, E-ISSN 2041-1480, Vol. 5, no 6. Article in journal (Refereed)
    Abstract [en]

    Background: Terminologies that account for variation in language use by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in the medical domain, manual construction of semantic resources that accurately reflect language use is both costly and challenging, often resulting in low coverage. Although models of distributional semantics applied to large corpora provide a potential means of supporting development of such resources, their ability to isolate synonymy from other semantic relations is limited. Their application in the clinical domain has also only recently begun to be explored. Combining distributional models and applying them to different types of corpora may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs.

    Results: A combination of two distributional models – Random Indexing and Random Permutation – employed in conjunction with a single corpus outperforms using either of the models in isolation. Furthermore, combining semantic spaces induced from different types of corpora – a corpus of clinical text and a corpus of medical journal articles – further improves results, outperforming a combination of semantic spaces induced from a single source, as well as a single semantic space induced from the conjoint corpus. A combination strategy that simply sums the cosine similarity scores of candidate terms is generally the most profitable out of the ones explored. Finally, applying simple post-processing filtering rules yields substantial performance gains on the tasks of extracting abbreviation-expansion pairs, but not synonyms. The best results, measured as recall in a list of ten candidate terms, for the three tasks are: 0.39 for abbreviations to long forms, 0.33 for long forms to abbreviations, and 0.47 for synonyms.

    Conclusions: This study demonstrates that ensembles of semantic spaces can yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits further exploration, allows different distributional models – with different model parameters – and different types of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural language processing tasks.
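
    The combination strategy singled out in this abstract, summing the cosine similarity scores of candidate terms across semantic spaces and keeping a list of ten candidates, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy spaces and their vectors are hypothetical, and the handling of terms missing from a space (they simply contribute nothing) is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is a zero vector (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_summed_similarity(query, spaces, top_k=10):
    """Sum each candidate's similarity to `query` over all spaces, then rank."""
    totals = {}
    for space in spaces:
        if query not in space:
            continue  # this space cannot score the query term
        query_vec = space[query]
        for term, vec in space.items():
            if term != query:
                totals[term] = totals.get(term, 0.0) + cosine_similarity(query_vec, vec)
    # Highest summed score first; ties broken alphabetically for determinism.
    return sorted(totals, key=lambda t: (-totals[t], t))[:top_k]


# Two hypothetical semantic spaces (term -> vector), e.g. from different corpora.
clinical_space = {"mi": [1.0, 0.0], "heart attack": [0.9, 0.1], "fever": [0.0, 1.0]}
journal_space = {"mi": [0.0, 1.0], "heart attack": [0.2, 0.8], "fever": [1.0, 0.0]}

candidates = rank_by_summed_similarity("mi", [clinical_space, journal_space])
```

    Summing raw scores lets a candidate that is moderately similar in every space outrank one that is highly similar in only one, which is one plausible reading of why the ensemble helps.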

  • 15.
    Henriksson, Aron
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Moen, Hans
    Skeppstedt, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Eklund, Ann-Marie
    Daudaravičius, Vidas
    Hassel, Martin
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Synonym Extraction of Medical Terms from Clinical Text Using Combinations of Word Space Models. 2012. In: Proceedings of the 5th International Symposium on Semantic Mining in Biomedicine (SMBM 2012), 2012, 10-17 p. Conference paper (Refereed)
    Abstract [en]

    In information extraction, it is useful to know if two signifiers have the same or very similar semantic content. Maintaining such information in a controlled vocabulary is, however, costly. Here it is demonstrated how synonyms of medical terms can be extracted automatically from a large corpus of clinical text using distributional semantics. By combining Random Indexing and Random Permutation, different lexical semantic aspects are captured, effectively increasing our ability to identify synonymic relations between terms. 44% of 340 synonym pairs from MeSH are successfully extracted in a list of ten suggestions. The models can also be used to map abbreviations to their full-length forms; simple pattern-based filtering of the suggestions yields substantial improvements.

  • 16.
    Henriksson, Aron
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Skeppstedt, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. Karolinska Institute, Sweden.
    Duneld, Martin
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Conway, Mike
    Corpus-Driven Terminology Development: Populating Swedish SNOMED CT with Synonyms Extracted from Electronic Health Records. 2013. In: Proceedings of the 2013 Workshop on Biomedical Natural Language Processing (BioNLP 2013), Association for Computational Linguistics, 2013, 36-44 p. Conference paper (Refereed)
    Abstract [en]

    The various ways in which one can refer to the same clinical concept need to be accounted for in a semantic resource such as SNOMED CT. Developing terminological resources manually is, however, prohibitively expensive and likely to result in low coverage, especially given the high variability of language use in clinical text. To support this process, distributional methods can be employed in conjunction with a large corpus of electronic health records to extract synonym candidates for clinical terms. In this paper, we exemplify the potential of our proposed method using the Swedish version of SNOMED CT, which currently lacks synonyms. A medical expert inspects two thousand term pairs generated by two semantic spaces -- one of which models multiword terms in addition to single words -- for one hundred preferred terms of the semantic types disorder and finding.

  • 17.
    Henriksson, Aron
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Zhao, Jing
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Dalianis, Hercules
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Modeling Electronic Health Records in Ensembles of Semantic Spaces for Adverse Drug Event Detection. 2015. In: 2015 IEEE International Conference on Bioinformatics and Biomedicine: Proceedings / [ed] Jun (Luke) Huan et al., IEEE Computer Society, 2015, 343-350 p. Conference paper (Refereed)
    Abstract [en]

    Electronic health records (EHRs) are emerging as a potentially valuable source for pharmacovigilance; however, adverse drug events (ADEs), which can be encoded in EHRs by a set of diagnosis codes, are heavily underreported. Alerting systems, able to detect potential ADEs on the basis of patient-specific EHR data, would help to mitigate this problem. To that end, the use of machine learning has proven to be both efficient and effective; however, challenges remain in representing the heterogeneous EHR data, which moreover tends to be high-dimensional and exceedingly sparse, in a manner conducive to learning high-performing predictive models. Prior work has shown that distributional semantics – that is, natural language processing methods that, traditionally, model the meaning of words in semantic (vector) space on the basis of co-occurrence information – can be exploited to create effective representations of sequential EHR data, not only free-text in clinical notes but also various clinical events such as diagnoses, drugs and measurements. When modeling data in semantic space, an important design decision concerns the size of the context window around an object of interest, which governs the scope of co-occurrence information that is taken into account and affects the composition of the resulting semantic space. Here, we report on experiments conducted on 27 clinical datasets, demonstrating that performance can be significantly improved by modeling EHR data in ensembles of semantic spaces, consisting of multiple semantic spaces built with different context window sizes. A follow-up investigation is conducted to study the impact on predictive performance as increasingly more semantic spaces are included in the ensemble, demonstrating that accuracy tends to improve with the number of semantic spaces, albeit not monotonically so. Finally, a number of different strategies for combining the semantic spaces are explored, demonstrating the advantage of early (feature) fusion over late (classifier) fusion. Ensembles of semantic spaces allow multiple views of (sparse) data to be captured (densely) and thereby enable improved performance to be obtained on the task of detecting ADEs in EHRs.

  • 18.
    Henriksson, Aron
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Zhao, Jing
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Dalianis, Hercules
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Modeling Heterogeneous Clinical Sequence Data in Semantic Space for Adverse Drug Event Detection. In: IEEE International Conference on Data Science and Advanced Analytics (DSAA), IEEE conference proceedings. Conference paper (Refereed)
    Abstract [en]

    The enormous amounts of data that are continuously recorded in electronic health record systems offer ample opportunities for data science applications to improve healthcare. There are, however, challenges involved in using such data for machine learning, such as high dimensionality and sparsity, as well as an inherent heterogeneity that does not allow the distinct types of clinical data to be treated in an identical manner. On the other hand, there are also similarities across data types that may be exploited, e.g., the possibility of representing some of them as sequences. Here, we apply the notions underlying distributional semantics, i.e., methods that model the meaning of words in semantic (vector) space on the basis of co-occurrence information, to four distinct types of clinical data: free-text notes, on the one hand, and clinical events, in the form of diagnosis codes, drug codes and measurements, on the other hand. Each semantic space contains continuous vector representations for every unique word and event, which can then be used to create representations of, e.g., care episodes that, in turn, can be exploited by the learning algorithm. This approach does not only reduce sparsity, but also takes into account, and explicitly models, similarities between various items, and it does so in an entirely data-driven fashion. Here, we report on a series of experiments using the random forest learning algorithm that demonstrate the effectiveness, in terms of accuracy and area under ROC curve, of the proposed representation form over the commonly used bag-of-items counterpart. The experiments are conducted on 27 real datasets that each involves the (binary) classification task of detecting a particular adverse drug event. It is also shown that combining structured and unstructured data leads to significant improvements over using only one of them.

  • 19.
    Henriksson, Aron
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Zhao, Jing
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Dalianis, Hercules
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Ensembles of randomized trees using diverse distributed representations of clinical events. 2016. In: BMC Medical Informatics and Decision Making, ISSN 1472-6947, E-ISSN 1472-6947, Vol. 16, 69. Article in journal (Refereed)
    Abstract [en]

    Background: Learning deep representations of clinical events based on their distributions in electronic health records has been shown to allow for subsequent training of higher-performing predictive models compared to the use of shallow, count-based representations. The predictive performance may be further improved by utilizing multiple representations of the same events, which can be obtained by, for instance, manipulating the representation learning procedure. The question, however, remains how to make best use of a set of diverse representations of clinical events – modeled in an ensemble of semantic spaces – for the purpose of predictive modeling. Methods: Three different ways of exploiting a set of (ten) distributed representations of four types of clinical events – diagnosis codes, drug codes, measurements, and words in clinical notes – are investigated in a series of experiments using ensembles of randomized trees. Here, the semantic space ensembles are obtained by varying the context window size in the representation learning procedure. The proposed method trains a forest wherein each tree is built from a bootstrap replicate of the training set whose entire original feature set is represented in a randomly selected set of semantic spaces – corresponding to the considered data types – of a given context window size. Results: The proposed method significantly outperforms concatenating the multiple representations of the bagged dataset; it also significantly outperforms representing, for each decision tree, only a subset of the features in a randomly selected set of semantic spaces. A follow-up analysis indicates that the proposed method exhibits less diversity while significantly improving average tree performance. It is also shown that the size of the semantic space ensemble has a significant impact on predictive performance and that performance tends to improve as the size increases. 
    Conclusions: The strategy for utilizing a set of diverse distributed representations of clinical events when constructing ensembles of randomized trees has a significant impact on predictive performance. The most successful strategy – significantly outperforming the considered alternatives – involves randomly sampling distributed representations of the clinical events when building each decision tree in the forest.

  • 20.
    Skeppstedt, Maria
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Ahltorp, Magnus
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Vocabulary Expansion by Semantic Extraction of Medical Terms. 2013. In: Proceedings of the 5th International Symposium on Languages in Biology and Medicine, 2013, 63-68 p. Conference paper (Refereed)
    Abstract [en]

    Automatic methods for vocabulary expansion are valuable in supporting the development of terminological resources. Here, we evaluate two methods based on distributional semantics for extracting terms that belong to a certain semantic category. In a list of 1000 terms extracted from a corpus of Swedish medical text, the best method obtains a recall of 0.53 and 0.88, respectively, for identifying 90 terms that are known to belong to the semantic categories Medical Finding and Pharmaceutical Drug.

  • 21. Tengstrand, Lisa
    et al.
    Megyesi, Beata
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Duneld, Martin
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    EACL - Expansion of Abbreviations in CLinical text. 2014. In: Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Population, Association for Computational Linguistics, 2014. Conference paper (Refereed)
    Abstract [en]

    In the medical domain, especially in clinical texts, non-standard abbreviations are prevalent, which impairs readability for patients. To ease the understanding of the physicians’ notes, abbreviations need to be identified and expanded to their original forms. We present a distributional semantic approach to find candidates of the original form of the abbreviation, and combine this with Levenshtein distance to choose the correct candidate among the semantically related words. We apply the method to radiology reports and medical journal texts, and compare the results to general Swedish. The results show that the correct expansion of the abbreviation can be found in 40% of the cases, an improvement by 24 percentage points compared to the baseline (0.16), and an increase by 22 percentage points compared to using word space models alone (0.18).
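
The candidate selection step described in the abstract above can be sketched as follows. The abstract does not specify how the semantic candidates and the Levenshtein distance are combined, so this sketch makes a simple assumption: among the candidates returned by a word space model, pick the one with the smallest edit distance to the abbreviation. The function names and the Swedish toy words are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def choose_expansion(abbreviation, candidates):
    """Return the candidate closest to the abbreviation in edit distance.

    `candidates` stands in for the semantically related words suggested by a
    word space model; ties go to the earliest candidate in the list.
    """
    return min(candidates, key=lambda word: levenshtein(abbreviation, word))


best = choose_expansion("pat", ["doktor", "patient", "sjukhus"])
```

Plain edit distance favours short candidates, so a practical system would probably normalise or constrain the distance in some way; that choice is left open here.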

  • 22.
    Velupillai, Sumithra
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Duneld, Martin
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Skeppstedt, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Dalianis, Hercules
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Louhi 2014: Special issue on health text mining and information analysis. 2015. Conference proceedings (editor) (Refereed)
  • 23.
    Velupillai, Sumithra
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Duneld, Martin
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Skeppstedt, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Dalianis, Hercules
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Louhi 2014: Special issue on health text mining and information analysis: introduction. 2015. In: BMC Medical Informatics and Decision Making, ISSN 1472-6947, E-ISSN 1472-6947, Vol. 2, no SI, 1-3 p. Article in journal (Refereed)
  • 24.
    Zhao, Jing
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Learning temporal weights of clinical events using variable importance. 2016. In: BMC Medical Informatics and Decision Making, ISSN 1472-6947, E-ISSN 1472-6947, Vol. 16, no Suppl. 2, 71. Article in journal (Refereed)
    Abstract [en]

    Background: Longitudinal data sources, such as electronic health records (EHRs), are very valuable for monitoring adverse drug events (ADEs). However, ADEs are heavily under-reported in EHRs. Using machine learning algorithms to automatically detect patients that should have had ADEs reported in their health records is an efficient and effective solution. One of the challenges to that end is how to take into account the temporality of clinical events, which are time stamped in EHRs, and how to provide these as features for machine learning algorithms to exploit. Previous research on this topic suggests that representing EHR data as a bag of temporally weighted clinical events is promising; however, the weights were in that case pre-assigned according to their time stamps, which is limited and potentially less accurate. This study therefore focuses on how to learn weights that effectively take into account the temporality and importance of clinical events for ADE detection. Methods: Variable importance obtained from the random forest learning algorithm is used for extracting temporal weights. Two strategies are proposed for applying the learned weights: weighted aggregation and weighted sampling. The first strategy aggregates the weighted clinical events from different time windows to form new features; the second strategy retains the original features but samples them by using their weights as probabilities when building each tree in the forest. The predictive performance of random forest models using the learned weights with the two strategies is compared to using pre-assigned weights. In addition, to assess the sensitivity of the weight-learning procedure, weights from different granularity levels are evaluated and compared. Results: In the weighted sampling strategy, using learned weights significantly improves the predictive performance, in comparison to using pre-assigned weights; however, there is no significant difference between them in the weighted aggregation strategy. Moreover, the granularity of the weight learning procedure has a significant impact on the former, but not on the latter. Conclusions: Learning temporal weights is significantly beneficial in terms of predictive performance with the weighted sampling strategy. Moreover, weighted aggregation generally diminishes the impact of temporal weighting of the clinical events, irrespective of whether the weights are pre-assigned or learned.

  • 25.
    Zhao, Jing
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Asker, Lars
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Detecting Adverse Drug Events with Multiple Representations of Clinical Measurements. 2014. In: 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM): Proceedings, IEEE Computer Society, 2014, 536-543 p. Conference paper (Refereed)
    Abstract [en]

    Adverse drug events (ADEs) are grossly under-reported in electronic health records (EHRs). This could be mitigated by methods that are able to detect ADEs in EHRs, thereby allowing for missing ADE-specific diagnosis codes to be identified and added. A crucial aspect of constructing such systems is to find proper representations of the data in order to allow the predictive modeling to be as accurate as possible. One category of EHR data that can be used as an indicator of ADEs is clinical measurements. However, using clinical measurements as features is not unproblematic, due to the high rate of missing values and the fact that they can be repeated a variable number of times in each patient health record. In this study, five basic representations of clinical measurements are proposed and evaluated to handle these two problems. An empirical investigation using random forest on 27 datasets from a real EHR database with different ADE targets is presented, demonstrating that the predictive performance, in terms of accuracy and area under ROC curve, is higher when representing clinical measurements crudely as whether they were taken or how many times they were taken by a patient. Furthermore, a sixth alternative, combining all five basic representations, significantly outperforms using any of the basic representations except for one. A subsequent analysis of variable importance is also conducted with this fused feature set, showing that when clinical measurements have a high missing rate, the number of times they were taken by one patient is ranked as more informative than looking at their actual values. The observation from random forest is also confirmed empirically using other commonly employed classifiers. This study demonstrates that the way in which clinical measurements from EHRs are presented has a high impact for ADE detection, and that using multiple representations outperforms using a basic representation.

  • 26.
    Zhao, Jing
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Asker, Lars
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Predictive modeling of structured electronic health records for adverse drug event detection. 2015. In: BMC Medical Informatics and Decision Making, ISSN 1472-6947, E-ISSN 1472-6947, Vol. 15, no SI. Article in journal (Refereed)
    Abstract [en]

    Background: The digitization of healthcare data, resulting from the increasingly widespread adoption of electronic health records, has greatly facilitated its analysis by computational methods and thereby enabled large-scale secondary use thereof. This can be exploited to support public health activities such as pharmacovigilance, wherein the safety of drugs is monitored to inform regulatory decisions about sustained use. To that end, electronic health records have emerged as a potentially valuable data source, providing access to longitudinal observations of patient treatment and drug use. A nascent line of research concerns predictive modeling of healthcare data for the automatic detection of adverse drug events, which presents its own set of challenges: it is not yet clear how to represent the heterogeneous data types in a manner conducive to learning high-performing machine learning models. Methods: Datasets from an electronic health record database are used for learning predictive models with the purpose of detecting adverse drug events. The use and representation of two data types, as well as their combination, are studied: clinical codes, describing prescribed drugs and assigned diagnoses, and measurements. Feature selection is conducted on the various types of data to reduce dimensionality and sparsity, while allowing for an in-depth feature analysis of the usefulness of each data type and representation. Results: Within each data type, combining multiple representations yields better predictive performance compared to using any single representation. The use of clinical codes for adverse drug event detection significantly outperforms the use of measurements; however, there is no significant difference over datasets between using only clinical codes and their combination with measurements. For certain adverse drug events, the combination does, however, outperform using only clinical codes. 
    Feature selection leads to increased predictive performance for both data types, in isolation and combined. Conclusions: We have demonstrated how machine learning can be applied to electronic health records for the purpose of detecting adverse drug events and proposed solutions to some of the challenges this presents, including how to represent the various data types. Overall, clinical codes are more useful than measurements and, in specific cases, it is beneficial to combine the two.

  • 27.
    Zhao, Jing
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Cascading Adverse Drug Event Detection in Electronic Health Records. 2015. In: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA): Proceedings, IEEE Computer Society, 2015. Conference paper (Refereed)
    Abstract [en]

    The ability to detect adverse drug events (ADEs) in electronic health records (EHRs) is useful in many medical applications, such as alerting systems that indicate when an ADE-specific diagnosis code should be assigned. Automating the detection of ADEs can be attempted by applying machine learning to existing, labeled EHR data. How to do this in an effective manner is, however, an open question. The issues addressed in this study concern the granularity of the classification task: (1) If we wish to predict the occurrence of ADE, is it advantageous to conflate the various ADE class labels prior to learning, or should they be merged post prediction? (2) If we wish to predict a family of ADEs or even a specific ADE, can the predictive performance be enhanced by dividing the classification task into a cascading scheme: predicting first, on a coarse level, whether there is an ADE or not, and, in the former case, followed by a more specific prediction on which family the ADE belongs to, and then finally a prediction on the specific ADE within that particular family? In this study, we conduct a series of experiments using a real, clinical dataset comprising healthcare episodes that have been assigned one of eight ADE-related diagnosis codes and a set of randomly extracted episodes that have not been assigned any ADE code. It is shown that, when distinguishing between ADEs and non-ADEs, merging the various ADE labels prior to learning leads to significantly higher predictive performance in terms of accuracy and area under ROC curve. A cascade of random forests is moreover constructed to determine either the family of ADEs or the specific class label; here, the performance is indeed enhanced compared to directly employing a one-step prediction. This study concludes that, if predictive performance is of primary importance, the cascading scheme should be the recommended approach over employing a one-step prediction for detecting ADEs in EHRs.

  • 28.
    Zhao, Jing
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Detecting Adverse Drug Events Using Concept Hierarchies of Clinical Codes. 2014. In: 2014 IEEE International Conference on Healthcare Informatics: Proceedings, IEEE Computer Society, 2014, 285-293 p. Conference paper (Refereed)
    Abstract [en]

    Electronic health records (EHRs) provide a potentially valuable source of information for pharmacovigilance. However, adverse drug events (ADEs), which can be encoded in EHRs with specific diagnosis codes, are heavily under-reported. To provide more accurate estimates for drug safety surveillance, machine learning systems that are able to detect ADEs could be used to identify and suggest missing ADE-specific diagnosis codes. A fundamental consideration when building such systems is how to represent the EHR data to allow for accurate predictive modeling. In this study, two types of clinical code are used to represent drugs and diagnoses: the Anatomical Therapeutic Chemical Classification System (ATC) and the International Statistical Classification of Diseases and Related Health Problems (ICD). More specifically, it is investigated whether their hierarchical structure can be exploited to improve predictive performance. The use of random forests with feature sets that include only the original, low-level, codes is compared to using random forests with feature sets that contain all levels in the hierarchies. An empirical investigation using thirty datasets with different ADE targets is presented, demonstrating that the predictive performance, in terms of accuracy and area under ROC curve, can be significantly improved by exploiting codes on all levels in the hierarchies, compared to using only the low-level encoding. A further analysis is presented in which two strategies are employed for adding features level-wise according to the concept hierarchies: top-down, starting with the highest abstraction levels, and bottom-up, starting with the most specific encoding. The main finding from this subsequent analysis is that predictive performance can be kept at a high level even without employing the more specific levels in the concept hierarchies.
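
The idea of adding features from every level of a code hierarchy can be illustrated for ATC codes, whose five levels correspond to prefixes of 1, 3, 4, 5 and 7 characters. The sketch below is only an illustration: the function names are hypothetical, and counting occurrences per level is an assumption, since the abstract does not say how the features are encoded.

```python
from collections import Counter

# Prefix lengths of the five ATC levels, from anatomical main group
# (1 character) down to chemical substance (7 characters).
ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)


def atc_levels(code):
    """Return the code's prefix at each ATC level, from most general to most specific."""
    return [code[:length] for length in ATC_LEVEL_LENGTHS if length <= len(code)]


def hierarchical_features(codes):
    """Count every hierarchy level of every code in one patient record.

    Using counts rather than binary indicators is an assumption of this sketch.
    """
    features = Counter()
    for code in codes:
        features.update(atc_levels(code))
    return features


record = hierarchical_features(["A10BA02", "A10BB01"])
```

Two different substances that share a higher-level group, as in the toy record, end up contributing to the same higher-level feature, which is what makes the representation less sparse than using only the most specific codes.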

  • 29.
    Zhao, Jing
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. Karolinska Institute, Sweden.
    Asker, Lars
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Handling Temporality of Clinical Events for Drug Safety Surveillance. 2015. In: AMIA Annual Symposium Proceedings, ISSN 1559-4076, Vol. 2015, 1371-1380 p. Article in journal (Refereed)
    Abstract [en]

    Using longitudinal data in electronic health records (EHRs) for post-marketing adverse drug event (ADE) detection allows for monitoring patients throughout their medical history. Machine learning methods have been shown to be efficient and effective in screening health records and detecting ADEs. How best to exploit historical data, as encoded by clinical events in EHRs is, however, not very well understood. In this study, three strategies for handling temporality of clinical events are proposed and evaluated using an EHR database from Stockholm, Sweden. The random forest learning algorithm is applied to predict fourteen ADEs using clinical events collected from different lengths of patient history. The results show that, in general, including longer patient history leads to improved predictive performance, and that assigning weights to events according to time distance from the ADE yields the biggest improvement.
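
Weighting clinical events by their time distance from the ADE, as described in the abstract above, might look like the following minimal sketch. The weighting function 1 / (1 + days) is purely an assumption chosen for illustration, as are the function name and the toy event codes; the abstract does not state which weighting scheme was used.

```python
from collections import defaultdict


def weighted_bag(events, weight=lambda days: 1.0 / (1.0 + days)):
    """Build a bag of clinical events in which recent events count more.

    `events` is an iterable of (code, days_before_target) pairs. The default
    weight function is a hypothetical choice, not the one used in the paper.
    """
    features = defaultdict(float)
    for code, days_before in events:
        features[code] += weight(days_before)
    return dict(features)


history = [("A10BA02", 0), ("A10BA02", 1), ("I10", 3)]
bag = weighted_bag(history)
```

Passing a constant function such as `lambda days: 1.0` recovers an unweighted bag of events, which makes it easy to compare the two representations on the same data.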
