Ensembles of Semantic Spaces: On Combining Models of Distributional Semantics with Applications in Healthcare
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2015 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Distributional semantics allows models of linguistic meaning to be derived from observations of language use in large amounts of text. By modeling the meaning of words in semantic (vector) space on the basis of co-occurrence information, distributional semantics permits a quantitative interpretation of (relative) word meaning in an unsupervised setting, i.e., human annotations are not required. The ability to obtain inexpensive word representations in this manner helps to alleviate the bottleneck of fully supervised approaches to natural language processing, especially since models of distributional semantics are data-driven and hence agnostic to both language and domain.

All that is required to obtain distributed word representations is a sizeable corpus; however, the composition of the semantic space is not only affected by the underlying data but also by certain model hyperparameters. While these can be optimized for a specific downstream task, there are currently limitations to the extent the many aspects of semantics can be captured in a single model. This dissertation investigates the possibility of capturing multiple aspects of lexical semantics by adopting the ensemble methodology within a distributional semantic framework to create ensembles of semantic spaces. To that end, various strategies for creating the constituent semantic spaces, as well as for combining them, are explored in a number of studies.

The notion of semantic space ensembles is generalizable across languages and domains; however, the use of unsupervised methods is particularly valuable in low-resource settings, in particular when annotated corpora are scarce, as in the domain of Swedish healthcare. The semantic space ensembles are here empirically evaluated for tasks that have promising applications in healthcare. It is shown that semantic space ensembles – created by exploiting various corpora and data types, as well as by adjusting model hyperparameters such as the size of the context window and the strategy for handling word order within the context window – are able to outperform the use of any single constituent model on a range of tasks. The semantic space ensembles are used both directly for k-nearest neighbors retrieval and for semi-supervised machine learning. Applying semantic space ensembles to important medical problems facilitates the secondary use of healthcare data, which, despite its abundance and transformative potential, is grossly underutilized.
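As a rough illustration of how a semantic space can be derived from co-occurrence information, and of how the context window size changes its composition, the following minimal sketch counts co-occurrences over a toy corpus. The corpus, the function names `build_space` and `cosine`, and the use of raw counts are assumptions made for illustration; the thesis itself relies on models such as random indexing, and word order is not modeled here.

```python
from collections import Counter, defaultdict
import math

def build_space(sentences, window):
    """Count co-occurrences within `window` tokens on each side of a target word."""
    space = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    space[target][tokens[j]] += 1
    return space

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Toy corpus, invented for this sketch.
corpus = [
    "patient reports severe headache after medication".split(),
    "patient reports mild headache after surgery".split(),
    "patient denies chest pain after surgery".split(),
]

# Two constituent spaces that differ only in the context window size.
narrow = build_space(corpus, window=1)
wide = build_space(corpus, window=3)

# "severe" and "mild" occur in identical narrow contexts, so they end up close.
print(cosine(narrow["severe"], narrow["mild"]))
```

With a window of 1, "headache" only co-occurs with its immediate neighbours; with a window of 3 it also picks up "patient", which is the kind of difference between constituent spaces that an ensemble can exploit.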

Place, publisher, year, edition, pages
Stockholm: Department of Computer and Systems Sciences, Stockholm University, 2015. 95 p.
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 15-021
Keywords [en]
natural language processing, machine learning, distributional semantics, ensemble learning, semantic space ensembles, medical informatics, electronic health records
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-122465; ISBN: 978-91-7649-302-1 (print); OAI: oai:DiVA.org:su-122465; DiVA: diva2:866466
Public defence
2015-12-17, Lilla hörsalen, NOD-huset, Borgarfjordsgatan 12, Kista, 13:00 (English)
Opponent
Supervisors
Project
High-Performance Data Mining for Drug Effect Detection
Funder
Swedish Foundation for Strategic Research (SSF), IIS11-0053
Note

At the time of the doctoral defense, the following papers were unpublished and had a status as follows: Paper 4 and 5: Unpublished conference papers.

Available from: 2015-11-25 Created: 2015-11-02 Last updated: 2015-11-13 Bibliographically approved
List of papers
1. Synonym extraction and abbreviation expansion with ensembles of semantic spaces
2014 (English) In: Journal of Biomedical Semantics, ISSN 2041-1480, E-ISSN 2041-1480, Vol. 5, no. 6. Article in journal (Refereed), Published
Abstract [en]

Background: Terminologies that account for variation in language use by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in the medical domain, manual construction of semantic resources that accurately reflect language use is both costly and challenging, often resulting in low coverage. Although models of distributional semantics applied to large corpora provide a potential means of supporting development of such resources, their ability to isolate synonymy from other semantic relations is limited. Their application in the clinical domain has also only recently begun to be explored. Combining distributional models and applying them to different types of corpora may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs.
Results: A combination of two distributional models – Random Indexing and Random Permutation – employed in conjunction with a single corpus outperforms using either of the models in isolation. Furthermore, combining semantic spaces induced from different types of corpora – a corpus of clinical text and a corpus of medical journal articles – further improves results, outperforming a combination of semantic spaces induced from a single source, as well as a single semantic space induced from the conjoint corpus. A combination strategy that simply sums the cosine similarity scores of candidate terms is generally the most profitable out of the ones explored. Finally, applying simple post-processing filtering rules yields substantial performance gains on the tasks of extracting abbreviation-expansion pairs, but not synonyms. The best results, measured as recall in a list of ten candidate terms, for the three tasks are: 0.39 for abbreviations to long forms, 0.33 for long forms to abbreviations, and 0.47 for synonyms.
Conclusions: This study demonstrates that ensembles of semantic spaces can yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits further exploration, allows different distributional models – with different model parameters – and different types of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural language processing tasks.
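The combination strategy described in the abstract, summing the cosine similarity scores of candidate terms across semantic spaces and measuring recall in a candidate list, could look roughly like the sketch below. The vectors, the term names, and the functions `ensemble_candidates` and `recall_at_k` are hypothetical; terms missing from a space simply contribute nothing from that space, which is a simplification and not necessarily what the paper does.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def ensemble_candidates(query, spaces, k=10):
    """Rank terms by the sum of their cosine similarities to `query` across spaces."""
    totals = {}
    for space in spaces:
        if query not in space:
            continue  # this space has no vector for the query term
        for term, vec in space.items():
            if term != query:
                totals[term] = totals.get(term, 0.0) + cosine(space[query], vec)
    # Highest summed score first; ties broken alphabetically for determinism.
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    return ranked[:k]

def recall_at_k(candidates, gold):
    """Fraction of the gold-standard terms that appear in the candidate list."""
    return len(set(candidates) & set(gold)) / len(gold) if gold else 0.0

# Invented toy vectors standing in for a clinical space and a journal space.
clinical = {"mi": [1.0, 0.0], "infarction": [0.9, 0.1], "aspirin": [0.0, 1.0]}
journal = {"mi": [1.0, 1.0], "infarction": [1.0, 0.8], "aspirin": [1.0, -1.0]}

top = ensemble_candidates("mi", [clinical, journal], k=1)
print(top, recall_at_k(top, ["infarction"]))
```

Summing scores rewards terms that are close to the query in several spaces at once, which is one plausible reason such a simple strategy can work well.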

Keywords
distributional semantics, random indexing, semantic space, ensemble methods, synonym extraction, abbreviation expansion
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-108651 (URN); 10.1186/2041-1480-5-6 (DOI); 000343707900002 ()
Available from: 2014-10-31 Created: 2014-10-31 Last updated: 2015-11-03 Bibliographically approved
2. Generating Features for Named Entity Recognition by Learning Prototypes in Semantic Space: The Case of De-Identifying Health Records
2014 (English) In: 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE conference proceedings, 2014, pp. 450-457. Conference paper, Published paper (Refereed)
Abstract [en]

Creating sufficiently large annotated resources for supervised machine learning, and doing so for every problem and every domain, is prohibitively expensive. Techniques that leverage large amounts of unlabeled data, which are often readily available, may decrease the amount of data that needs to be annotated to obtain a certain level of performance, as well as improve performance when large annotated resources are indeed available. Here, the development of one such method is presented, where semantic features are generated by exploiting the available annotations to learn prototypical (vector) representations of each named entity class in semantic space, constructed by employing a model of distributional semantics (random indexing) over a large, unannotated, in-domain corpus. Binary features that describe whether a given word belongs to a specific named entity class are provided to the learning algorithm; the feature values are determined by calculating the (cosine) distance in semantic space to each of the learned prototype vectors and ascertaining whether they are below or above a given threshold, set to optimize Fβ-score. The proposed method is evaluated empirically in a series of experiments, where the case is health-record deidentification, a task that involves identifying protected health information (PHI) in text. It is shown that a conditional random fields model with access to the generated semantic features, in addition to a set of orthographic and syntactic features, significantly outperforms, in terms of F1-score, a baseline model without access to the semantic features. Moreover, the quality of the features is further improved by employing a number of slightly different models of distributional semantics in an ensemble. Finally, the way in which the features are generated allows one to optimize them for various Fβ-scores, giving some degree of control to trade off precision and recall. Methods that are able to improve performance on named entity recognition tasks by exploiting large amounts of unlabeled data may substantially reduce costs involved in creating annotated resources for every domain and every problem.
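The feature generation described above can be sketched as follows: a prototype vector is learned per named entity class from annotated words, and a binary feature records whether a word's cosine distance to each prototype falls below a threshold. Taking the prototype to be the centroid of the annotated words' vectors is an assumption of this sketch, as are the toy vectors, the class names, and the fixed thresholds (the paper sets thresholds to optimize Fβ-score, which is not reproduced here).

```python
import math

def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def learn_prototypes(space, annotations):
    """One prototype per class: the centroid of its annotated words (assumption)."""
    return {cls: centroid([space[w] for w in words if w in space])
            for cls, words in annotations.items()}

def binary_features(word, space, prototypes, thresholds):
    """True for a class if the word lies within that class's distance threshold."""
    return {cls: cosine_distance(space[word], proto) <= thresholds[cls]
            for cls, proto in prototypes.items()}

# Invented toy semantic space and annotations.
space = {
    "anna": [1.0, 0.1], "erik": [0.9, 0.2], "maria": [0.95, 0.15],
    "stockholm": [0.1, 1.0], "kista": [0.2, 0.9],
}
annotations = {"NAME": ["anna", "erik"], "LOCATION": ["stockholm", "kista"]}

prototypes = learn_prototypes(space, annotations)
# Fixed thresholds for illustration; in practice they would be tuned per class.
features = binary_features("maria", space, prototypes, {"NAME": 0.2, "LOCATION": 0.2})
print(features)
```

Lowering a class threshold makes its feature fire less often, which is how such a scheme can trade recall for precision.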

Place, publisher, year, edition, pages
IEEE conference proceedings, 2014
Keywords
distributional semantics, random indexing, named entity recognition, deidentification, electronic health records
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-110992 (URN); 10.1109/BIBM.2014.6999199 (DOI); 978-1-4799-5701-9/14 (ISBN)
Conference
2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2-5 November 2014, Belfast, UK
Funder
Swedish Foundation for Strategic Research (SSF), IIS11-0053
Available from: 2014-12-19 Created: 2014-12-19 Last updated: 2015-11-03 Bibliographically approved
3. Learning multiple distributed prototypes of semantic categories for named entity recognition
2015 (English) In: International Journal of Data Mining and Bioinformatics, ISSN 1748-5681, Vol. 13, no. 4, pp. 395-411. Article in journal (Refereed), Published
Abstract [en]

The scarcity of large labelled datasets comprising clinical text that can be exploited within the paradigm of supervised machine learning creates barriers for the secondary use of data from electronic health records. It is therefore important to develop capabilities to leverage the large amounts of unlabelled data that, indeed, tend to be readily available. One technique utilises distributional semantics to create word representations in a wholly unsupervised manner and uses existing training data to learn prototypical representations of predefined semantic categories. Features describing whether a given word belongs to a certain category are then provided to the learning algorithm. It has been shown that using multiple distributional semantic models, each employing a different word order strategy, can lead to enhanced predictive performance. Here, another hyperparameter is also varied – the size of the context window – and an experimental investigation shows that this leads to further performance gains.

Keywords
distributional semantics, semantic space ensembles, random indexing, named entity recognition, electronic health records, de-identification
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122461 (URN); 10.1504/IJDMB.2015.072766 (DOI); 000366135400005 ()
Project
High-Performance Data Mining for Drug Effect Detection
Funder
Swedish Foundation for Strategic Research (SSF), IIS11-0053
Available from: 2015-11-02 Created: 2015-11-02 Last updated: 2016-01-04 Bibliographically approved
4. Modeling Heterogeneous Clinical Sequence Data in Semantic Space for Adverse Drug Event Detection
(English) In: IEEE International Conference on Data Science and Advanced Analytics (DSAA), IEEE conference proceedings. Conference paper, Published paper (Refereed)
Abstract [en]

The enormous amounts of data that are continuously recorded in electronic health record systems offer ample opportunities for data science applications to improve healthcare. There are, however, challenges involved in using such data for machine learning, such as high dimensionality and sparsity, as well as an inherent heterogeneity that does not allow the distinct types of clinical data to be treated in an identical manner. On the other hand, there are also similarities across data types that may be exploited, e.g., the possibility of representing some of them as sequences. Here, we apply the notions underlying distributional semantics, i.e., methods that model the meaning of words in semantic (vector) space on the basis of co-occurrence information, to four distinct types of clinical data: free-text notes, on the one hand, and clinical events, in the form of diagnosis codes, drug codes and measurements, on the other hand. Each semantic space contains continuous vector representations for every unique word and event, which can then be used to create representations of, e.g., care episodes that, in turn, can be exploited by the learning algorithm. This approach does not only reduce sparsity, but also takes into account, and explicitly models, similarities between various items, and it does so in an entirely data-driven fashion. Here, we report on a series of experiments using the random forest learning algorithm that demonstrate the effectiveness, in terms of accuracy and area under ROC curve, of the proposed representation form over the commonly used bag-of-items counterpart. The experiments are conducted on 27 real datasets that each involves the (binary) classification task of detecting a particular adverse drug event. It is also shown that combining structured and unstructured data leads to significant improvements over using only one of them.
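To make the contrast between the bag-of-items representation and a semantic-space representation concrete, the sketch below represents a care episode both ways. The abstract does not say how item vectors are composed into an episode representation; averaging them is an assumption here, and the item names, vectors, and functions `bag_of_items` and `episode_vector` are invented for illustration.

```python
def bag_of_items(episode, vocabulary):
    """Sparse baseline: one count per vocabulary item."""
    return [episode.count(item) for item in vocabulary]

def episode_vector(episode, space, dim):
    """Dense alternative: mean of the semantic vectors of the episode's items.

    Averaging is an assumption of this sketch; items without a vector are skipped.
    """
    known = [space[item] for item in episode if item in space]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

# Hypothetical drug codes with invented 2-dimensional vectors;
# drug_a and drug_b are assumed to occur in similar contexts.
vocabulary = ["drug_a", "drug_b", "drug_c"]
drug_space = {"drug_a": [0.8, 0.2], "drug_b": [0.7, 0.3], "drug_c": [0.1, 0.9]}

episode_1 = ["drug_a"]
episode_2 = ["drug_b"]

# Bag-of-items: the two episodes share no non-zero dimension.
print(bag_of_items(episode_1, vocabulary), bag_of_items(episode_2, vocabulary))
# Semantic space: the two episodes get nearby dense vectors.
print(episode_vector(episode_1, drug_space, 2), episode_vector(episode_2, drug_space, 2))
```

The dense vectors have a fixed, small dimensionality regardless of vocabulary size, and similar items yield similar episode representations, which is the sparsity reduction and similarity modeling the abstract refers to.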

Place, publisher, year, edition, pages
IEEE conference proceedings
Keywords
distributional semantics, semantic space ensembles, heterogeneous data, electronic health records, adverse drug events, predictive modeling
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122462 (URN)
Conference
IEEE International Conference on Data Science and Advanced Analytics (DSAA), 19-21 October, Paris
Project
High-Performance Data Mining for Drug Effect Detection
Funder
Swedish Foundation for Strategic Research (SSF), IIS11-0053
Available from: 2015-11-02 Created: 2015-11-02 Last updated: 2015-11-03
5. Modeling Electronic Health Records in Ensembles of Semantic Spaces for Adverse Drug Event Detection
2015 (English) In: 2015 IEEE International Conference on Bioinformatics and Biomedicine: Proceedings / [ed] Jun (Luke) Huan et al., IEEE Computer Society, 2015, pp. 343-350. Conference paper, Published paper (Refereed)
Abstract [en]

Electronic health records (EHRs) are emerging as a potentially valuable source for pharmacovigilance; however, adverse drug events (ADEs), which can be encoded in EHRs by a set of diagnosis codes, are heavily underreported. Alerting systems, able to detect potential ADEs on the basis of patient-specific EHR data, would help to mitigate this problem. To that end, the use of machine learning has proven to be both efficient and effective; however, challenges remain in representing the heterogeneous EHR data, which moreover tends to be high-dimensional and exceedingly sparse, in a manner conducive to learning high-performing predictive models. Prior work has shown that distributional semantics – that is, natural language processing methods that, traditionally, model the meaning of words in semantic (vector) space on the basis of co-occurrence information – can be exploited to create effective representations of sequential EHR data, not only free-text in clinical notes but also various clinical events such as diagnoses, drugs and measurements. When modeling data in semantic space, an important design decision concerns the size of the context window around an object of interest, which governs the scope of co-occurrence information that is taken into account and affects the composition of the resulting semantic space. Here, we report on experiments conducted on 27 clinical datasets, demonstrating that performance can be significantly improved by modeling EHR data in ensembles of semantic spaces, consisting of multiple semantic spaces built with different context window sizes. A follow-up investigation is conducted to study the impact on predictive performance as increasingly more semantic spaces are included in the ensemble, demonstrating that accuracy tends to improve with the number of semantic spaces, albeit not monotonically so. Finally, a number of different strategies for combining the semantic spaces are explored, demonstrating the advantage of early (feature) fusion over late (classifier) fusion. Ensembles of semantic spaces allow multiple views of (sparse) data to be captured (densely) and thereby enable improved performance to be obtained on the task of detecting ADEs in EHRs.
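The difference between early (feature) fusion and late (classifier) fusion can be sketched as below. Each instance is assumed to have one vector per semantic space, for example spaces built with different context window sizes. A nearest-centroid scorer stands in for a real learning algorithm purely to keep the sketch self-contained; the data, labels, and function names are invented, and the toy example shows only the mechanics, not the reported advantage of early fusion.

```python
def concat(views):
    """Early (feature) fusion: concatenate the per-space vectors of one instance."""
    return [x for view in views for x in view]

def centroid(rows):
    """Component-wise mean of equally sized vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sqdist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def fit(X, y):
    """Stand-in learner: one centroid per binary label (0 or 1)."""
    return {lab: centroid([x for x, l in zip(X, y) if l == lab]) for lab in (0, 1)}

def score(model, x):
    """Score in [0, 1] for label 1, based on relative distance to the two centroids."""
    d0, d1 = sqdist(model[0], x), sqdist(model[1], x)
    return d0 / (d0 + d1) if d0 + d1 else 0.5

# Each training instance: ([vector in space 1, vector in space 2], label).
train = [
    ([[0.9, 0.1], [0.8, 0.3]], 1),
    ([[0.8, 0.2], [0.7, 0.2]], 1),
    ([[0.1, 0.9], [0.2, 0.8]], 0),
    ([[0.2, 0.8], [0.3, 0.9]], 0),
]
labels = [y for _, y in train]
test_views = [[0.7, 0.3], [0.6, 0.4]]

# Early fusion: one model trained on the concatenated features.
early_model = fit([concat(views) for views, _ in train], labels)
early_score = score(early_model, concat(test_views))

# Late fusion: one model per space, then average their scores.
late_models = [fit([views[i] for views, _ in train], labels) for i in range(2)]
late_score = sum(score(m, test_views[i]) for i, m in enumerate(late_models)) / len(late_models)

print(early_score, late_score)
```

With early fusion the learner sees all views at once and can exploit interactions between them, whereas late fusion only combines the outputs of models that each saw a single view.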

Place, publisher, year, edition, pages
IEEE Computer Society, 2015
Keywords
distributional semantics, semantic space ensembles, ensemble models, electronic health records, adverse drug events, predictive modeling, information fusion
National Category
Language Technology (Computational Linguistics); Computer Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122463 (URN); 10.1109/BIBM.2015.7359705 (DOI)
Conference
IEEE BIBM, International Conference on Bioinformatics and Biomedicine, Washington, D.C., USA, 9-12 November 2015
Project
High-Performance Data Mining for Drug Effect Detection
Funder
Swedish Foundation for Strategic Research (SSF), IIS11-0053
Available from: 2015-11-02 Created: 2015-11-02 Last updated: 2017-01-16 Bibliographically approved
6. Identifying adverse drug event information in clinical notes with distributional semantic representations of context
2015 (English) In: Journal of Biomedical Informatics, ISSN 1532-0464, E-ISSN 1532-0480, Vol. 57, pp. 333-349. Article in journal (Refereed), Published
Abstract [en]

For the purpose of post-marketing drug safety surveillance, which has traditionally relied on the voluntary reporting of individual cases of adverse drug events (ADEs), other sources of information are now being explored, including electronic health records (EHRs), which give us access to enormous amounts of longitudinal observations of the treatment of patients and their drug use. Adverse drug events, which can be encoded in EHRs with certain diagnosis codes, are, however, heavily underreported. It is therefore important to develop capabilities to process, by means of computational methods, the more unstructured EHR data in the form of clinical notes, where clinicians may describe and reason around suspected ADEs. In this study, we report on the creation of an annotated corpus of Swedish health records for the purpose of learning to identify information pertaining to ADEs present in clinical notes. To this end, three key tasks are tackled: recognizing relevant named entities (disorders, symptoms, drugs), labeling attributes of the recognized entities (negation, speculation, temporality), and relationships between them (indication, adverse drug event). For each of the three tasks, leveraging models of distributional semantics – i.e., unsupervised methods that exploit co-occurrence information to model, typically in vector space, the meaning of words – and, in particular, combinations of such models, is shown to improve the predictive performance. The ability to make use of such unsupervised methods is critical when faced with large amounts of sparse and high-dimensional data, especially in domains where annotated resources are scarce.

Keywords
adverse drug events, electronic health records, corpus annotation, machine learning, distributional semantics, relation extraction
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122464 (URN); 10.1016/j.jbi.2015.08.013 (DOI); 000363437500028 ()
Project
High-Performance Data Mining for Drug Effect Detection
Funder
Swedish Foundation for Strategic Research (SSF), IIS11-0053
Available from: 2015-11-02 Created: 2015-11-02 Last updated: 2015-11-23 Bibliographically approved

Open Access in DiVA

Ensembles of Semantic Spaces (890 kB), 279 downloads
File information
File name: FULLTEXT01.pdf; File size: 890 kB; Checksum: SHA-512
93188938b53753234d8db6298c3e1bf345313ff81e168616e00b80b5f60108c20e7c93f2af6f538c4f458361848158b17420917ca5dea47958e1201a472c3f62
Type: fulltext; Mimetype: application/pdf

Search in DiVA

By author/editor
Henriksson, Aron
By organisation
Department of Computer and Systems Sciences
Computer Sciences; Language Technology (Computational Linguistics)
