Ensembles of Semantic Spaces: On Combining Models of Distributional Semantics with Applications in Healthcare
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2015 (English)
Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Distributional semantics allows models of linguistic meaning to be derived from observations of language use in large amounts of text. By modeling the meaning of words in semantic (vector) space on the basis of co-occurrence information, distributional semantics permits a quantitative interpretation of (relative) word meaning in an unsupervised setting, i.e., human annotations are not required. The ability to obtain inexpensive word representations in this manner helps to alleviate the bottleneck of fully supervised approaches to natural language processing, especially since models of distributional semantics are data-driven and hence agnostic to both language and domain.

All that is required to obtain distributed word representations is a sizeable corpus; however, the composition of the semantic space is affected not only by the underlying data but also by certain model hyperparameters. While these can be optimized for a specific downstream task, there are currently limitations to the extent to which the many aspects of semantics can be captured in a single model. This dissertation investigates the possibility of capturing multiple aspects of lexical semantics by adopting the ensemble methodology within a distributional semantic framework to create ensembles of semantic spaces. To that end, various strategies for creating the constituent semantic spaces, as well as for combining them, are explored in a number of studies.

The notion of semantic space ensembles is generalizable across languages and domains; however, the use of unsupervised methods is particularly valuable in low-resource settings, in particular when annotated corpora are scarce, as in the domain of Swedish healthcare. The semantic space ensembles are here empirically evaluated for tasks that have promising applications in healthcare. It is shown that semantic space ensembles – created by exploiting various corpora and data types, as well as by adjusting model hyperparameters such as the size of the context window and the strategy for handling word order within the context window – are able to outperform the use of any single constituent model on a range of tasks. The semantic space ensembles are used both directly for k-nearest neighbors retrieval and for semi-supervised machine learning. Applying semantic space ensembles to important medical problems facilitates the secondary use of healthcare data, which, despite its abundance and transformative potential, is grossly underutilized.
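To illustrate how the size of the context window shapes a semantic space, the following minimal sketch builds plain co-occurrence count vectors from a toy token sequence with two different window sizes. It is an illustration under stated assumptions, not the implementation used in the thesis (the included papers rely on models such as random indexing); the function name `cooccurrence_space` and the toy sentence are hypothetical.

```python
from collections import Counter, defaultdict

def cooccurrence_space(tokens, window):
    """Map each word to counts of the words seen within `window` positions of it."""
    space = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                space[word][tokens[j]] += 1
    return space

# Toy corpus (hypothetical); real semantic spaces need a sizeable corpus.
tokens = "patient denies chest pain patient reports chest discomfort".split()

narrow = cooccurrence_space(tokens, window=1)  # immediate neighbours only
wide = cooccurrence_space(tokens, window=3)    # broader context
```

The two spaces describe the same word differently: with a window of one, "chest" never co-occurs with "patient", whereas with a window of three it does. Differences of this kind are what an ensemble of semantic spaces can exploit.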

Place, publisher, year, edition, pages
Stockholm: Department of Computer and Systems Sciences, Stockholm University, 2015. 95 p.
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 15-021
Keyword [en]
natural language processing, machine learning, distributional semantics, ensemble learning, semantic space ensembles, medical informatics, electronic health records
National Category
Computer Science Language Technology (Computational Linguistics)
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-122465
ISBN: 978-91-7649-302-1 (print)
OAI: oai:DiVA.org:su-122465
DiVA: diva2:866466
Public defence
2015-12-17, Lilla hörsalen, NOD-huset, Borgarfjordsgatan 12, Kista, 13:00 (English)
Opponent
Supervisors
Projects
High-Performance Data Mining for Drug Effect Detection
Funder
Swedish Foundation for Strategic Research, IIS11-0053
Note

At the time of the doctoral defense, the following papers were unpublished and had the following status: Papers 4 and 5: unpublished conference papers.

Available from: 2015-11-25 Created: 2015-11-02 Last updated: 2015-11-13 Bibliographically approved
List of papers
1. Synonym extraction and abbreviation expansion with ensembles of semantic spaces
2014 (English) In: Journal of Biomedical Semantics, ISSN 2041-1480, E-ISSN 2041-1480, Vol. 5, no 6
Article in journal (Refereed) Published
Abstract [en]

Background: Terminologies that account for variation in language use by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in the medical domain, manual construction of semantic resources that accurately reflect language use is both costly and challenging, often resulting in low coverage. Although models of distributional semantics applied to large corpora provide a potential means of supporting development of such resources, their ability to isolate synonymy from other semantic relations is limited. Their application in the clinical domain has also only recently begun to be explored. Combining distributional models and applying them to different types of corpora may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs.
Results: A combination of two distributional models – Random Indexing and Random Permutation – employed in conjunction with a single corpus outperforms using either of the models in isolation. Furthermore, combining semantic spaces induced from different types of corpora – a corpus of clinical text and a corpus of medical journal articles – further improves results, outperforming a combination of semantic spaces induced from a single source, as well as a single semantic space induced from the conjoint corpus. A combination strategy that simply sums the cosine similarity scores of candidate terms is generally the most profitable out of the ones explored. Finally, applying simple post-processing filtering rules yields substantial performance gains on the tasks of extracting abbreviation-expansion pairs, but not synonyms. The best results, measured as recall in a list of ten candidate terms, for the three tasks are: 0.39 for abbreviations to long forms, 0.33 for long forms to abbreviations, and 0.47 for synonyms.
Conclusions: This study demonstrates that ensembles of semantic spaces can yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits further exploration, allows different distributional models – with different model parameters – and different types of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural language processing tasks.
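The combination strategy described in the results, summing the cosine similarity scores of candidate terms across semantic spaces and evaluating recall in a list of ten candidates, can be sketched as follows. This is a minimal illustration rather than the paper's code: the function names, the dictionary-based representation of a semantic space, and the toy two-dimensional vectors are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_candidates(query, spaces, k=10):
    """Rank candidate terms by the sum of their cosine similarities to `query`."""
    scores = {}
    for space in spaces:
        if query not in space:
            continue  # this space has no vector for the query term
        for term, vec in space.items():
            if term != query:
                scores[term] = scores.get(term, 0.0) + cosine(space[query], vec)
    return sorted(scores, key=scores.get, reverse=True)[:k]

def recall_at_k(ranked, gold):
    """Fraction of the gold-standard terms found in the ranked candidate list."""
    return len(set(ranked) & set(gold)) / len(gold)

# Hypothetical toy spaces standing in for a clinical and a journal corpus.
clinical = {"mi": [1.0, 0.0], "myocardial infarction": [0.9, 0.1], "fracture": [0.0, 1.0]}
journal = {"mi": [1.0, 1.0], "myocardial infarction": [1.0, 0.9], "fracture": [1.0, -1.0]}

top = rank_candidates("mi", [clinical, journal], k=1)
```

Summing the scores means a candidate that is moderately similar in every space can outrank one that is highly similar in only one of them.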

Keyword
distributional semantics, random indexing, semantic space, ensemble methods, synonym extraction, abbreviation expansion
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-108651 (URN)
10.1186/2041-1480-5-6 (DOI)
000343707900002 ()
Available from: 2014-10-31 Created: 2014-10-31 Last updated: 2015-11-03 Bibliographically approved
2. Generating Features for Named Entity Recognition by Learning Prototypes in Semantic Space: The Case of De-Identifying Health Records
2014 (English) In: 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE conference proceedings, 2014, 450-457 p.
Conference paper (Refereed)
Abstract [en]

Creating sufficiently large annotated resources for supervised machine learning, and doing so for every problem and every domain, is prohibitively expensive. Techniques that leverage large amounts of unlabeled data, which are often readily available, may decrease the amount of data that needs to be annotated to obtain a certain level of performance, as well as improve performance when large annotated resources are indeed available. Here, the development of one such method is presented, where semantic features are generated by exploiting the available annotations to learn prototypical (vector) representations of each named entity class in semantic space, constructed by employing a model of distributional semantics (random indexing) over a large, unannotated, in-domain corpus. Binary features that describe whether a given word belongs to a specific named entity class are provided to the learning algorithm; the feature values are determined by calculating the (cosine) distance in semantic space to each of the learned prototype vectors and ascertaining whether they are below or above a given threshold, set to optimize Fβ-score. The proposed method is evaluated empirically in a series of experiments, where the case is health-record deidentification, a task that involves identifying protected health information (PHI) in text. It is shown that a conditional random fields model with access to the generated semantic features, in addition to a set of orthographic and syntactic features, significantly outperforms, in terms of F1-score, a baseline model without access to the semantic features. Moreover, the quality of the features is further improved by employing a number of slightly different models of distributional semantics in an ensemble. Finally, the way in which the features are generated allows one to optimize them for various Fβ-scores, giving some degree of control to trade off precision and recall.
Methods that are able to improve performance on named entity recognition tasks by exploiting large amounts of unlabeled data may substantially reduce costs involved in creating annotated resources for every domain and every problem.
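The feature-generation idea in this abstract can be sketched in a few lines: a prototype vector is learned for each named entity class, and a word receives a binary feature per class depending on whether its cosine distance to the prototype falls below a threshold. The abstract does not say how prototypes are learned, so the sketch assumes a simple centroid of the annotated words' vectors; the class labels, thresholds, and vectors are likewise hypothetical.

```python
import math

def centroid(vectors):
    """Assumed prototype: the component-wise mean of a class's word vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def prototype_features(word_vec, prototypes, thresholds):
    """One binary feature per class: is the word close enough to the prototype?"""
    return {
        label: cosine_distance(word_vec, proto) <= thresholds[label]
        for label, proto in prototypes.items()
    }

# Hypothetical vectors of words annotated with each class in the training data.
prototypes = {
    "NAME": centroid([[1.0, 0.0], [0.8, 0.2]]),
    "LOCATION": centroid([[0.0, 1.0], [0.1, 0.9]]),
}
# Hypothetical thresholds; in the paper they are set to optimize an F-beta score.
thresholds = {"NAME": 0.2, "LOCATION": 0.2}

features = prototype_features([0.9, 0.1], prototypes, thresholds)
```

Raising a class's threshold makes its feature fire for more words, which is one way to read the precision and recall trade-off mentioned in the abstract.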

Place, publisher, year, edition, pages
IEEE conference proceedings, 2014
Keyword
distributional semantics, random indexing, named entity recognition, deidentification, electronic health records
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-110992 (URN)
10.1109/BIBM.2014.6999199 (DOI)
978-1-4799-5701-9/14 (ISBN)
Conference
2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2-5 November 2014, Belfast, UK
Funder
Swedish Foundation for Strategic Research, IIS11-0053
Available from: 2014-12-19 Created: 2014-12-19 Last updated: 2015-11-03 Bibliographically approved
3. Learning multiple distributed prototypes of semantic categories for named entity recognition
2015 (English) In: International Journal of Data Mining and Bioinformatics, ISSN 1748-5681, Vol. 13, no 4, 395-411 p.
Article in journal (Refereed) Published
Abstract [en]

The scarcity of large labelled datasets comprising clinical text that can be exploited within the paradigm of supervised machine learning creates barriers for the secondary use of data from electronic health records. It is therefore important to develop capabilities to leverage the large amounts of unlabelled data that, indeed, tend to be readily available. One technique utilises distributional semantics to create word representations in a wholly unsupervised manner and uses existing training data to learn prototypical representations of predefined semantic categories. Features describing whether a given word belongs to a certain category are then provided to the learning algorithm. It has been shown that using multiple distributional semantic models, each employing a different word order strategy, can lead to enhanced predictive performance. Here, another hyperparameter is also varied – the size of the context window – and an experimental investigation shows that this leads to further performance gains.

Keyword
distributional semantics, semantic space ensembles, random indexing, named entity recognition, electronic health records, de-identification
National Category
Computer Science Language Technology (Computational Linguistics)
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122461 (URN)
10.1504/IJDMB.2015.072766 (DOI)
000366135400005 ()
Projects
High-Performance Data Mining for Drug Effect Detection
Funder
Swedish Foundation for Strategic Research, IIS11-0053
Available from: 2015-11-02 Created: 2015-11-02 Last updated: 2016-01-04 Bibliographically approved
4. Modeling Heterogeneous Clinical Sequence Data in Semantic Space for Adverse Drug Event Detection
(English) In: IEEE International Conference on Data Science and Advanced Analytics (DSAA), IEEE conference proceedings
Conference paper (Refereed)
Abstract [en]

The enormous amounts of data that are continuously recorded in electronic health record systems offer ample opportunities for data science applications to improve healthcare. There are, however, challenges involved in using such data for machine learning, such as high dimensionality and sparsity, as well as an inherent heterogeneity that does not allow the distinct types of clinical data to be treated in an identical manner. On the other hand, there are also similarities across data types that may be exploited, e.g., the possibility of representing some of them as sequences. Here, we apply the notions underlying distributional semantics, i.e., methods that model the meaning of words in semantic (vector) space on the basis of co-occurrence information, to four distinct types of clinical data: free-text notes, on the one hand, and clinical events, in the form of diagnosis codes, drug codes and measurements, on the other hand. Each semantic space contains continuous vector representations for every unique word and event, which can then be used to create representations of, e.g., care episodes that, in turn, can be exploited by the learning algorithm. This approach does not only reduce sparsity, but also takes into account, and explicitly models, similarities between various items, and it does so in an entirely data-driven fashion. Here, we report on a series of experiments using the random forest learning algorithm that demonstrate the effectiveness, in terms of accuracy and area under ROC curve, of the proposed representation form over the commonly used bag-of-items counterpart. The experiments are conducted on 27 real datasets that each involves the (binary) classification task of detecting a particular adverse drug event. It is also shown that combining structured and unstructured data leads to significant improvements over using only one of them.

Place, publisher, year, edition, pages
IEEE conference proceedings
Keyword
distributional semantics, semantic space ensembles, heterogeneous data, electronic health records, adverse drug events, predictive modeling
National Category
Computer Science Language Technology (Computational Linguistics)
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122462 (URN)
Conference
IEEE International Conference on Data Science and Advanced Analytics (DSAA), 19-21 October, Paris
Projects
High-Performance Data Mining for Drug Effect Detection
Funder
Swedish Foundation for Strategic Research, IIS11-0053
Available from: 2015-11-02 Created: 2015-11-02 Last updated: 2015-11-03
5. Modeling Electronic Health Records in Ensembles of Semantic Spaces for Adverse Drug Event Detection
2015 (English) In: 2015 IEEE International Conference on Bioinformatics and Biomedicine: Proceedings / [ed] Jun (Luke) Huan et al., IEEE Computer Society, 2015, 343-350 p.
Conference paper (Refereed)
Abstract [en]

Electronic health records (EHRs) are emerging as a potentially valuable source for pharmacovigilance; however, adverse drug events (ADEs), which can be encoded in EHRs by a set of diagnosis codes, are heavily underreported. Alerting systems, able to detect potential ADEs on the basis of patient-specific EHR data, would help to mitigate this problem. To that end, the use of machine learning has proven to be both efficient and effective; however, challenges remain in representing the heterogeneous EHR data, which moreover tends to be high-dimensional and exceedingly sparse, in a manner conducive to learning high-performing predictive models. Prior work has shown that distributional semantics – that is, natural language processing methods that, traditionally, model the meaning of words in semantic (vector) space on the basis of co-occurrence information – can be exploited to create effective representations of sequential EHR data, not only free-text in clinical notes but also various clinical events such as diagnoses, drugs and measurements. When modeling data in semantic space, an important design decision concerns the size of the context window around an object of interest, which governs the scope of co-occurrence information that is taken into account and affects the composition of the resulting semantic space. Here, we report on experiments conducted on 27 clinical datasets, demonstrating that performance can be significantly improved by modeling EHR data in ensembles of semantic spaces, consisting of multiple semantic spaces built with different context window sizes. A follow-up investigation is conducted to study the impact on predictive performance as increasingly more semantic spaces are included in the ensemble, demonstrating that accuracy tends to improve with the number of semantic spaces, albeit not monotonically so.
Finally, a number of different strategies for combining the semantic spaces are explored, demonstrating the advantage of early (feature) fusion over late (classifier) fusion. Ensembles of semantic spaces allow multiple views of (sparse) data to be captured (densely) and thereby enable improved performance to be obtained on the task of detecting ADEs in EHRs.
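The distinction between early (feature) fusion and late (classifier) fusion can be sketched as follows. This is a generic illustration of the two strategies, not the paper's experimental setup: the function names and numbers are hypothetical, and averaging is only one of several possible late-fusion rules.

```python
def early_fusion(views):
    """Feature fusion: concatenate one example's vectors from every semantic
    space into a single feature vector for a single classifier."""
    return [x for view in views for x in view]

def late_fusion(probabilities):
    """Classifier fusion (assumed rule): average the positive-class
    probabilities produced by one classifier per semantic space."""
    return sum(probabilities) / len(probabilities)

# One example represented in two hypothetical semantic spaces
# (e.g. built with different context window sizes).
view_small_window = [0.1, 0.7]
view_large_window = [0.4, 0.2, 0.9]

fused_features = early_fusion([view_small_window, view_large_window])

# Hypothetical outputs of two per-space classifiers for the same example.
fused_probability = late_fusion([0.25, 0.75])
```

With early fusion a single model can learn interactions between features from different spaces; with late fusion each model sees only its own space and only the predictions are combined.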

Place, publisher, year, edition, pages
IEEE Computer Society, 2015
Keyword
distributional semantics, semantic space ensembles, ensemble models, electronic health records, adverse drug events, predictive modeling, information fusion
National Category
Language Technology (Computational Linguistics) Computer Science
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122463 (URN)
10.1109/BIBM.2015.7359705 (DOI)
Conference
IEEE BIBM, International Conference on Bioinformatics and Biomedicine, U.S.A, Washington, D.C., 09-12 November 2015
Projects
High-Performance Data Mining for Drug Effect Detection
Funder
Swedish Foundation for Strategic Research, IIS11-0053
Available from: 2015-11-02 Created: 2015-11-02 Last updated: 2017-01-16 Bibliographically approved
6. Identifying adverse drug event information in clinical notes with distributional semantic representations of context
2015 (English) In: Journal of Biomedical Informatics, ISSN 1532-0464, E-ISSN 1532-0480, Vol. 57, 333-349 p.
Article in journal (Refereed) Published
Abstract [en]

For the purpose of post-marketing drug safety surveillance, which has traditionally relied on the voluntary reporting of individual cases of adverse drug events (ADEs), other sources of information are now being explored, including electronic health records (EHRs), which give us access to enormous amounts of longitudinal observations of the treatment of patients and their drug use. Adverse drug events, which can be encoded in EHRs with certain diagnosis codes, are, however, heavily underreported. It is therefore important to develop capabilities to process, by means of computational methods, the more unstructured EHR data in the form of clinical notes, where clinicians may describe and reason around suspected ADEs. In this study, we report on the creation of an annotated corpus of Swedish health records for the purpose of learning to identify information pertaining to ADEs present in clinical notes. To this end, three key tasks are tackled: recognizing relevant named entities (disorders, symptoms, drugs), labeling attributes of the recognized entities (negation, speculation, temporality), and relationships between them (indication, adverse drug event). For each of the three tasks, leveraging models of distributional semantics – i.e., unsupervised methods that exploit co-occurrence information to model, typically in vector space, the meaning of words – and, in particular, combinations of such models, is shown to improve the predictive performance. The ability to make use of such unsupervised methods is critical when faced with large amounts of sparse and high-dimensional data, especially in domains where annotated resources are scarce.

Keyword
adverse drug events, electronic health records, corpus annotation, machine learning, distributional semantics, relation extraction
National Category
Computer Science Language Technology (Computational Linguistics)
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122464 (URN)
10.1016/j.jbi.2015.08.013 (DOI)
000363437500028 ()
Projects
High-Performance Data Mining for Drug Effect Detection
Funder
Swedish Foundation for Strategic Research, IIS11-0053
Available from: 2015-11-02 Created: 2015-11-02 Last updated: 2015-11-23 Bibliographically approved

Open Access in DiVA

Ensembles of Semantic Spaces (890 kB)
File name: FULLTEXT01.pdf
File size: 890 kB
Checksum (SHA-512): 93188938b53753234d8db6298c3e1bf345313ff81e168616e00b80b5f60108c20e7c93f2af6f538c4f458361848158b17420917ca5dea47958e1201a472c3f62
Type: fulltext
Mimetype: application/pdf

Author
Henriksson, Aron