Mining Clinical Text in Cancer Care
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2020 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Health care and clinical practice generate large amounts of text detailing symptoms, test results, diagnoses, treatments, and outcomes for patients. This clinical text, documented in health records, is a potential source of knowledge and an underused resource for improved health care. The focus of this work has been text mining of clinical text in the domain of cancer care, with the aim of developing and evaluating methods for extracting relevant information from such texts. Two different types of clinical documentation have been included: clinical notes from electronic health records in Swedish, and Norwegian pathology reports.

Free text, and clinical text in particular, is considered a kind of unstructured information, which is difficult to process automatically. Information extraction can therefore be applied to create a more structured representation of a text, making its content more accessible to machine learning and statistics. To this end, this thesis describes the development of an efficient and accurate tool for information extraction from pathology reports.

Another application of clinical text mining is risk prediction and diagnosis prediction. The goal of such prediction is to create a machine learning model capable of identifying patients at risk of a specific disease or some other adverse outcome. The motivation for cancer diagnosis prediction is that an early diagnosis can be beneficial for the outcome of treatment. Here, a disease prediction model was developed and evaluated for prediction of cervical cancer. To create this model, health records of patients diagnosed with cervical cancer were processed in two steps. First, clinical events were extracted from free-text clinical notes through the use of named entity recognition. Next, the extracted events were combined with other event types, such as diagnosis codes and drug codes from the same health records. Finally, machine learning models were trained for predicting cervical cancer, and evaluation showed that events extracted from the free-text records were the most informative event type for the diagnosis prediction.
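
The combination of event types described above can be pictured as a bag-of-events representation per patient. The following is a minimal sketch under stated assumptions, not the representation used in the thesis: the function names, the type prefixes, and the placeholder codes "DX1" and "RX1" are all hypothetical.

```python
from collections import Counter


def patient_features(events):
    """Count prefixed events so that different event types stay distinguishable."""
    return Counter(f"{event_type}:{value}" for event_type, value in events)


def to_vector(features, vocabulary):
    """Map a feature Counter onto a fixed, ordered vocabulary."""
    return [features.get(name, 0) for name in vocabulary]


# Hypothetical events for one patient; "DX1" and "RX1" are placeholder codes,
# and "note_entity" stands for an entity extracted from a free-text note.
events = [
    ("note_entity", "bleeding"),
    ("diagnosis", "DX1"),
    ("drug", "RX1"),
    ("note_entity", "bleeding"),
]
features = patient_features(events)
vocabulary = sorted(features)
vector = to_vector(features, vocabulary)
```

Vectors of this kind, one per patient, could then be passed to any standard classifier. Prefixing each value with its event type keeps it possible to compare how informative each type is.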

Place, publisher, year, edition, pages
Stockholm: Department of Computer and Systems Sciences, Stockholm University, 2020, p. 64
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 20-001
Keywords [en]
text mining, natural language processing, electronic health records, clinical text mining, information extraction
National Category
Computer and Information Sciences
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-176282
ISBN: 978-91-7797-911-1 (print)
ISBN: 978-91-7797-912-8 (electronic)
OAI: oai:DiVA.org:su-176282
DiVA, id: diva2:1373982
Public defence
2020-01-27, L30, NOD-huset, Borgarfjordsgatan 12, Kista, 13:00 (English)
Note

At the time of the doctoral defense, the following papers were unpublished and had a status as follows: Paper 4: Accepted. Paper 5: Submitted.

Available from: 2019-12-19 Created: 2019-11-28 Last updated: 2019-12-17 Bibliographically approved
List of papers
1. Finding Cervical Cancer Symptoms in Swedish Clinical Text using a Machine Learning Approach and NegEx
2015 (English) In: AMIA Annual Symposium Proceedings, American Medical Informatics Association, 2015, p. 1296-1305. Conference paper, Published paper (Refereed)
Abstract [en]

Detection of early symptoms in cervical cancer is crucial for early treatment and survival. To find symptoms of cervical cancer in clinical text, Named Entity Recognition is needed. In this paper, the Clinical Entity Finder, a machine-learning tool trained on annotated clinical text from a Swedish internal medicine emergency unit, is evaluated on cervical cancer records. The Clinical Entity Finder identifies entities of the types body part, finding, and disorder, and is extended with negation detection using the rule-based tool NegEx to distinguish between negated and non-negated entities. To measure the performance of the tools on this new domain, two physicians annotated a set of clinical notes from the health records of cervical cancer patients. The inter-annotator agreement for finding, disorder, and body part obtained an average F-score of 0.677, and the Clinical Entity Finder extended with NegEx had an average F-score of 0.667.
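
As a rough illustration of how rule-based negation detection in the spirit of NegEx can be layered on top of entity recognition, the sketch below marks an entity as negated when a trigger word precedes it within a short window, unless a terminating word intervenes. The trigger list, the terminator list, the window size, and the English example sentence are assumptions made for illustration; they are not the NegEx rule set or the Swedish adaptation evaluated in the paper.

```python
# Simplified, NegEx-inspired negation check (illustrative only).
# The word lists and WINDOW below are hypothetical, not the real NegEx lexicon.
NEGATION_TRIGGERS = {"no", "not", "without", "denies"}
TERMINATORS = {"but", "however", "although"}
WINDOW = 5  # how many tokens before the entity to inspect


def is_negated(tokens, entity_start):
    """Scan backwards from the entity; a terminator ends the negation scope."""
    window_start = max(0, entity_start - WINDOW)
    for token in reversed(tokens[window_start:entity_start]):
        word = token.lower()
        if word in TERMINATORS:
            return False
        if word in NEGATION_TRIGGERS:
            return True
    return False


def label_entities(tokens, entity_spans):
    """Attach a negation flag to each (start, end, type) entity span."""
    labelled = []
    for start, end, entity_type in entity_spans:
        text = " ".join(tokens[start:end])
        labelled.append((text, entity_type, is_negated(tokens, start)))
    return labelled


tokens = "Patient denies abdominal pain but reports bleeding".split()
# Spans as they might come from an entity recogniser: (start, end, type).
spans = [(2, 4, "finding"), (6, 7, "finding")]
result = label_entities(tokens, spans)
# "abdominal pain" is flagged as negated; "bleeding" is not, because "but"
# closes the scope of "denies".
```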

Place, publisher, year, edition, pages
American Medical Informatics Association, 2015
Series
AMIA Annual Symposium Proceedings, ISSN 1559-4076, E-ISSN 1942-597X
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-123947 (URN)
26958270 (PubMedID)
Conference
AMIA 2015 Annual Symposium, San Francisco, CA, November 14 - 18, 2015
Available from: 2015-12-09 Created: 2015-12-09 Last updated: 2019-12-16 Bibliographically approved
2. Semi-supervised medical entity recognition: A study on Spanish and Swedish clinical corpora
2017 (English) In: Journal of Biomedical Informatics, ISSN 1532-0464, E-ISSN 1532-0480, Vol. 71, p. 16-30. Article in journal (Refereed) Published
Abstract [en]

Objective: The goal of this study is to investigate entity recognition within Electronic Health Records (EHRs), focusing on Spanish and Swedish. Of particular importance is a robust representation of the entities. In our case, we utilized unsupervised methods to generate such representations.

Methods: The significance of this work rests on its experimental layout. The experiments were carried out under the same conditions for both languages. Several classification approaches were explored: maximum probability, CRF, Perceptron, and SVM. The classifiers were enhanced by means of ensembles of semantic spaces and ensembles of Brown trees. In order to mitigate sparsity of data without a significant increase in the dimension of the decision space, we propose clustered representations: hierarchical Brown clustering, represented by trees, and vector quantization of each semantic space.

Results: The results showed that the semi-supervised approaches significantly improved on standard supervised techniques for both languages. Moreover, clustering the semantic spaces contributed to the quality of the entity recognition while keeping the dimension of the feature space two orders of magnitude lower than when directly using the semantic spaces.

Conclusions: The contributions of this study are: (a) a set of thorough experiments that enable comparisons regarding the influence of different types of features on different classifiers, exploring two languages other than English; and (b) the use of ensembles of clusters of Brown trees and semantic spaces on EHRs to tackle the problem of scarcity of available annotated data.
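
One common way to turn hierarchical Brown clusters into classifier features is to use prefixes of each word's path in the cluster tree, so that shorter prefixes give coarser clusters. The sketch below illustrates that general idea only; it is not necessarily the scheme used in the paper, and the cluster paths and prefix lengths are invented for the example.

```python
# Hypothetical Brown-cluster paths (bit strings); real paths would come from
# running Brown clustering on a large unlabelled corpus.
BROWN_PATHS = {
    "fever": "110100",
    "cough": "110101",
    "liver": "011010",
}
PREFIX_LENGTHS = (2, 4, 6)  # assumed granularities, coarse to fine


def brown_features(word):
    """Return one feature per prefix length, or a single UNK feature."""
    path = BROWN_PATHS.get(word.lower())
    if path is None:
        return ["brown=UNK"]
    return [f"brown{n}={path[:n]}" for n in PREFIX_LENGTHS if n <= len(path)]


fever = brown_features("Fever")
cough = brown_features("cough")
# The two words share their coarse features and differ only at the finest level.
```

Because the number of distinct prefixes is far smaller than the vocabulary, features of this kind add little to the dimension of the feature space while still letting a classifier generalise across words in the same cluster.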

Keywords
Medical entity recognition, Supervised and unsupervised learning, Health records
National Category
Computer and Information Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-147956 (URN)
10.1016/j.jbi.2017.05.009 (DOI)
000409394600002 ()
28526460 (PubMedID)
Available from: 2017-10-17 Created: 2017-10-17 Last updated: 2019-12-02 Bibliographically approved
3. Efficient Encoding of Pathology Reports Using Natural Language Processing
2017 (English) In: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017 / [ed] Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, Ivelina Nikolova, Irina Temnikova, Association for Computational Linguistics, 2017, p. 778-783. Conference paper, Published paper (Refereed)
Abstract [en]

In this article, we present a system that extracts information from pathology reports. The reports are written in Norwegian and contain free text describing prostate biopsies. Currently, these reports are manually coded for research and statistical purposes by trained experts at the Cancer Registry of Norway, where the coders extract values for a set of predefined fields that are specific to prostate cancer. The presented system is rule-based and achieves an average F-score of 0.91 for the fields Gleason grade, Gleason score, the number of biopsies that contain tumor tissue, and the orientation of the biopsies. The system also identifies reports that contain ambiguity or other content that should be reviewed by an expert. The system shows potential to encode the reports considerably faster, with fewer resources, and with quality similar to that of manual encoding.
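
To make the idea of rule-based field extraction with a review flag concrete, here is a minimal sketch that pulls a Gleason grade pair and score from text and flags reports that look ambiguous. The regular expression, the function name, the output fields, and the English example strings are assumptions for illustration; the actual system works on Norwegian reports and its rules are not reproduced here.

```python
import re

# Hypothetical pattern for strings such as "Gleason score 3+4=7" or "Gleason 4 + 3".
GLEASON_PATTERN = re.compile(
    r"gleason\s*(?:score|grade?)?\s*:?\s*([1-5])\s*\+\s*([1-5])(?:\s*=\s*(\d{1,2}))?",
    re.IGNORECASE,
)

NEEDS_REVIEW = {"primary": None, "secondary": None, "score": None, "needs_review": True}


def extract_gleason(report_text):
    """Return the Gleason grades and score, or flag the report for expert review."""
    found = set()
    inconsistent = False
    for match in GLEASON_PATTERN.finditer(report_text):
        primary, secondary = int(match.group(1)), int(match.group(2))
        stated_sum = match.group(3)
        # A stated sum that disagrees with the two grades is treated as ambiguous.
        if stated_sum is not None and int(stated_sum) != primary + secondary:
            inconsistent = True
        found.add((primary, secondary))
    # Exactly one consistent grade pair is accepted; anything else goes to review.
    if len(found) == 1 and not inconsistent:
        primary, secondary = next(iter(found))
        return {
            "primary": primary,
            "secondary": secondary,
            "score": primary + secondary,
            "needs_review": False,
        }
    return dict(NEEDS_REVIEW)


clear_case = extract_gleason("Gleason score 3+4=7 in two of six biopsies.")
ambiguous_case = extract_gleason("Gleason 3+4 on the left, Gleason 4+5 on the right.")
```

In this sketch, a report with no match is also routed to review, which is a design choice rather than something stated in the abstract.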

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2017
Keywords
information extraction, natural language processing
National Category
Language Technology (Computational Linguistics)
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-150182 (URN)
10.26615/978-954-452-049-6_100 (DOI)
978-954-452-048-9 (ISBN)
978-954-452-049-6 (ISBN)
Conference
International Conference on Recent Advances in Natural Language Processing (RANLP '17), Varna, Bulgaria, 2-8 September, 2017
Available from: 2017-12-13 Created: 2017-12-13 Last updated: 2019-11-28 Bibliographically approved
4. Recent advances in Swedish and Spanish medical entity recognition in clinical texts using deep neural approaches
2019 (English) In: BMC Medical Informatics and Decision Making, ISSN 1472-6947, E-ISSN 1472-6947, Vol. 14, article id 274. Article in journal (Refereed) Published
Abstract [en]

Background

Text mining and natural language processing of clinical text, such as notes from electronic health records, require specific consideration of the specialized characteristics of these texts. Deep learning methods could potentially mitigate domain-specific challenges such as limited access to in-domain tools and data sets.

Methods

A bi-directional Long Short-Term Memory network is applied to clinical notes in Spanish and Swedish for the task of medical named entity recognition. Several types of embeddings, generated from both in-domain and out-of-domain text corpora, and a number of generation and combination strategies for embeddings have been evaluated in order to investigate different input representations and the influence of domain on the final results.

Results

For Spanish, a micro-averaged F1-score of 75.25 was obtained, and for Swedish, the corresponding score was 76.04. The best results for both languages were achieved using embeddings generated from in-domain corpora extracted from electronic health records, but embeddings generated from related domains were also found to be beneficial.

Conclusions

A recurrent neural network with in-domain embeddings improved the medical named entity recognition compared to shallow learning methods, showing this combination to be suitable for entity recognition in clinical text for both languages.
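
For readers unfamiliar with the metric reported above, a micro-averaged F1-score pools true positives, false positives, and false negatives over all entity types and documents before computing precision and recall. The sketch below shows that calculation under the assumption of exact span-and-type matching; the matching criterion, the function name, and the toy data are assumptions and may differ from the evaluation used in the paper.

```python
def micro_f1(gold_by_doc, pred_by_doc):
    """Micro-averaged F1 over documents; entities are (start, end, type) tuples."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_by_doc, pred_by_doc):
        tp += len(gold & pred)   # predicted entities that exactly match gold
        fp += len(pred - gold)   # predicted entities with no gold counterpart
        fn += len(gold - pred)   # gold entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: two documents, each a set of annotated entities.
gold = [
    {(0, 2, "disorder"), (5, 6, "finding")},
    {(1, 2, "body part")},
]
pred = [
    {(0, 2, "disorder")},
    {(1, 2, "body part"), (3, 4, "finding")},
]
score = micro_f1(gold, pred)
```

With these toy sets there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the score is 2/3.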

Keywords
Clinical text mining, Unstructured electronic health records, Medical named entity recognition, Recurrent neural network
National Category
Computer and Information Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-176296 (URN)
10.1186/s12911-019-0981-y (DOI)
Available from: 2019-11-28 Created: 2019-11-28 Last updated: 2019-12-25 Bibliographically approved
5. Learning to predict cervical cancer from Swedish health records by mining hierarchal representations
2019 (English) Article in journal (Refereed) Submitted
National Category
Computer and Information Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-176297 (URN)
Available from: 2019-11-28 Created: 2019-11-28 Last updated: 2019-12-02 Bibliographically approved

Open Access in DiVA

Mining Clinical Text in Cancer Care (1920 kB), 38 downloads
File information
File name: FULLTEXT01.pdf
File size: 1920 kB
Checksum (SHA-512): 49414b0e85877c1678d19c1c44eb4db64363272aea9cd634f94006fbc1848de20ac66e5437a6f2052369925daa067959b1210f14190db0041d8bc6077f68349b
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Weegar, Rebecka
By organisation
Department of Computer and Systems Sciences
Computer and Information Sciences
