  • 1. Ahltorp, Magnus
    Skeppstedt, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Using text prediction for facilitating input and improving readability of clinical text. 2013. In: MedInfo 2013: Proceedings of the 14th World Congress on Medical and Health Informatics / [ed] Christoph Ulrich Lehmann, Elske Ammenwerth, Christian Nøhr, IOS Press, 2013, p. 1149. Conference paper (Refereed)
    Abstract [en]

    Text prediction has the potential for facilitating and speeding up the documentation work within health care, making it possible for health personnel to allocate less time to documentation and more time to patient care. It also offers a way to produce clinical text with fewer misspellings and abbreviations, increasing readability. We have explored how text prediction can be used for input of clinical text, and how the specific challenges of text prediction in this domain can be addressed. A text prediction prototype was constructed using data from a medical journal and from medical terminologies. This prototype achieved keystroke savings of 26% when evaluated on texts mimicking authentic clinical text. The results are encouraging, indicating that there are feasible methods for text prediction in the clinical domain.
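
    The abstract above reports keystroke savings without defining the metric. A minimal sketch, assuming the common definition (the share of characters the user did not have to type); the function name and the example numbers are hypothetical and not taken from the paper:

```python
def keystroke_savings(total_chars: int, keystrokes_used: int) -> float:
    """Return the fraction of keystrokes saved compared with typing every character."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return (total_chars - keystrokes_used) / total_chars


# Hypothetical example: a 200-character note entered with 148 keystrokes.
savings = keystroke_savings(200, 148)
print(f"Keystroke savings: {savings:.0%}")
```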

  • 2. Alfalahi, Alyaa
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Brissman, Sara
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Pseudonymisation of Personal Names and other PHIs in an Annotated Clinical Swedish Corpus. 2012. In: LREC 2012, Eighth International Conference on Language Resources and Evaluation / [ed] Nicoletta Calzolari et al., 2012. Conference paper (Refereed)
    Abstract [en]

    Today a large number of patient records are produced and these records contain valuable information, often in free text, about the medical treatment of patients. Since these records contain information that can reveal the identity of patients, known as protected health information (PHI), the records cannot easily be made available for the research community. In this research we have used a PHI-annotated clinical corpus, written in Swedish, that we have pseudonymised. Pseudonymisation means replacing the sensitive information with fictive information; for example, real personal names are replaced with fictive personal names based on the gender of the real names and family relations. We have evaluated our results, and our five respondents, of whom three were clinicians, found that the clinical text looks real and is readable. We have also added pseudonymisation for telephone numbers, locations, health care units, dates and ages. In this paper we also present the entire de-identification and pseudonymisation process of a sample clinical text.

  • 3. Allvin, Helen
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Carlsson, Elin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Danielsson-Ojala, Riitta
    Daudaravieius, Vidas
    Hassel, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kokkinakis, Dimitrios
    Lundgrén-Laine, Heljä
    Nilsson, Gunnar H.
    Nytrø, Øystein
    Salanterä, Sanna
    Skeppstedt, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Suominen, Hanna
    Velupillai, Sumithra
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Characteristics of Finnish and Swedish intensive care nursing narratives: a comparative analysis to support the development of clinical language technologies. 2011. In: Journal of Biomedical Semantics, ISSN 2041-1480, Vol. 2, no. S1, pp. 1-11. Article in journal (Refereed)
    Abstract [en]

    Background: Free text is helpful for entering information into electronic health records, but reusing it is a challenge. The need for language technology for processing Finnish and Swedish healthcare text is therefore evident; however, Finnish and Swedish are linguistically very dissimilar. In this paper we present a comparison of characteristics in Finnish and Swedish free-text nursing narratives from intensive care. This creates a framework for characterising and comparing clinical text and lays the groundwork for developing clinical language technologies.

    Methods: Our material included daily nursing narratives from one intensive care unit in Finland and one in Sweden. Inclusion criteria for patients were an inpatient period of at least five days and an age of at least 16 years. We performed a comparative analysis as part of a collaborative effort between Finnish- and Swedish-speaking healthcare and language technology professionals that included both qualitative and quantitative aspects. The qualitative analysis addressed the content and structure of three average-sized health records from each country. In the quantitative analysis 514 Finnish and 379 Swedish health records were studied using various language technology tools.

    Results: Although the two languages are not closely related, nursing narratives in Finland and Sweden had many properties in common. Both made use of specialised jargon and their content was very similar. However, many of these characteristics were challenging regarding development of language technology to support producing and using clinical documentation.

    Conclusions: The way Finnish and Swedish intensive care nursing was documented was not country or language dependent, but shared a common context, principles and structural features, and even similar vocabulary elements. Technology solutions are therefore likely to be applicable to a wider range of natural languages, but they need linguistic tailoring.

    Availability: The Finnish and Swedish data can be found at: http://www.dsv.su.se/hexanord/data/

  • 4. Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    De-identifying health records by means of active learning. 2012. Conference paper (Refereed)
    Abstract [en]

    An experiment on classifying words in Swedish health records as belonging to one of eight protected health information (PHI) classes, or to the non-PHI class, by means of active learning has been conducted, in which three selection strategies were evaluated in conjunction with random forests: the commonly employed approach of choosing the most uncertain examples, choosing randomly, and choosing the most certain examples. Surprisingly, random selection outperformed choosing the most uncertain examples with respect to ten considered performance metrics. Moreover, choosing the most certain examples outperformed random selection with respect to nine out of ten metrics.
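
    The three selection strategies named in the abstract above can be sketched as follows. This is an illustration only, not the authors' implementation: the function name, the strategy labels and the use of a single confidence value per pool item are assumptions made for the example.

```python
import random


def select_examples(confidences, k, strategy, seed=0):
    """Pick k indices from an unlabeled pool.

    confidences: for each pool item, the classifier's confidence in its
    predicted class (a value between 0 and 1). Hypothetical input format.
    """
    indices = list(range(len(confidences)))
    if strategy == "most_uncertain":
        # Lowest confidence first.
        indices.sort(key=lambda i: confidences[i])
    elif strategy == "most_certain":
        # Highest confidence first.
        indices.sort(key=lambda i: confidences[i], reverse=True)
    elif strategy == "random":
        random.Random(seed).shuffle(indices)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return indices[:k]


# Hypothetical pool of four items.
pool = [0.55, 0.99, 0.70, 0.51]
print(select_examples(pool, 2, "most_uncertain"))
print(select_examples(pool, 2, "most_certain"))
print(select_examples(pool, 2, "random"))
```

    In an active learning loop, the selected items would be labeled by an annotator and added to the training set before the classifier is retrained.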

  • 5. Caccamisi, A.
    Jørgensen, L.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Rosenlund, M.
    Automatic Extraction and Classification of Patients’ Smoking Status from Free Text Using Natural Language Processing. 2016. In: Value in Health, ISSN 1098-3015, E-ISSN 1524-4733, Vol. 19, no. 7, A373. Article in journal (Refereed)
    Abstract [en]

    Objectives

    To develop a machine learning algorithm for automatic classification of smoking status (smoker, ex-smoker, non-smoker and unknown status) in EMRs, and validate the predictive accuracy compared to a rule-based method. Smoking is a leading cause of death worldwide and may introduce confounding in research based on real world data (RWD). Information on smoking is often documented in free text fields in Electronic Medical Records (EMRs), but structured RWD on smoking is sparse.

    Methods

    32 predictive models were trained with the Weka machine learning suite, tweaking sentence frequency, classifier type, tokenization and attribute selection using a database of 85,000 classified sentences. The models were evaluated using F-Score and Accuracy based on out-of-sample test data including 8,500 sentences. The error weight matrix was used to select the best model, assigning a weight to each type of misclassification and applying it to the models' confusion matrices.

    Results

    The best performing model was based on the Support Vector Machine (SVM) Sequential Minimal Optimization (SMO) classifier using a polynomial kernel with parameter C equal to 6 and a combination of unigrams and bigrams as tokens. Sentence frequency and attribute selection did not improve model performance. SMO achieved 98.25% accuracy and 0.982 F-Score versus 79.32% and 0.756, respectively, for the rule-based model.

    Conclusions

    A model using machine learning algorithms to automatically classify patients' smoking status was successfully developed. This algorithm would enable automatic assessment of smoking status directly from EMRs, obviating the need to extract complete case notes and manual classification.
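
    The Methods paragraph above mentions selecting the best model by applying an error weight matrix to each model's confusion matrix. A minimal sketch of one way this could work, assuming the weighted cost is the element-wise product summed over all cells; the weights, the counts and the model names are invented for illustration and are not from the study:

```python
def weighted_error(confusion, weights):
    """Sum the confusion counts multiplied element-wise by misclassification weights."""
    return sum(
        count * weight
        for count_row, weight_row in zip(confusion, weights)
        for count, weight in zip(count_row, weight_row)
    )


def select_best(models, weights):
    """Return the name of the model with the lowest weighted error."""
    return min(models, key=lambda name: weighted_error(models[name], weights))


# Class order: smoker, ex-smoker, non-smoker, unknown (rows = true, columns = predicted).
# Hypothetical weights: correct predictions cost 0; confusing smokers with
# non-smokers is assumed to be three times as costly as other errors.
weights = [
    [0, 1, 3, 1],
    [1, 0, 1, 1],
    [3, 1, 0, 1],
    [1, 1, 1, 0],
]

# Hypothetical confusion matrices for two candidate models.
models = {
    "model_a": [[50, 2, 5, 1], [3, 40, 2, 0], [4, 1, 60, 2], [0, 1, 1, 30]],
    "model_b": [[48, 6, 1, 3], [4, 38, 3, 0], [1, 3, 61, 2], [1, 1, 0, 30]],
}

for name, confusion in models.items():
    print(name, weighted_error(confusion, weights))
print("best:", select_best(models, weights))
```

    In this made-up example model_b makes more errors in total than model_a but has the lower weighted cost, because fewer of its errors fall in the heavily weighted cells.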

  • 6. Caccamisi, Andrea
    Jørgensen, Leif
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Rosenlund, Mats
    Automatic extraction and translation of the patients smoking status from free text using natural language processing. 2016. Conference paper (Refereed)
  • 7. Carlsson, Elin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Influence of Module Order on Rule-Based De-identification of Personal Names in Electronic Patient Records Written in Swedish. 2010. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation, LREC 2010, Valletta, Malta, May 19-21, 2010, European Language Resources Association (ELRA), 2010, pp. 3442-3446. Conference paper (Other academic)
    Abstract [en]

    Electronic patient records (EPRs) are a valuable resource for research but for confidentiality reasons they cannot be used freely. In order to make EPRs available to a wider group of researchers, sensitive information such as personal names has to be removed. De-identification is a process that makes this possible. Both rule-based methods and statistical, machine-learning-based methods exist to perform de-identification, but the latter require annotated training material, which exists only very sparsely for patient names. It is therefore necessary to use rule-based methods for de-identification of EPRs. Not much is known, however, about the order in which the various rules should be applied and how the different rules influence precision and recall. This paper aims to answer this research question by implementing and evaluating four common rules for de-identification of personal names in EPRs written in Swedish: (1) dictionary name matching, (2) title matching, (3) common words filtering and (4) learning from previous modules. The results show that to obtain the highest recall and precision, the rules should be applied in the following order: title matching, common words filtering and dictionary name matching.

  • 8. Cerratto-Pargman, Teresa
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Knutsson, Ola
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Celikten, Emre
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Sneiders, Eriks
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    User Centered Development of Automatic E-mail Answering for the Public Sector. 2012. In: Human-Computer Interaction, Tourism and Cultural Heritage / [ed] Francisco Cipolla-Ficarra, Kim Veltman, Huang Chih-Fang, Miguel Cipolla-Ficarra, Andreas Kratky, Springer Berlin/Heidelberg, 2012, Vol. 7546, pp. 154-156. Conference paper (Refereed)
    Abstract [en]

    In Sweden, the use of e-mail by the public sector has become a key communication service between citizens and governmental authorities. Although the integration of e-mail in the public sector has certainly brought citizens and handling officers closer, it has also introduced a particular view of governmental authorities, for instance the idea that public service and information should be available to citizens any time, anywhere. Such a belief among citizens certainly puts high demands on the quality and efficiency of the e-service governmental authorities are capable of providing. In fact, the growing number of citizens’ electronic requests must be accurately answered in a limited time. In the research project IMAIL (Intelligent e-mail answering service for eGovernment) [1], we have focused on the work carried out at the Swedish Social Insurance Agency (SSIA), which exemplifies a governmental authority dealing with 500,000 e-mails per year on top of face-to-face meetings, phone calls and chat communication. With the objective of creating an e-mail client capable of easing and ensuring the quality of the public service provided by SSIA’s handling officers, we have developed a prototype that: (1) automatically answers a large part of the simple questions in the incoming e-mail flow, (2) improves the quality of the semi-automatic answers (i.e. answer templates), and finally, (3) reduces the workload for the handling officers. The development of the prototype is grounded in an empirical study conducted at the SSIA. The study comprises the analysis and clustering of 10,000 citizens’ e-mails and the working activity of 15 handling officers, collected through questionnaires, interviews and workshops [2].

  • 9. Dahl, Anders
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Özkan, Atilla
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Pathology text mining - on Norwegian prostate cancer reports. 2016. In: 2016 IEEE 32nd International Conference on Data Engineering Workshops (ICDEW): Proceedings, IEEE Computer Society, 2016, pp. 84-87. Conference paper (Refereed)
    Abstract [en]

    Pathology reports are written by pathologists, skilled physicians who know how to interpret disorders in various tissue samples from the human body. To obtain valuable statistics on the outcome of disorders, for example cancer, and on the effect of treatment, statistics are collected. Therefore, cancer pathology reports are interpreted and coded into databases at cancer registries. In Norway this task is carried out by the Cancer Registry of Norway (Kreftregisteret) by 25 different human coders. There is a need to automate this process. The authors of this article received 25 prostate cancer pathology reports written in Norwegian from the Cancer Registry of Norway, each documenting various stages of prostate cancer and the corresponding correct manual coding. A rule-based algorithm was produced that processed the reports in order to prototype automation. The output of the algorithm was compared to the output of the manual coding. The evaluation showed an average F-score of 0.94 on four of these data points, namely Total Malign, Primary Gleason, Secondary Gleason and Total Gleason, and a lower average F-score of 0.76 on all ten data points. The results are in line with previous research.

  • 10. Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Clinical text retrieval: an overview of basic building blocks and applications. 2014. In: Professional search in the modern world: COST action IC1002 on multilingual and multifaceted interactive information access / [ed] Georgios Paltoglou, Fernando Loizides, Preben Hansen, Springer, 2014, pp. 147-165. Chapter in book, part of anthology (Other academic)
    Abstract [en]

    This article describes information retrieval, natural language processing and text mining of electronic patient record text, also called clinical text. Clinical text is written by physicians and nurses to document the health care process of the patient. First we describe some characteristics of clinical text, followed by the automatic preprocessing of the text that is necessary for making it usable for some applications. We also describe some applications for clinicians including spelling and grammar checking, ICD-10 diagnosis code assignment, as well as other applications for hospital management such as ICD-10 diagnosis code validation and detection of adverse events such as hospital acquired infections. Part of the preprocessing makes the clinical text useful for faceted search, although clinical text already has some keys for performing faceted search such as gender, age, ICD-10 diagnosis codes, ATC drug codes, etc. Preprocessing makes use of ICD-10 codes and the SNOMED-CT textual descriptions. ICD-10 codes and SNOMED-CT are available in several languages and can be considered the modern Greek or Latin of medical language. The basic research presented here has its roots in the challenges described by the health care sector. These challenges have been partially solved in academia, and we believe the solutions will be adapted to the health care sector in real world applications.

  • 11. Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Sök och sammanfatta i Norden. 2006. In: Sprogteknologi i dansk perspektiv: en samling artikler om sprogforskning og automatisk sprogbehandling / [ed] Anna Braasch, København: Reitsel, 2006. Chapter in book, part of anthology (Other academic)
  • 12. Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Releasing a Swedish Clinical Corpus after Removing all Words – De-identification Experiments with Conditional Random Fields and Random Forests. 2012. In: Proceedings of the Third Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2012), 2012, pp. 45-48. Conference paper (Refereed)
    Abstract [en]

    Patient records contain valuable information in the form of both structured data and free text; however this information is sensitive since it can reveal the identity of patients. In order to allow new methods and techniques to be developed and evaluated on real world clinical data without revealing such sensitive information, researchers could be given access to de-identified records without protected health information (PHI), such as names, telephone numbers, and so on. One approach to minimizing the risk of revealing PHI when releasing text corpora from such records is to include only features of the words instead of the words themselves. Such features may include parts of speech, word length, and so on from which the sensitive information cannot be derived. In order to investigate what performance losses can be expected when replacing specific words with features, an experiment with two state-of-the-art machine learning methods, conditional random fields and random forests, is presented, comparing their ability to support de-identification, using the Stockholm EPR PHI corpus as a benchmark test. The results indicate severe performance losses when the actual words are removed, leading to the conclusion that the chosen features are not sufficient for the suggested approach to be viable.
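
    The idea of releasing only features of the words instead of the words themselves can be sketched roughly as below. The abstract names parts of speech and word length as example features; this sketch uses word length plus two simple surface features chosen for illustration, and omits part-of-speech tagging because that would need an external tagger. The function names and the sample tokens are hypothetical.

```python
def token_features(token: str) -> dict:
    """Describe a token by surface features only; the token itself is not kept."""
    return {
        "length": len(token),
        "is_capitalized": token[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
    }


def featurize(tokens):
    """Replace every token in a sequence with its feature dictionary."""
    return [token_features(token) for token in tokens]


# Hypothetical sample sentence (not from the Stockholm EPR PHI corpus).
for features in featurize(["Anna", "ringde", "08-123"]):
    print(features)
```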

  • 13. Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Hassel, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Skeppstedt, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Stockholm EPR Corpus: A Clinical Database Used to Improve Health Care. 2012. In: Proceedings of SLCT 2012: The Fourth Swedish Language Technology Conference, 2012, pp. 17-18. Conference paper (Other academic)
    Abstract [en]

    The care of patients is well documented in health records. Despite being a valuable source of information that could be mined by computers and used to improve health care, health records are not readily available for research. Moreover, the narrative parts of the records are noisy and need to be interpreted by domain experts. In this abstract we describe our experiences of gaining access to a database of electronic health records for research. We also highlight some important issues in this domain and describe a number of possible applications, including comorbidity networks, detection of hospital-acquired infections and adverse drug reactions, as well as diagnosis coding support.

  • 14. Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Henriksson, Aron
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Velupillai, Sumithra
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Weegar, Rebecka
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    HEALTH BANK - A Workbench for Data Science Applications in Healthcare. 2015. In: Industry Track Workshop, CEUR Workshop Proceedings, 2015, Vol. 1381, pp. 1-18. Conference paper (Refereed)
    Abstract [en]

    The enormous amounts of data that are generated in the healthcare process and stored in electronic health record (EHR) systems are an underutilized resource that, with the use of data science applications, can be exploited to improve healthcare. To foster the development and use of data science applications in healthcare, there is a fundamental need for access to EHR data, which is typically not readily available to researchers and developers. A relatively rare exception is the large EHR database, the Stockholm EPR Corpus, comprising data from more than two million patients, that has been made available to a limited group of researchers at Stockholm University. Here, we describe a number of data science applications that have been developed using this database, demonstrating the potential reuse of EHR data to support healthcare and public health activities, as well as facilitate medical research. However, in order to realize the full potential of this resource, it needs to be made available to a larger community of researchers, as well as to industry actors. To that end, we envision the provision of an infrastructure around this database called HEALTH BANK – the Swedish Health Record Research Bank. It will function both as a workbench for the development of data science applications and as a data exploration tool, allowing epidemiologists, pharmacologists and other medical researchers to generate and evaluate hypotheses. Aggregated data will be fed into a pipeline for open e-access, while non-aggregated data will be provided to researchers within an ethical permission framework. We believe that HEALTH BANK has the potential to promote a growing industry around the development of data science applications that will ultimately increase the efficiency and effectiveness of healthcare.

  • 15. Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Névéol, Aurélie
    Savova, Guergana
    Zweigenbaum, Pierre
    Didactic Panel: clinical Natural Language Processing in Languages Other Than English. 2014. In: AMIA Annual Symposium 2014: proceedings, American Medical Informatics Association, 2014, p. S 84-. Conference paper (Refereed)
    Abstract [en]

    Natural Language Processing (NLP) of clinical free-text has received a lot of attention from the scientific community. Clinical documents are routinely created across health care providing institutions and are generally written in the official language(s) of the country these institutions are located in. As a result, free-text clinical information is written in a large variety of languages. While most of the efforts for clinical NLP have focused on English, there is a strong need to extend this work to other languages, for instance in order to gain medical information about patient cohorts in geographical areas where English is not an official language. Furthermore, adapting current NLP methods developed for English to other languages may provide useful insight on the generalizability of algorithms and lead to increased robustness. This panel aims to provide an overview of clinical NLP for languages other than English, for example French, Swedish and Bulgarian, and to discuss future methodological advances of clinical NLP in a context that encompasses English as well as other languages.

  • 16. Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap. Informationssystem.
    Rimka, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kann, Viggo
    Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian. 2007. Conference paper (Other academic)
    Abstract [en]

    This paper presents how we adapted a website search engine for cross language information retrieval, using the Uplug word alignment tool for parallel corpora. We first studied the monolingual search queries posed by the visitors of the website of the Nordic Council, containing five different languages. In order to compare how well different types of bilingual dictionaries covered the most common queries and terms on the website we tried a collection of ordinary bilingual dictionaries, a small manually constructed trilingual dictionary and an automatically constructed trilingual dictionary, constructed from the news corpus in the website using Uplug. The precision and recall of the automatically constructed Swedish-English dictionary using Uplug were 71 and 93 percent, respectively. We found that precision and recall increase significantly in samples with high word frequency, but we could not confirm that POS-tags improve precision. The collection of ordinary dictionaries, consisting of about 200 000 words, covers only 41 of the top 100 search queries at the website. The automatically built trilingual dictionary combined with the small manually built trilingual dictionary, consisting of about 2 300 words, covers 36 of the top search queries.

  • 17. Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Rosell, Magnus
    Sneiders, Eriks
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Clustering E-Mails for the Swedish Social Insurance Agency - What Part of the E-Mail Thread Gives the Best Quality? 2010. In: Advances in Natural Language Processing / [ed] Hrafn Loftsson, Eiríkur Rögnvaldsson, Sigrún Helgadóttir, Berlin: Springer Verlag, 2010, pp. 115-120. Conference paper (Refereed)
    Abstract [en]

    We need to analyse a large number of e-mails sent by the citizens to the customer services department of a governmental organisation based in Sweden. To carry out this analysis we clustered a large number of e-mails with the aim of automatic e-mail answering. One issue that came up was whether we should use the whole e-mail including the thread or just the original query for the clustering. In this paper we describe this investigation. Our results show that only the query and the answering part should be used, but not necessarily the whole e-mail thread. The results clearly show that the original question contains more useful information than only the answer, although a combination is even better. Using the full e-mail thread does not downgrade the result.

  • 18. Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Sjöbergh, Jonas
    KTH, CSC.
    Sneiders, Eriks
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Comparing Manual Text Patterns and Machine Learning for Classification of E-Mails for Automatic Answering by a Government Agency. 2011. In: Computational Linguistics and Intelligent Text Processing: Proceedings, Part II / [ed] Alexander Gelbukh, Springer-Verlag Berlin Heidelberg, 2011. Conference paper (Other academic)
    Abstract [en]

    E-mails to government institutions as well as to large companies may contain a large proportion of queries that can be answered in a uniform way. We analysed and manually annotated 4,404 e-mails from citizens to the Swedish Social Insurance Agency, and compared two methods for detecting answerable e-mails: manually-created text patterns (rule-based) and machine learning-based methods. We found that the text pattern-based method gave much higher precision at 89 percent than the machine learning-based method that gave only 63 percent precision. The recall was slightly higher (66 percent) for the machine learning-based methods than for the text patterns (47 percent). We also found that 23 percent of the total e-mail flow was processed by the automatic e-mail answering system.
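
    The abstract above reports precision and recall separately for the two methods. As a worked illustration only (the F1 score is not reported in the abstract), the harmonic mean of the two figures can be computed as follows; the function and variable names are hypothetical:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as stated in the abstract.
text_patterns = f1_score(0.89, 0.47)
machine_learning = f1_score(0.63, 0.66)

print(f"text patterns:    F1 = {text_patterns:.3f}")
print(f"machine learning: F1 = {machine_learning:.3f}")
```

    On this single combined measure the two methods come out fairly close, even though their precision and recall differ considerably; which trade-off is preferable depends on how costly a wrongly sent automatic answer is.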

  • 19.
    Dalianis, Hercules
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Skeppstedt, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Creating and Evaluating a Consensus for Negated and Speculative Words in a Swedish Clinical Corpus. 2010. In: Proceedings of the Workshop on Negation and Speculation in Natural Language Processing (NeSp-NLP 2010) / [ed] Roser Morante, Caroline Sporleder, Antwerp: University of Antwerp, 2010, p. 5-13. Conference paper (Refereed)
    Abstract [en]

    In this paper we describe the creation of a consensus corpus that was obtained by combining three individual annotations of the same clinical corpus in Swedish. We used a few basic rules that were executed automatically to create the consensus. The corpus contains negation words, speculative words, uncertain expressions and certain expressions. We evaluated the consensus by using it for negation and speculation cue detection. We used Stanford NER, which is based on the machine learning algorithm Conditional Random Fields, for the training and detection. For comparison we also trained Stanford NER on the clinical part of the BioScope Corpus. For our clinical consensus corpus in Swedish we obtained a precision of 87.9 percent and a recall of 91.7 percent for negation cues, and for English with the BioScope Corpus we obtained a precision of 97.6 percent and a recall of 96.7 percent for negation cues.

  • 20.
    Dalianis, Hercules
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Velupillai, Sumithra
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    De-identifying Swedish clinical text - refinement of a gold standard and experiments with Conditional random fields. 2010. In: Journal of Biomedical Semantics, ISSN 2041-1480, Vol. 1:6. Article in journal (Refereed)
    Abstract [en]

    Background

    In order to perform research on the information contained in Electronic Patient Records (EPRs), access to the data itself is needed. This is often very difficult due to confidentiality regulations. The data sets need to be fully de-identified before they can be distributed to researchers. De-identification is a difficult task where the definitions of annotation classes are not self-evident.

    Results

    We present work on the creation of two refined variants of a manually annotated gold standard for de-identification, one created automatically and one created through discussions among the annotators. The data is a subset of the Stockholm EPR Corpus, a data set available within our research group. These are used for the training and evaluation of an automatic system based on the Conditional Random Fields algorithm. Evaluating with four-fold cross-validation on sets of around 4,000-6,000 annotation instances, we obtained very promising results for both gold standards: an F-score of around 0.80 for a number of experiments, with higher results for certain annotation classes. Moreover, the system found 49 instances that were counted as false positives but were verified to be true positives missed by the annotators.

    Conclusions

    Our intention is to make this gold standard, the Stockholm EPR PHI Corpus, available to other research groups in the future. Although it is slightly more time-consuming to create, we believe the manual consensus gold standard is the most valuable for further research. We also propose a set of annotation classes to be used for similar de-identification tasks.

  • 21.
    Dalianis, Hercules
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Velupillai, Sumithra
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    How Certain are Clinical Assessments?: Annotating Swedish Clinical Text for (Un)certainties, Speculations and Negations. 2010. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation, LREC 2010 / [ed] Nicoletta Calzolari, 2010, p. 3071-3075. Conference paper (Other academic)
    Abstract [en]

    Clinical texts contain a large amount of information. Some of this information is embedded in contexts where e.g. a patient status is reasoned about, which may lead to a considerable number of statements that indicate uncertainty and speculation. We believe that distinguishing such instances from factual statements will be very beneficial for automatic information extraction. We have annotated a subset of the Stockholm Electronic Patient Record Corpus for certain and uncertain expressions as well as speculative and negation keywords, with the purpose of creating a resource for the development of automatic detection of speculative language in Swedish clinical text. We have analyzed the results from the initial annotation trial by means of pairwise Inter-Annotator Agreement (IAA) measured with F-score. Our main findings are that IAA results for certain expressions and negations are very high, but for uncertain expressions and speculative keywords results are less encouraging. These instances need to be defined in more detail. With this annotation trial, we have created an important resource that can be used to further analyze the properties of speculative language in Swedish clinical text. Our intention is to release this subset to other research groups in the future after removing identifiable information.

  • 22.
    Dalianis, Hercules
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Xing, Haochun
    Zhang, Xin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Creating a Reusable English-Chinese Parallel Corpus for Bilingual Dictionary Construction. 2010. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation, LREC 2010, Valletta, Malta, May 19-21, 2010, European Language Resources Association (ELRA), 2010, p. 1700-1705. Conference paper (Refereed)
    Abstract [en]

    This paper first describes an experiment to construct an English-Chinese parallel corpus, then applies the Uplug word alignment tool to the corpus, and finally produces and evaluates an English-Chinese word list. The Stockholm English-Chinese Parallel Corpus (SEC) was created by downloading English-Chinese parallel corpora from a Chinese web site containing law texts that have been manually translated from Chinese to English. The parallel corpus contains 104 563 Chinese characters, equivalent to 59 918 Chinese words, and the corresponding English corpus contains 75 766 English words. However, Chinese writing does not use any delimiters to mark word boundaries, so we had to carry out word segmentation as a preprocessing step on the Chinese corpus. Moreover, since the parallel corpus was downloaded from the Internet, it is noisy with regard to the alignment between corresponding translated sentences. We therefore spent 60 hours of manual work aligning the sentences in the English and Chinese parallel corpus before performing automatic word alignment using Uplug. The word alignment with Uplug was carried out from English to Chinese. Nine respondents evaluated the resulting English-Chinese word list, restricted to entries with a frequency of three or more, and we obtained an accuracy of 73.1 percent.

  • 23. de Smedt, Koenraad
    et al.
    Liseth, Anja
    Lech, Till Christopher
    Hassel, Martin
    Wedekind, Jürgen
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Porting and evaluation of automatic summarization. 2004. In: Nordisk Sprogteknologi 2003: Årbog for Nordisk Språkteknologisk Forskningsprogram 2000-2004 / [ed] Holmboe, H., Museum Tusculanums Forlag, 2004. Chapter in book (Other academic)
  • 24.
    Ehrentraut, Claudia
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Ekholm, Markus
    Tanushi, Hideyuki
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Tiedemann, Jörg
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Detecting hospital-acquired infections: A document classification approach using support vector machines and gradient tree boosting. 2016. In: Health Informatics Journal, ISSN 1460-4582, E-ISSN 1741-2811. Article in journal (Refereed)
    Abstract [en]

    Hospital-acquired infections pose a significant risk to patient health, while their surveillance is an additional workload for hospital staff. Our overall aim is to build a surveillance system that reliably detects all patient records that potentially include hospital-acquired infections. This is to reduce the burden of having the hospital staff manually check patient records. This study focuses on the application of text classification using support vector machines and gradient tree boosting to the problem. Support vector machines and gradient tree boosting have never been applied to the problem of detecting hospital-acquired infections in Swedish patient records, and according to our experiments, they lead to encouraging results. The best result is yielded by gradient tree boosting, at 93.7 percent recall, 79.7 percent precision and 85.7 percent F1 score when using stemming. We can show that simple preprocessing techniques and parameter tuning can lead to high recall (which we aim for in screening patient records) with appropriate precision for this task.

  • 25.
    Ehrentraut, Claudia
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Ibrahim, Osama
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Text Analysis to support structuring and modelling a public policy problem: outline of an algorithm to extract inferences from textual data. 2014. In: DSV Writers Hut: proceedings, Stockholm: Department of Computer and Systems Sciences, Stockholm University, 2014. Conference paper (Other academic)
    Abstract [en]

    Policy making situations are real-world problems that exhibit complexity in that they are composed of many interrelated problems and issues. To be effective, policies must holistically address the complexity of the situation rather than propose solutions to single problems. Formulating and understanding the situation and its complex dynamics, therefore, is a key to finding holistic solutions. Analysis of text based information on the policy problem, using Natural Language Processing (NLP) and Text analysis techniques, can support modelling of public policy problem situations in a more objective way based on domain experts’ knowledge and scientific evidence. The objective behind this study is to support modelling of public policy problem situations, using text analysis of verbal descriptions of the problem. We propose a formal methodology for analysis of qualitative data from multiple information sources on a policy problem to construct a causal diagram of the problem. The analysis process aims at identifying key variables, linking them by cause-effect relationships and mapping that structure into a graphical representation that is adequate for designing action alternatives, i.e., policy options. This study describes the outline of an algorithm used to automate the initial step of a larger methodological approach, which is so far done manually. In this initial step, inferences about key variables and their interrelationships are extracted from textual data to support a better problem structuring. A small prototype for this step is also presented.

  • 26.
    Ehrentraut, Claudia
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap. Karolinska Institutet, Sweden.
    Sparrelid, Elda
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Detecting Healthcare-Associated Infections in Electronic Health Records: Evaluation of Machine Learning and Preprocessing Techniques. 2014. In: Proceedings of the 6th International Symposium on Semantic Mining in Biomedicine (SMBM 2014), University of Aveiro, 2014, p. 3-10. Conference paper (Refereed)
    Abstract [en]

    Healthcare-associated infections (HAI) are infections that patients acquire in the course of medical treatment. Being a severe public health problem, detecting and monitoring HAI in healthcare documentation is an important topic to address. Research on automated systems has increased over the past years, but performance is yet to be enhanced. The dataset in this study consists of 214 records obtained from a Point-Prevalence Survey. The records are manually classified into HAI and NoHAI records. Nine different preprocessing steps are carried out on the data. Two learning algorithms, Random Forest (RF) and Support Vector Machines (SVM), are applied to the data. The aim is to determine which of the two algorithms is more applicable to the task and if preprocessing methods will affect the performance. RF obtains the best performance results, yielding an F1-score of 85% and AUC of 0.85 when lemmatisation is used as a preprocessing technique. Irrespective of which preprocessing method is used, RF yields higher recall values than SVM, with a statistically significant difference for all but one preprocessing method. Regarding each classifier separately, the choice of preprocessing method led to no statistically significant improvement in performance results.

  • 27.
    Ehrentraut, Claudia
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Sundström, Karin
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Exploration of known and unknown early symptoms of cervical cancer and development of a symptom spectrum: Outline of a data and text mining based approach. 2015. In: Proceedings of the CAiSE-2015 Industry Track: co-located with 27th Conference on Advanced Information Systems Engineering (CAiSE 2015) / [ed] John Krogstie, Gustaf Juel-Skielse, Vandana Kabilan, 2015, p. 34-44. Conference paper (Refereed)
    Abstract [en]

    This position paper lays out the structure of some experiments to detect early symptoms of cervical cancer. We use a large corpus of electronic patient record texts in Swedish from Karolinska University Hospital from the years 2009-2010, from which we extracted in total 1,660 patients with the diagnosis code C53. We used a Named Entity Recogniser called Clinical Entity Finder to detect the diagnoses and symptoms expressed in these clinical texts, which contain in total 2,988,118 words. We found 28,218 symptoms and diagnoses for these 1,660 patients. We present and discuss some initial findings, and propose a set of experiments to find possible early symptoms, or at least a spectrum or fingerprint of early symptoms, of cervical cancer.

  • 28. Ehrentraut, Claudia
    et al.
    Tanushi, Hideyuki
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Tiedemann, Jörg
    Detection of Hospital Acquired Infections in sparse and noisy Swedish patient records: A machine learning approach using Naïve Bayes, Support Vector Machines and C4.5. 2012. In: Proceedings of the Sixth Workshop on Analytics for Noisy Unstructured Text Data, Association for Computing Machinery (ACM), 2012. Conference paper (Refereed)
    Abstract [en]

    Hospital Acquired Infections (HAI) pose a significant risk to patients’ health, while their surveillance is an additional workload for hospital medical staff and hospital management. Our overall aim is to build a system which reliably retrieves all patient records which potentially include HAI, to reduce the burden of manually checking patient records by the hospital staff. In other words, we emphasize recall when detecting HAI (aiming at 100%) with the highest precision possible. The present study is of experimental nature, focusing on the application of Naïve Bayes (NB), Support Vector Machines (SVM) and a C4.5 Decision Tree to the problem and the evaluation of the efficiency of this approach. The three classifiers showed an overall similar performance. SVM yielded the best recall value, 89.8%, for records that contain HAI. We present a machine learning approach as an alternative to rule-based systems, which are more common in this task. The classifiers were applied to a small and noisy dataset, generating results which point to the potential of using learning algorithms for detecting HAI. Further research will have to focus on optimizing the performance of the classifiers and testing them on larger datasets.

  • 29.
    Friedrich, Stefanie
    et al.
    Stockholms universitet, Naturvetenskapliga fakulteten, Institutionen för biokemi och biofysik.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Adverse drug event classification of health records using dictionary-based pre-processing and machine learning. 2015. In: Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis: LOUHI / [ed] Cyril Grouin, Thierry Hamon, Aurélie Névéol, Pierre Zweigenbaum, New York: The Association for Computational Linguistics, 2015, p. 121-130. Conference paper (Refereed)
    Abstract [en]

    A method to find adverse drug reactions in electronic health records written in Swedish is presented. A total of 14,751 health records were manually classified into four groups. The records are normalised by pre-processing using both dictionaries and manually created word lists. Three different supervised machine learning algorithms were used to find the best results: decision tree, random forest and LibSVM. The best performance on a test dataset was with LibSVM, obtaining a precision of 0.69, a recall of 0.66 and an F-score of 0.67. Our method found 865 of 981 true positives (88.2%) in a 3-class dataset, which is an improvement of 49.5% over previous approaches.
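
    The figures reported in this abstract can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the reported F-score is the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_score` is hypothetical and not taken from the paper.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values reported in the abstract for LibSVM on the test dataset.
f1 = f_score(0.69, 0.66)

# Share of true positives found in the 3-class dataset (865 of 981).
share_found = 865 / 981

print(round(f1, 2), round(100 * share_found, 1))
```

    Under that assumption, both rounded values agree with the figures given in the abstract (0.67 and 88.2%).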

  • 30.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kowalski, Stewart
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Generating Features for Named Entity Recognition by Learning Prototypes in Semantic Space: The Case of De-Identifying Health Records. 2014. In: 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE conference proceedings, 2014, p. 450-457. Conference paper (Refereed)
    Abstract [en]

    Creating sufficiently large annotated resources for supervised machine learning, and doing so for every problem and every domain, is prohibitively expensive. Techniques that leverage large amounts of unlabeled data, which are often readily available, may decrease the amount of data that needs to be annotated to obtain a certain level of performance, as well as improve performance when large annotated resources are indeed available. Here, the development of one such method is presented, where semantic features are generated by exploiting the available annotations to learn prototypical (vector) representations of each named entity class in semantic space, constructed by employing a model of distributional semantics (random indexing) over a large, unannotated, in-domain corpus. Binary features that describe whether a given word belongs to a specific named entity class are provided to the learning algorithm; the feature values are determined by calculating the (cosine) distance in semantic space to each of the learned prototype vectors and ascertaining whether they are below or above a given threshold, set to optimize Fβ-score. The proposed method is evaluated empirically in a series of experiments, where the case is health-record deidentification, a task that involves identifying protected health information (PHI) in text. It is shown that a conditional random fields model with access to the generated semantic features, in addition to a set of orthographic and syntactic features, significantly outperforms, in terms of F1-score, a baseline model without access to the semantic features. Moreover, the quality of the features is further improved by employing a number of slightly different models of distributional semantics in an ensemble. Finally, the way in which the features are generated allows one to optimize them for various Fβ-scores, giving some degree of control to trade off precision and recall.
    Methods that are able to improve performance on named entity recognition tasks by exploiting large amounts of unlabeled data may substantially reduce costs involved in creating annotated resources for every domain and every problem.
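
    The prototype-based feature generation described in this abstract can be illustrated roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the prototype is taken to be the centroid of the annotated word vectors, cosine similarity stands in for the cosine distance mentioned in the abstract, the thresholds are fixed by hand rather than tuned for an Fβ-score, and the vectors, class names and function names are all made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def learn_prototype(vectors):
    """Assumption: a class prototype is the centroid of its annotated vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def binary_features(word_vector, prototypes, thresholds):
    """One boolean per class: is the word close enough to the class prototype?"""
    return {
        cls: cosine_similarity(word_vector, proto) >= thresholds[cls]
        for cls, proto in prototypes.items()
    }


# Toy 3-dimensional "semantic space"; all vectors are invented for this sketch.
annotated = {
    "NAME": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "LOCATION": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.3]],
}
prototypes = {cls: learn_prototype(vecs) for cls, vecs in annotated.items()}

# Hypothetical thresholds; the abstract sets them to optimize an F-beta score.
thresholds = {"NAME": 0.9, "LOCATION": 0.9}

features = binary_features([0.85, 0.15, 0.05], prototypes, thresholds)
print(features)
```

    In this toy setting the resulting booleans would be passed to the sequence labeller alongside the other features; raising or lowering a threshold is what trades precision against recall for that class.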

  • 31.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap. Karolinska Institutet, Sweden.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Duneld, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Identifying adverse drug event information in clinical notes with distributional semantic representations of context. 2015. In: Journal of Biomedical Informatics, ISSN 1532-0464, E-ISSN 1532-0480, Vol. 57, p. 333-349. Article in journal (Refereed)
    Abstract [en]

    For the purpose of post-marketing drug safety surveillance, which has traditionally relied on the voluntary reporting of individual cases of adverse drug events (ADEs), other sources of information are now being explored, including electronic health records (EHRs), which give us access to enormous amounts of longitudinal observations of the treatment of patients and their drug use. Adverse drug events, which can be encoded in EHRs with certain diagnosis codes, are, however, heavily underreported. It is therefore important to develop capabilities to process, by means of computational methods, the more unstructured EHR data in the form of clinical notes, where clinicians may describe and reason around suspected ADEs. In this study, we report on the creation of an annotated corpus of Swedish health records for the purpose of learning to identify information pertaining to ADEs present in clinical notes. To this end, three key tasks are tackled: recognizing relevant named entities (disorders, symptoms, drugs), labeling attributes of the recognized entities (negation, speculation, temporality), and relationships between them (indication, adverse drug event). For each of the three tasks, leveraging models of distributional semantics – i.e., unsupervised methods that exploit co-occurrence information to model, typically in vector space, the meaning of words – and, in particular, combinations of such models, is shown to improve the predictive performance. The ability to make use of such unsupervised methods is critical when faced with large amounts of sparse and high-dimensional data, especially in domains where annotated resources are scarce.

  • 32.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap. Karolinska University Hospital.
    Hassel, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Exploration of Adverse Drug Reactions in Semantic Vector Space Models of Clinical Text. 2012. Conference paper (Refereed)
    Abstract [en]

    A novel method for identifying potential side-effects to medications through large-scale analysis of clinical data is here introduced and evaluated. By calculating distributional similarities for medication-symptom pairs based on co-occurrence information in a large clinical corpus, many known adverse drug reactions are successfully identified. These preliminary results suggest that semantic vector space models of clinical text could also be used to generate hypotheses about potentially unknown adverse drug reactions. In the best model, 50% of the terms in a list of twenty are considered to be conceivable side-effects. Among the medication-symptom pairs, however, diagnostic indications and terms related to the medication in other ways also appear. These relations need to be distinguished in a more refined method for detecting adverse drug reactions.

  • 33.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Zhao, Jing
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Modeling Electronic Health Records in Ensembles of Semantic Spaces for Adverse Drug Event Detection. 2015. In: 2015 IEEE International Conference on Bioinformatics and Biomedicine: Proceedings / [ed] Jun (Luke) Huan et al., IEEE Computer Society, 2015, p. 343-350. Conference paper (Refereed)
    Abstract [en]

    Electronic health records (EHRs) are emerging as a potentially valuable source for pharmacovigilance; however, adverse drug events (ADEs), which can be encoded in EHRs by a set of diagnosis codes, are heavily underreported. Alerting systems, able to detect potential ADEs on the basis of patient-specific EHR data, would help to mitigate this problem. To that end, the use of machine learning has proven to be both efficient and effective; however, challenges remain in representing the heterogeneous EHR data, which moreover tends to be high-dimensional and exceedingly sparse, in a manner conducive to learning high-performing predictive models. Prior work has shown that distributional semantics – that is, natural language processing methods that, traditionally, model the meaning of words in semantic (vector) space on the basis of co-occurrence information – can be exploited to create effective representations of sequential EHR data, not only free-text in clinical notes but also various clinical events such as diagnoses, drugs and measurements. When modeling data in semantic space, an important design decision concerns the size of the context window around an object of interest, which governs the scope of co-occurrence information that is taken into account and affects the composition of the resulting semantic space. Here, we report on experiments conducted on 27 clinical datasets, demonstrating that performance can be significantly improved by modeling EHR data in ensembles of semantic spaces, consisting of multiple semantic spaces built with different context window sizes. A follow-up investigation is conducted to study the impact on predictive performance as increasingly more semantic spaces are included in the ensemble, demonstrating that accuracy tends to improve with the number of semantic spaces, albeit not monotonically so.
    Finally, a number of different strategies for combining the semantic spaces are explored, demonstrating the advantage of early (feature) fusion over late (classifier) fusion. Ensembles of semantic spaces allow multiple views of (sparse) data to be captured (densely) and thereby enable improved performance to be obtained on the task of detecting ADEs in EHRs.

  • 34.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Zhao, Jing
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Modeling Heterogeneous Clinical Sequence Data in Semantic Space for Adverse Drug Event Detection. In: IEEE International Conference on Data Science and Advanced Analytics (DSAA), IEEE conference proceedings. Conference paper (Refereed)
    Abstract [en]

    The enormous amounts of data that are continuously recorded in electronic health record systems offer ample opportunities for data science applications to improve healthcare. There are, however, challenges involved in using such data for machine learning, such as high dimensionality and sparsity, as well as an inherent heterogeneity that does not allow the distinct types of clinical data to be treated in an identical manner. On the other hand, there are also similarities across data types that may be exploited, e.g., the possibility of representing some of them as sequences. Here, we apply the notions underlying distributional semantics, i.e., methods that model the meaning of words in semantic (vector) space on the basis of co-occurrence information, to four distinct types of clinical data: free-text notes, on the one hand, and clinical events, in the form of diagnosis codes, drug codes and measurements, on the other hand. Each semantic space contains continuous vector representations for every unique word and event, which can then be used to create representations of, e.g., care episodes that, in turn, can be exploited by the learning algorithm. This approach not only reduces sparsity but also takes into account, and explicitly models, similarities between various items, and it does so in an entirely data-driven fashion. Here, we report on a series of experiments using the random forest learning algorithm that demonstrate the effectiveness, in terms of accuracy and area under ROC curve, of the proposed representation form over the commonly used bag-of-items counterpart. The experiments are conducted on 27 real datasets that each involve the (binary) classification task of detecting a particular adverse drug event. It is also shown that combining structured and unstructured data leads to significant improvements over using only one of them.

  • 35.
    Henriksson, Aron
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Zhao, Jing
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Ensembles of randomized trees using diverse distributed representations of clinical events. 2016. In: BMC Medical Informatics and Decision Making, ISSN 1472-6947, E-ISSN 1472-6947, Vol. 16, 69. Journal article (Refereed)
    Abstract [en]

    Background: Learning deep representations of clinical events based on their distributions in electronic health records has been shown to allow for subsequent training of higher-performing predictive models compared to the use of shallow, count-based representations. The predictive performance may be further improved by utilizing multiple representations of the same events, which can be obtained by, for instance, manipulating the representation learning procedure. The question, however, remains how to make best use of a set of diverse representations of clinical events – modeled in an ensemble of semantic spaces – for the purpose of predictive modeling. Methods: Three different ways of exploiting a set of (ten) distributed representations of four types of clinical events – diagnosis codes, drug codes, measurements, and words in clinical notes – are investigated in a series of experiments using ensembles of randomized trees. Here, the semantic space ensembles are obtained by varying the context window size in the representation learning procedure. The proposed method trains a forest wherein each tree is built from a bootstrap replicate of the training set whose entire original feature set is represented in a randomly selected set of semantic spaces – corresponding to the considered data types – of a given context window size. Results: The proposed method significantly outperforms concatenating the multiple representations of the bagged dataset; it also significantly outperforms representing, for each decision tree, only a subset of the features in a randomly selected set of semantic spaces. A follow-up analysis indicates that the proposed method exhibits less diversity while significantly improving average tree performance. It is also shown that the size of the semantic space ensemble has a significant impact on predictive performance and that performance tends to improve as the size increases. 
Conclusions: The strategy for utilizing a set of diverse distributed representations of clinical events when constructing ensembles of randomized trees has a significant impact on predictive performance. The most successful strategy – significantly outperforming the considered alternatives – involves randomly sampling distributed representations of the clinical events when building each decision tree in the forest.

  • 36.
    Jacobson, Olof
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Applying deep learning on electronic health records in Swedish to predict healthcare-associated infections. 2016. In: Proceedings of the 15th Workshop on Biomedical Natural Language Processing, Association for Computational Linguistics, 2016, pp. 191-195. Conference paper (Refereed)
    Abstract [en]

    Detecting healthcare-associated infections poses a major challenge in healthcare. Using natural language processing and machine learning applied to electronic patient records is one approach that has been shown to work. However, the results indicated that there was room for improvement, and we have therefore applied deep learning methods. Specifically, we implemented a network of stacked sparse autoencoders and a network of stacked restricted Boltzmann machines. Our best results were obtained using the stacked restricted Boltzmann machines, with a precision of 0.79 and a recall of 0.88.

  • 37.
    Kanhov, Mattias
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Feng, Xuefeng
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Natural Language Generation from SNOMED Specifications. 2012. In: CLEFeHealth2012: The CLEF 2012 Workshop on Cross-Language Evaluation of Methods, Applications, and Resources for eHealth Document Analysis, 2012. Conference paper (Refereed)
    Abstract [en]

    SNOMED (Systematized Nomenclature of Medicine) is a comprehensive clinical terminology that contains almost 400,000 concepts. Since SNOMED is a formal language, it is hard to understand for users who are not acquainted with the formal specifications. Natural language generation (NLG) is a technique utilizing computers to create natural language descriptions from formal languages. In order to generate descriptions of SNOMED concepts, two NLG tools were implemented, for the English and Swedish versions of SNOMED respectively. The one for English used a natural language generator called ASTROGEN to produce description texts. This tool also applied several aggregation rules to make the texts shorter and easier to understand. The other tool used C#.Net as the programming language and applied a template-based generation technique to create concept explanations in Swedish. As a baseline, the same SNOMED concepts were presented in a tree structure browser. To evaluate the English NLG system, 19 SNOMED concepts were randomly chosen for the generation of text. Ten volunteers participated in this evaluation. Five of them estimated the accuracy of the texts and the others assessed the fluency. The sample texts got a mean score of 4.37 for accuracy and 4.47 for fluency (maximum score 5). To evaluate the Swedish NLG system, five concepts were randomly chosen for the generation of texts. In parallel, two physicians with knowledge of SNOMED manually created natural language descriptions of the same concepts. Both the manual and the system-generated natural language descriptions were evaluated and compared by four physicians in total. All respondents scored the manual natural language descriptions the highest, on average 83 of 100, while the system-generated natural language texts obtained around 68 of 100. All three respondents unanimously except one respondent (scoring 7 of 10) preferred the system-generated text.
This paper presents a possible way of using Natural Language Generation to explain the meaning of SNOMED concepts to people who are not familiar with the SNOMED formal language. The evaluation results indicate that NLG techniques can be used to implement this task.

  • 38. Karopka, Thomas
    et al.
    Mernitz, Gudrun
    Mogensen, Mads Lause
    Reich, Stefan
    Lövström, Rikard
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Leslie, Heather
    Behnke, Michael
    Minimum data set and standards in the context of nosocomial infection surveillance and antimicrobial stewardship. 2016. In: HEC 2016 Health - Exploring Complexity: An Interdisciplinary Systems Approach, IOS Press, 2016. Conference paper (Refereed)
    Abstract [en]

    Antimicrobial resistance (AMR), i.e., the ability of microbes such as bacteria, viruses, fungi and parasites to resist the actions of one or more antimicrobial drugs or agents, is a serious global threat. Bacterial antibiotic resistance poses the largest threat to public health. The prevention of antimicrobial infections and their spread relies heavily on infection control management, and requires urgent, coordinated action by many stakeholders. This is especially true for nosocomial infections, also known as healthcare-associated infections (HAIs), i.e., infections that are acquired in healthcare settings. It is known that continuous, systematic collection, analysis and interpretation of data relevant to nosocomial infections and feedback for the use by doctors and nurses can reduce the frequency of these infections. Data from one hospital are more valid and more effective when they are compared with those from other hospitals. In order to avoid false conclusions, comparisons are only possible when identical methods of data collection with fixed diagnostic definitions are used. The automatic aggregation of standardized data using data from electronic medical records (EMRs), lab data, surveillance data and data on antibiotic use would greatly enhance comparison and computerized decision support systems (CDSSs). Once standardized, data can be aggregated from unit to institutional, regional, national and EU level, analysed and fed back to enhance local decision support on antibiotic use and detection of nosocomial infections.

  • 39. Meystre, Stéphane M.
    et al.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Aberdeen, John
    Malin, Brad
    Automatic clinical text de-identification: is it worth it, and could it work for me? 2013. In: Proceedings of the 14th World Congress on Medical and Health Informatics / [ed] Christoph Ulrich Lehmann, Elske Ammenwerth, Christian Nøhr, IOS Press, 2013, Vol. 192, p. 1242. Conference paper (Refereed)
    Abstract [en]

    The increased use and adoption of Electronic Health Records, and the parallel growth in patient data available for secondary use by clinicians, researchers, and operational purposes, make patient confidentiality protection an increasingly important requirement and expectation. The laws protecting patient confidentiality typically require the informed consent of the patient to use data for research purposes, a requirement that can be waived if the data are de-identified. Several methods to automatically remove identifying information from clinical text have been tested experimentally over the last 10 years, guided by the HIPAA “Safe Harbor” methodology. This panel will focus on the issues related to the automatic de-identification of clinical text. It will include an overview of the domain, a demonstration of good examples of such applications in English and in Swedish with their main authors sharing development and adaptation experiences, and a discussion of the HIPAA “Safe Harbor” de-identification quality and the risk for re-identification of de-identified data. The difficulties and issues related to this task will be debated, as well as the main methods used and the performance and adaptability of these methods.

  • 40. Nizamuddin, Uddin
    et al.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Detection of Spelling Errors in Swedish Clinical Text. 2014. In: NorWEST2014, 2014. Conference paper (Refereed)
    Abstract [en]

    Spelling errors are common in clinical text because such text is written under time pressure and is mostly used for internal communication. To improve text mining and other types of text processing tools, spelling error detection and correction are needed. In this paper we count spelling errors in Swedish clinical text. The developed algorithm uses word lists for detection: a general Swedish dictionary, a medical dictionary and a list of abbreviations. The final algorithm was tested on a Swedish clinical corpus, yielding 12 per cent spelling errors. Error analysis showed that many correct words were flagged as errors by the algorithm because of inadequate word lists and faulty preprocessing such as lemmatization and compound splitting. After these correct words were manually removed from the list, the total proportion of spelling errors decreased to 7.6 per cent.
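
    The word-list approach described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names, the tokenisation rule and the tiny word lists are assumptions made for illustration, and the lemmatization and compound-splitting steps mentioned in the abstract are left out.

```python
import re

# Hypothetical tokeniser: runs of Unicode letters, so Swedish å, ä, ö are kept.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def find_unknown_tokens(text, word_lists):
    """Return the tokens that occur in none of the supplied word lists."""
    known = set()
    for words in word_lists:
        known.update(word.lower() for word in words)
    return [token for token in tokenize(text) if token not in known]


def flagged_proportion(text, word_lists):
    """Share of tokens flagged as possible spelling errors (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(find_unknown_tokens(text, word_lists)) / len(tokens)


# Toy word lists standing in for a general dictionary, a medical
# dictionary and an abbreviation list (all invented for this example).
general = {"har", "ont", "i"}
medical = {"buken"}
abbreviations = {"pat"}

note = "Pat har ont i bukne"
print(find_unknown_tokens(note, [general, medical, abbreviations]))
print(flagged_proportion(note, [general, medical, abbreviations]))
```

    A token missing from every list is only a candidate error. As the abstract notes, incomplete word lists make such a detector flag correct words too, so flagged tokens still need review.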

  • 41. Ntais, Georgios
    et al.
    Saroukos, Spyridon
    Berki, Eleni
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Development and Enhancement of a Stemmer for the Greek Language. 2016. In: Proceedings of the 20th Pan-Hellenic Conference on Informatics, Association for Computing Machinery (ACM), 2016, 3. Conference paper (Refereed)
    Abstract [en]

    Although there are three stemmers published for the Greek language, only the one presented in this paper and called Ntais’ stemmer is freely open and available, together with its enhancements and extensions according to Saroukos’ algorithm. The primary algorithm (Ntais’ algorithm) uses only capital letters and works with better performance than other past stemming algorithms for the Greek language, giving 92.1 percent correct results. Further extensions of the proposed stemming system (e.g. from capital to small letters) and more evaluation methods are presented according to a new and improved algorithm, Saroukos’ algorithm. Stemmer performance metrics are further used for evaluating the existing stemming system and algorithm and show how its accuracy and completeness are enhanced. The improvements were possible by providing an alternative implementation in the programming language PHP, which offers more syntactical rules and exceptions. The two versions of the stemming algorithm are tested and compared.

  • 42. Perez, Alicia
    et al.
    Weegar, Rebecka
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Casillas, Arantza
    Gojenola, Koldo
    Oronoz, Maite
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Semi-supervised medical entity recognition: A study on Spanish and Swedish clinical corpora. 2017. In: Journal of Biomedical Informatics, ISSN 1532-0464, E-ISSN 1532-0480, Vol. 71, pp. 16-30. Journal article (Refereed)
    Abstract [en]

    Objective: The goal of this study is to investigate entity recognition within Electronic Health Records (EHRs) focusing on Spanish and Swedish. Of particular importance is a robust representation of the entities. In our case, we utilized unsupervised methods to generate such representations. Methods: The significance of this work stands on its experimental layout. The experiments were carried out under the same conditions for both languages. Several classification approaches were explored: maximum probability, CRF, Perceptron and SVM. The classifiers were enhanced by means of ensembles of semantic spaces and ensembles of Brown trees. In order to mitigate sparsity of data, without a significant increase in the dimension of the decision space, we propose the use of clustered approaches of the hierarchical Brown clustering represented by trees and vector quantization for each semantic space. Results: The results showed that the semi-supervised approaches significantly improved standard supervised techniques for both languages. Moreover, clustering the semantic spaces contributed to the quality of the entity recognition while keeping the dimension of the feature-space two orders of magnitude lower than when directly using the semantic spaces. Conclusions: The contributions of this study are: (a) a set of thorough experiments that enable comparisons regarding the influence of different types of features on different classifiers, exploring two languages other than English; and (b) the use of ensembles of clusters of Brown trees and semantic spaces on EHRs to tackle the problem of scarcity of available annotated data.

  • 43.
    Skeppstedt, Maria
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Using active learning and pre-tagging for annotating clinical findings in health record text. 2012. In: Proceedings of the 5th International Symposium on Semantic Mining in Biomedicine (SMBM 2012) / [ed] Ananiadou, Sophia; Pyysalo, Sampo; Rebholz-Schuhmann, Dietrich; Rinaldi, Fabio; Salakoski, Tapio, Institute of Computational Linguistics, University of Zurich, 2012, pp. 98-99. Conference paper (Other (popular science, discussion, etc.))
    Abstract [en]

    A method that combines pre-tagging with a version of active learning is proposed for annotating named entities in clinical text.

  • 44.
    Skeppstedt, Maria
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Nilsson, Gunnar H.
    Department of Neurobiology, Care Sciences and Society, Karolinska Institutet, Stockholm.
    Retrieving disorders and findings: Results using SNOMED CT and NegEx adapted for Swedish. 2011. In: LOUHI 2011 Health Document Text Mining and Information Analysis 2011: Proceedings of LOUHI 2011 Third International Workshop on Health Document Text Mining and Information Analysis, Bled, Slovenia, July 6, 2011 / [ed] Øystein Nytrø, Laura Slaughter, Hans Moen, 2011, pp. 11-17. Conference paper (Other academic)
    Abstract [en]

    Access to reliable data from electronic health records is of high importance in several key areas in patient care, biomedical research, and education. However, many of the clinical entities are negated in the patient record text. Detecting what is a negation and what is not is therefore a key to high quality text mining. In this study we used the NegEx system adapted for Swedish to investigate negated clinical entities. We applied the system to a subset of free-text entries under a heading containing the word ‘assessment’ from the Stockholm EPR corpus, containing in total 23,171,559 tokens. Specifically, the explored entities were the SNOMED CT terms having the semantic categories ‘finding’ or ‘disorder’. The study showed that the proportion of negated clinical entities was around 9%. The results thus support that negations are abundant in clinical text and hence negation detection is vital for high quality text mining in the medical domain.

  • 45.
    Skeppstedt, Maria
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Rule-based Entity Recognition and Coverage of SNOMED CT in Swedish Clinical Text. 2012. In: LREC 2012 8th ELRA Conference on Language Resources and Evaluation: Proceedings, European Language Resources Association (ELRA), 2012, pp. 1250-1257. Conference paper (Refereed)
    Abstract [en]

    Named entity recognition of the clinical entities disorders, findings and body structures is needed for information extraction from unstructured text in health records. Clinical notes from a Swedish emergency unit were annotated and used for evaluating a rule- and terminology-based entity recognition system. This system used different preprocessing techniques for matching terms to SNOMED CT, and, one by one, four other terminologies were added. For the class body structure, the results improved with preprocessing, whereas only small improvements were shown for the classes disorder and finding. The best average results were achieved when all terminologies were used together. The entity body structure was recognised with a precision of 0.74 and a recall of 0.80, whereas lower results were achieved for disorder (precision: 0.75, recall: 0.55) and for finding (precision: 0.57, recall: 0.30). The proportion of entities containing abbreviations was higher for false negatives than for correctly recognised entities, and no entities containing more than two tokens were recognised by the system. Low recall for disorders and findings shows both that additional methods are needed for entity recognition and that there are many expressions in clinical text that are not included in SNOMED CT.
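
    Several abstracts in this list report precision and recall separately. As a reading aid, the sketch below combines the pairs quoted in this abstract into F-scores using the standard harmonic-mean formula. The function name `f_score` is an assumption, and the resulting F-scores are not reported in the abstract itself; they are computed here only for illustration.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Precision and recall pairs as quoted in the abstract above.
reported = {
    "body structure": (0.74, 0.80),
    "disorder": (0.75, 0.55),
    "finding": (0.57, 0.30),
}

for label, (precision, recall) in reported.items():
    print(f"{label}: F1 = {f_score(precision, recall):.2f}")
```

    Because the harmonic mean is pulled towards the lower of the two values, the low recall for disorder and finding weighs heavily on their combined scores.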

  • 46.
    Skeppstedt, Maria
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap. Karolinska University Hospital, Sweden; Karolinska Institutet, Sweden.
    Nilsson, Gunnar H.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Automatic recognition of disorders, findings, pharmaceuticals and body structures from clinical text: An annotation and machine learning study. 2014. In: Journal of Biomedical Informatics, ISSN 1532-0464, E-ISSN 1532-0480, Vol. 49, pp. 148-158. Journal article (Refereed)
    Abstract [en]

    Automatic recognition of clinical entities in the narrative text of health records is useful for constructing applications for documentation of patient care, as well as for secondary usage in the form of medical knowledge extraction. There are a number of named entity recognition studies on English clinical text, but less work has been carried out on clinical text in other languages. This study was performed on Swedish health records, and focused on four entities that are highly relevant for constructing a patient overview and for medical hypothesis generation, namely the entities: Disorder, Finding, Pharmaceutical Drug and Body Structure. The study had two aims: to explore how well named entity recognition methods previously applied to English clinical text perform on similar texts written in Swedish; and to evaluate whether it is meaningful to divide the more general category Medical Problem, which has been used in a number of previous studies, into the two more granular entities, Disorder and Finding. Clinical notes from a Swedish internal medicine emergency unit were annotated for the four selected entity categories, and the inter-annotator agreement between two pairs of annotators was measured, resulting in an average F-score of 0.79 for Disorder, 0.66 for Finding, 0.90 for Pharmaceutical Drug and 0.80 for Body Structure. A subset of the developed corpus was thereafter used for finding suitable features for training a conditional random fields model. Finally, a new model was trained on this subset, using the best features and settings, and its ability to generalise to held-out data was evaluated. This final model obtained an F-score of 0.81 for Disorder, 0.69 for Finding, 0.88 for Pharmaceutical Drug, 0.85 for Body Structure and 0.78 for the combined category Disorder + Finding. 
The obtained results, which are in line with or slightly lower than those for similar studies on English clinical text, many of them conducted using a larger training data set, show that the approaches used for English are also suitable for Swedish clinical text. However, a small proportion of the errors made by the model are less likely to occur in English text, showing that results might be improved by further tailoring the system to clinical Swedish. The entity recognition results for the individual entities Disorder and Finding show that it is meaningful to separate the general category Medical Problem into these two more granular entity types, e.g. for knowledge mining of co-morbidity relations and disorder-finding relations.

  • 47.
    Tanushi, Hideyuki
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Duneld, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap. Karolinska University Hospital.
    Skeppstedt, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Velupillai, Sumithra
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Negation Scope Delimitation in Clinical Text Using Three Approaches: NegEx, PyConTextNLP and SynNeg. 2013. In: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013) / [ed] Stephan Oepen, Kristin Hagen, Janne Bondi Johannessen, Linköping: Linköping University Electronic Press, 2013, pp. 387-474. Conference paper (Refereed)
    Abstract [en]

    Negation detection is a key component in clinical information extraction systems, as health record text contains reasoning in which the physician excludes different diagnoses by negating them. Many systems for negation detection rely on negation cues (e.g. not), but only a few studies have investigated whether the syntactic structure of the sentences can be used for determining the scope of these cues. In this paper we have compared three different systems for negation detection in Swedish clinical text (NegEx, PyConTextNLP and SynNeg), which have different approaches for determining the scope of negation cues. NegEx uses the distance between the cue and the disease, PyConTextNLP relies on a list of conjunctions limiting the scope of a cue, and in SynNeg the boundaries of the sentence units, provided by a syntactic parser, limit the scope of the cues. The three systems produced similar results, detecting negation with an F-score of around 80%, but using a parser had advantages when handling longer, complex sentences or short sentences with contradictory statements.
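
    The difference between distance-based and conjunction-limited scoping can be shown with a toy sketch. This is not the code of NegEx or PyConTextNLP: the cue list, the conjunction list, the window size and both function names are assumptions, and English tokens are used only to keep the example readable.

```python
# Hypothetical cue and conjunction lists, invented for this example.
NEGATION_CUES = {"no", "not", "denies"}
SCOPE_BREAKERS = {"but", "however", "although"}


def negated_by_window(tokens, entity_index, window=5):
    """Distance-based scope: a cue within `window` tokens before the entity negates it."""
    start = max(0, entity_index - window)
    return any(token in NEGATION_CUES for token in tokens[start:entity_index])


def negated_by_conjunction(tokens, entity_index):
    """Conjunction-limited scope: walk back from the entity and stop at a scope breaker."""
    for token in reversed(tokens[:entity_index]):
        if token in SCOPE_BREAKERS:
            return False  # the conjunction ends the scope of any earlier cue
        if token in NEGATION_CUES:
            return True
    return False


tokens = "no fever but severe cough".split()
cough = tokens.index("cough")
fever = tokens.index("fever")

print(negated_by_window(tokens, fever), negated_by_conjunction(tokens, fever))
print(negated_by_window(tokens, cough), negated_by_conjunction(tokens, cough))
```

    In this sentence both rules mark "fever" as negated, but only the window rule also marks "cough" as negated. This is the kind of short sentence with contradictory statements in which the choice of scope rule matters.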

  • 48.
    ul Muntaha, Sidrat
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Skeppstedt, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Entity Recognition of Pharmaceutical Drugs in Swedish Clinical Text. 2012. In: Proceedings of the Conference, 2012, pp. 77-78. Conference paper (Refereed)
    Abstract [en]

    An entity recognition system for expressions of pharmaceutical drugs, based on vocabulary lists from FASS, the Medical Subject Headings and SNOMED CT, achieved a precision of 94% and a recall of 74% when evaluated on assessment texts from Swedish emergency unit health records.

  • 49.
    Velupillai, Sumithra
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Hassel, Martin
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Nilsson, Gunnar
    Developing a standard for de-identifying electronic patient records written in Swedish: precision, recall and F-measure in a manual and computerized annotation trial. 2009. In: International Journal of Medical Informatics, ISSN 1386-5056, E-ISSN 1872-8243, Vol. 78, no. 12, pp. e19-e26. Journal article (Refereed)
    Abstract [en]

    Background

    Electronic patient records (EPRs) contain a large amount of information written in free text. This information is considered very valuable for research but is also very sensitive since the free text parts may contain information that could reveal the identity of a patient. Therefore, methods for de-identifying EPRs are needed. The work presented here aims to perform a manual and automatic Protected Health Information (PHI)-annotation trial for EPRs written in Swedish.

    Methods

    This study consists of two main parts: the initial creation of a manually PHI-annotated gold standard, and the porting and evaluation of an existing de-identification software written for American English to Swedish in a preliminary automatic de-identification trial. Results are measured with precision, recall and F-measure.

    Results

    This study reports fairly high Inter-Annotator Agreement (IAA) results on the manually created gold standard, especially for specific tags such as names. The average IAA over all tags was 0.65 F-measure (0.84 F-measure highest pairwise agreement). For name tags the average IAA was 0.80 F-measure (0.91 F-measure highest pairwise agreement). Porting a de-identification software written for American English to Swedish directly was unfortunately non-trivial, yielding poor results.

    Conclusion

    Developing gold standard sets as well as automatic systems for de-identification tasks in Swedish is feasible. However, discussions and definitions of identifiable information are needed, as well as further developments of both the tag sets and the annotation guidelines, in order to get a reliable gold standard. A completely new de-identification software needs to be developed.

  • 50.
    Velupillai, Sumithra
    et al.
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Kvist, Maria
    Stockholms universitet, Samhällsvetenskapliga fakulteten, Institutionen för data- och systemvetenskap.
    Factuality Levels of Diagnoses in Swedish Clinical Text. 2011. In: User Centred Networked Health Care - Proceedings of MIE 2011 / [ed] Anne Moen, Stig Kjær Andersen, Jos Aarts, Petter Hurlen, 2011, pp. 559-563. Conference paper (Refereed)
    Abstract [en]

    Different levels of knowledge certainty, or factuality levels, are expressed in clinical health record documentation. This information is currently not fully exploited, as the subtleties expressed in natural language cannot easily be machine analyzed. Extracting relevant information from knowledge-intensive resources such as electronic health records can be used for improving health care in general by e.g. building automated information access systems. We present an annotation model of six factuality levels linked to diagnoses in Swedish clinical assessments from an emergency ward. Our main findings are that overall agreement is fairly high (0.7/0.58 F-measure, 0.73/0.6 Cohen's κ, Intra/Inter). These distinctions are important for knowledge models, since only approx. 50% of the diagnoses are affirmed with certainty. Moreover, our results indicate that there are patterns inherent in the diagnosis expressions themselves conveying factuality levels, showing that certainty is not only dependent on context cues.
