Automatic de-identification of case narratives from spontaneous reports in VigiBase
Independent thesis Advanced level (professional degree), 20 credits / 30 HE credits
Student thesis
Patient data is essential to research, but it is confidential and can only be used after approval from an ethical board and informed consent from the individual patient. Large amounts of patient data are therefore difficult to obtain unless sensitive information, such as names, ID numbers and contact details, is removed from the data through so-called de-identification. Uppsala Monitoring Centre maintains the world's largest database of individual case reports of suspected adverse drug reactions. As of today, there is no method for efficiently de-identifying the narrative text included in these reports, which causes countries such as the United States of America to exclude the narratives from their reports.
The aim of this thesis is to develop and evaluate a method for automatic de-identification of case narratives in reports from the WHO Global Individual Case Safety Report Database System, VigiBase. This report compares three different models, namely Regular Expressions, used for text pattern matching, and the statistical models Support Vector Machine (SVM) and Conditional Random Fields (CRF). Performance, advantages and disadvantages are discussed as well as how identified sensitive information is handled to maintain readability of the narrative text. The models developed in this thesis are also compared to existing solutions to the de-identification problem.
The 400 reports extracted from VigiBase were already well de-identified in terms of names, ID numbers and contact details, making it difficult to train statistical models on these categories. The reports did, however, contain plenty of dates and ages. For these categories, regular expressions would be sufficient to achieve good performance. Identifying entities in other categories requires more advanced methods such as SVM and CRF, which in turn require more data. This became apparent when the models were applied to the more information-rich i2b2 de-identification challenge benchmark data set, where the statistical models developed in this thesis performed at a level competitive with existing models in the literature.
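As an illustration of the regular-expression approach for dates and ages, a minimal Python sketch follows. The patterns, the placeholder tokens `[DATE]` and `[AGE]`, and the function name `deidentify` are assumptions made for this example; the abstract does not specify the expressions or the replacement scheme used in the thesis.

```python
import re

# Hypothetical patterns for illustration only; they cover just a few
# common formats and are not the expressions used in the thesis.
DATE_RE = re.compile(
    r"\b(?:\d{4}-\d{2}-\d{2}"            # ISO style, e.g. 2014-03-07
    r"|\d{1,2}[/.]\d{1,2}[/.]\d{2,4})\b"  # e.g. 07/03/2014 or 7.3.14
)

# Match only the number in phrases such as "67-year-old" or "45 years old",
# so the surrounding words stay in place and the sentence remains readable.
AGE_RE = re.compile(
    r"\b\d{1,3}(?=[\s-]*(?:years?|yrs?)[\s-]*old\b)",
    re.IGNORECASE,
)


def deidentify(narrative: str) -> str:
    """Replace dates and ages with category placeholders (assumed scheme)."""
    narrative = DATE_RE.sub("[DATE]", narrative)
    narrative = AGE_RE.sub("[AGE]", narrative)
    return narrative


if __name__ == "__main__":
    example = "A 67-year-old woman was admitted on 2014-03-07."
    print(deidentify(example))
```

Replacing each match with a category placeholder, rather than deleting it, is one possible way to keep the narrative readable after de-identification.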
UPTEC F, ISSN 1401-5757 ; 15054
de-identification, svm, crf, regex, VigiBase, i2b2
Computer and Information Science
Identifiers
URN: urn:nbn:se:uu:diva-262158
OAI: oai:DiVA.org:uu-262158
DiVA: diva2:852410
Uppsala Monitoring Centre
Master Programme in Engineering Physics
2015-02-26, 10:07 (Swedish)
Nyberg, Tomas
Cassel, Sofia