Pseudonymization of Language Learner Data
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (Datorlingvistik) ORCID iD: 0000-0002-4838-6518
Göteborgs universitet. (Språkbanken)
2019 (English). In: Workshop om pseudonymisering av textdata, 2019. Conference paper, oral presentation with published abstract (Other academic)
Abstract [en]

We present de-identification and pseudonymization of a learner corpus within the ongoing research infrastructure project SweLL[1]. The main aim of the project is to make available a linguistically annotated corpus of essays written by second language (L2) learners of Swedish. To ensure that the data collected in the project can be used openly in research while protecting the subjects' integrity, we developed a data handling flow, a set of metadata about the learners, pseudonymization principles for learner texts, and tools that support pseudonymization. During data collection and storage, the data must be handled securely, and the participating subjects must be de-identified in the corpus: common personal identifiers such as names, ages, geographic places, and dates must be identified, masked, and eventually replaced. These identifiers may occur in the metadata about the learner and in the learner's text(s).

 

The SweLL project adopted a rather restrictive approach to the metadata describing each produced text and learner, so that learners are de-identified while the metadata still provides information important for research: the learner's gender, age given in 5-year intervals, total time in Sweden, education level, mother tongue, and languages spoken in various communicative situations. The metadata does not give the learner's exact date of birth, date of arrival in Sweden, country of origin, or nationality, nor any information about the educational establishment where the essays were collected.
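The coarsening of age into 5-year intervals can be illustrated with a minimal sketch. The function name `age_interval` and the choice of interval boundaries (for example 25-29 rather than 26-30) are assumptions made for illustration; the abstract does not specify how the SweLL intervals are aligned.

```python
def age_interval(age: int, width: int = 5) -> str:
    """Map an exact age to a coarse interval label.

    Assumption: intervals start at multiples of `width` (0-4, 5-9, ...).
    The actual SweLL boundaries are not given in the abstract.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


# Only the interval label would be stored in the metadata, not the exact age.
label = age_interval(27)
```

Storing only the label means that two learners aged 25 and 29 become indistinguishable on this attribute, which is the point of the coarsening.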

 

De-identification through metadata alone may not be sufficient, since the texts written by a learner may, and in fact often do, contain personal information about the learner. Pseudonymization involves identifying personal information that can be related to the subject (e.g. My name is Ali) and classifying that information, which is then masked with one of a set of predefined types (e.g. My name is first_name). As a first step, we manually mark up text segments that reveal personal information in the corpus data. The identified segments are categorized as personal names, institutions (schools, workplaces, sports teams), geographic data (such as country, city, region, area, street names, numbers), transportation types and line names/numbers, age, date, phone number, email address, personal web page, social security number, account number, certificate/license number, profession and education, sensitive information revealing physical or mental disabilities, political views, or unique family relations, and any other items not covered by the previous categories.
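The masking step can be sketched as follows, assuming the manual mark-up is available as character offsets paired with a category label. The `Span` representation and the function `mask_spans` are hypothetical and are not the format used by the SweLL tools.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start offset, end offset, category label).
Span = Tuple[int, int, str]


def mask_spans(text: str, spans: List[Span]) -> str:
    """Replace each annotated span with its category label."""
    pieces = []
    pos = 0
    for start, end, category in sorted(spans):
        if start < pos:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[pos:start])  # unannotated text is copied unchanged
        pieces.append(category)         # annotated text is replaced by its type
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces)


masked = mask_spans("My name is Ali", [(11, 14, "first_name")])
# masked == "My name is first_name"
```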

 

Each marked text string, together with its category, is then replaced in a systematic way so that the result reads as a "natural" text with improved reading flow. This step includes assigning a unique ID number to each entity within a category, so that if a particular entity is repeated in the text, it receives the same running number and can be replaced by the same word. We also add morphological information to each masked entity so that it can be replaced in the same morphological form as the original.
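A minimal sketch of the running-number assignment is given below. It assumes that the marked entities arrive in text order as (category, surface form) pairs and that repeated mentions can be matched by their lower-cased surface form; both are simplifications, and `assign_ids` is a hypothetical name. Morphological information is not modelled here, so inflected variants of the same entity would need to be normalized before the lookup.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def assign_ids(entities: List[Tuple[str, str]]) -> List[str]:
    """Give each distinct entity a running number within its category.

    Repeated mentions of the same entity receive the same number, so they
    can later be replaced by the same substitute word.
    """
    seen: Dict[str, Dict[str, int]] = defaultdict(dict)
    labels = []
    for category, surface in entities:
        ids = seen[category]
        key = surface.lower()  # assumption: case-insensitive surface match
        if key not in ids:
            ids[key] = len(ids) + 1
        labels.append(f"{category}_{ids[key]}")
    return labels


labels = assign_ids([
    ("first_name", "Ali"),
    ("city", "Uppsala"),
    ("first_name", "Sara"),
    ("first_name", "Ali"),
])
# labels == ["first_name_1", "city_1", "first_name_2", "first_name_1"]
```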

 

There are several ways to mask sensitive information through substitution: either by rendering or by replacement with another predefined token of the same category. Rendering is applied to information that can be drawn from general resource lists, such as first names and surnames; city and country names, nationalities, and languages; geographic names; street names; and names of schools, institutions, and workplaces. Replacement applies to strings whose information follows a particular format, for which general resource lists do not suffice. Such cases include middle names or initials and numerical information such as phone numbers or dates. In some cases, when the annotator does not know how to categorize a text string, the original text is kept but marked with a placeholder. A distinction is made between items that must be replaced because they are sensitive and items that might be sensitive and can be replaced or removed later.
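The two substitution strategies can be contrasted in a short sketch. It reflects one possible reading of the description above, not the SweLL implementation: rendering is modelled as a lookup in a resource list indexed by the entity's running number, and replacement as a format-preserving substitution of digits. The list `CITY_LIST` and the functions `render` and `replace_digits` are hypothetical.

```python
import random
from typing import List

# Hypothetical resource list; a real one would hold plausible city names.
CITY_LIST = ["Stad A", "Stad B", "Stad C"]


def render(entity_id: int, resource: List[str]) -> str:
    """Rendering: pick a substitute from a resource list.

    The same running number always yields the same substitute, so repeated
    mentions of one entity stay consistent within a text.
    """
    return resource[(entity_id - 1) % len(resource)]


def replace_digits(value: str, rng: random.Random) -> str:
    """Replacement: keep the format, swap every digit for a random digit."""
    return "".join(
        str(rng.randrange(10)) if ch.isdigit() else ch
        for ch in value
    )


city = render(1, CITY_LIST)
phone = replace_digits("070-123 45 67", random.Random(0))
```

In this sketch the separators in the phone number are preserved, so the substitute still looks like a phone number even though its digits no longer identify anyone.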

 

The pseudonymized corpus is under development, as are the tools supporting the pseudonymization process.

We expect the corpus and the tools to be released as open source by the end of 2020.

[1] https://spraakbanken.gu.se/eng/swell_infra

 

 

Place, publisher, year, edition, pages
2019.
Keywords [en]
pseudonymisation, GDPR, personal information
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
URN: urn:nbn:se:uu:diva-385921
OAI: oai:DiVA.org:uu-385921
DiVA, id: diva2:1326455
Conference
Workshop om pseudonymisering av textdata, 22 March 2019, Stockholm, Sweden
Projects
SweLL
Funder
Riksbankens Jubileumsfond, IN16-0464:1
Available from: 2019-06-18 Created: 2019-06-18 Last updated: 2019-08-27 Bibliographically approved

Other links

https://www.ida.liu.se/~larah03/Wshop190322/LLdata.docx

Author
Megyesi, Beáta