Digitala Vetenskapliga Arkivet

AfriWOZ: Corpus for Exploiting Cross-Lingual Transfer for Dialogue Generation in Low-Resource, African Languages
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. Masakhane. ORCID iD: 0000-0002-5582-2031
Masakhane.
Masakhane.
CIS.
2023 (English). In: IJCNN 2023 - International Joint Conference on Neural Networks, Conference Proceedings, Institute of Electrical and Electronics Engineers Inc., 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Dialogue generation is an important NLP task fraught with many challenges. The challenges become more daunting for low-resource African languages. To enable the creation of dialogue agents for African languages, we contribute the first high-quality dialogue datasets for 6 African languages: Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda and Yorùbá. There are a total of 9,000 turns, each language having 1,500 turns, which we translate from a portion of the English multi-domain MultiWOZ dataset. Subsequently, we benchmark the datasets by investigating and analyzing the effectiveness of modelling through transfer learning, utilizing state-of-the-art (SoTA) deep monolingual models: DialoGPT and BlenderBot. We compare the models with a simple seq2seq baseline using perplexity. In addition, we conduct human evaluation of single-turn conversations using majority votes and measure inter-annotator agreement (IAA). We find that the hypothesis that deep monolingual models learn some abstractions that generalize across languages holds. We observe human-like conversations, to different degrees, in 5 out of the 6 languages. The language with the most transferable properties is Nigerian Pidgin English, with a human-likeness score of 78.1%, of which 34.4% are unanimous. We freely provide the datasets and host the model checkpoints/demos on the HuggingFace hub for public access.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers Inc., 2023.
Series
Proceedings of the International Joint Conference on Neural Networks, ISSN 2161-4393, E-ISSN 2161-4407
Keywords [en]
crosslingual, dialogue systems, low-resource, multilingual, NLG
National Category
Language Technology (Computational Linguistics); Computer Sciences
Research subject
Machine Learning
Identifiers
URN: urn:nbn:se:ltu:diva-101305
DOI: 10.1109/IJCNN54540.2023.10191208
ISI: 001046198701045
Scopus ID: 2-s2.0-85169561924
ISBN: 978-1-6654-8868-6 (print)
ISBN: 978-1-6654-8867-9 (electronic)
OAI: oai:DiVA.org:ltu-101305
DiVA, id: diva2:1796215
Conference
2023 International Joint Conference on Neural Networks, IJCNN 2023, Gold Coast, Australia, June 18-23, 2023
Available from: 2023-09-12 Created: 2023-09-12 Last updated: 2024-03-07 Bibliographically approved

Open Access in DiVA

No full text in DiVA

By author/editor
Adewumi, Tosin; Liwicki, Foteini; Liwicki, Marcus