Building open-domain conversational systems (or chatbots) that produce convincing responses is a recognized challenge. Recent state-of-the-art (SoTA) transformer-based models for the generation of natural language dialogue have demonstrated impressive performance in simulating human-like, single-turn conversations in English. This work investigates, through an empirical study, the potential for transfer learning of such models to the Swedish language. DialoGPT, a pre-trained English language model, is adapted by training on three different Swedish conversational datasets obtained from publicly available sources: Reddit, Familjeliv and the GDC. Perplexity (an automated intrinsic metric) and human-evaluation surveys were used to assess the performance of the fine-tuned models. We also compare the DialoGPT experiments with an attention-based seq2seq baseline model trained on the GDC dataset. The results indicate that the capacity for transfer learning can be exploited with considerable success. Human evaluators asked to score the simulated dialogues judged over 57% of the chatbot responses to be human-like for the model trained on the largest (Swedish) dataset. The work agrees with the hypothesis that deep monolingual models learn some abstractions that generalize across languages. We release the code, datasets and model checkpoints, and host the demos on the HuggingFace platform.
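As an illustration of the perplexity evaluation mentioned above, the following minimal sketch computes token-level perplexity for a causal language model checkpoint with the HuggingFace transformers library; the checkpoint name and the Swedish sentences are placeholders, not the released artifacts.

```python
# Minimal sketch: perplexity of a causal LM checkpoint on held-out dialogue text.
# The checkpoint and example sentences are placeholders for illustration.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "microsoft/DialoGPT-medium"  # substitute the fine-tuned Swedish checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()

texts = ["Hej, hur mår du?", "Bara bra, tack. Och du?"]  # held-out Swedish turns (toy examples)

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in texts:
        enc = tokenizer(text, return_tensors="pt")
        out = model(**enc, labels=enc["input_ids"])
        n_predicted = enc["input_ids"].shape[1] - 1  # loss is averaged over shifted targets
        total_nll += out.loss.item() * n_predicted
        total_tokens += n_predicted

print("perplexity:", math.exp(total_nll / total_tokens))
```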
We conduct relatively extensive investigations of automatic hate speech (HS) detection using different state-of-the-art (SoTA) baselines across 11 subtasks spanning six different datasets. Our motivation is to determine which of the recent SoTA models is best for automatic hate speech detection and what advantage methods such as data augmentation and ensembling may confer on the best model, if any. We carry out six cross-task investigations. We achieve new SoTA results on two subtasks: macro F1 scores of 91.73% and 53.21% for subtasks A and B of the HASOC 2020 dataset, surpassing previous SoTA scores of 51.52% and 26.52%, respectively. We achieve near-SoTA results on two others: macro F1 scores of 81.66% for subtask A of OLID 2019 and 82.54% for subtask A of HASOC 2021, compared to SoTA results of 82.9% and 83.05%, respectively. We perform error analysis and use two eXplainable Artificial Intelligence (XAI) algorithms, Integrated Gradients (IG) and SHapley Additive exPlanations (SHAP), to reveal, through examples, how two of the models (the Bi-Directional Long Short-Term Memory network (Bi-LSTM) and the Text-to-Text Transfer Transformer (T5)) make the predictions they do. Other contributions of this work are: (1) the introduction of a simple, novel mechanism for correcting Out-of-Class (OoC) predictions in T5, (2) a detailed description of the data augmentation methods, and (3) the revelation of the poor data annotations in the HASOC 2021 dataset through several examples and XAI (underscoring the need for better quality control). We publicly release our model checkpoints and code to foster transparency.
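The OoC correction mechanism itself is detailed in the paper; purely as an illustration, the sketch below shows one simple way such a correction could look: any string generated by a text-to-text model that is not an exact class label is mapped to the closest valid label. The label set and helper function are hypothetical.

```python
# Illustrative sketch (not the paper's exact mechanism): map out-of-class strings
# generated by a text-to-text model to the nearest valid label.
from difflib import get_close_matches

VALID_LABELS = ["hateful", "offensive", "neither"]  # example label set

def correct_ooc(prediction: str, default: str = "neither") -> str:
    """Return the prediction if it is a valid label, otherwise the closest valid label."""
    pred = prediction.strip().lower()
    if pred in VALID_LABELS:
        return pred
    match = get_close_matches(pred, VALID_LABELS, n=1, cutoff=0.0)
    return match[0] if match else default

print(correct_ooc("offens ive"))  # -> "offensive"
```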
We investigate five English NLP benchmark datasets (from the SuperGLUE leaderboard) and two Swedish datasets for bias, along multiple axes. The datasets are the following: Boolean Questions (BoolQ), CommitmentBank (CB), Winograd Schema Challenge (WSC), Winogender diagnostic (AX-g), Recognising Textual Entailment (RTE), Swedish CB, and SWEDN. Bias can be harmful, and it is known to be common in the data that ML models learn from. In order to mitigate bias in data, it is crucial to be able to estimate it objectively. We use bipol, a novel multi-axes bias metric with explainability, to estimate and explain how much bias exists in these datasets. Multilingual, multi-axes bias evaluation is not very common. Hence, we also contribute a new, large Swedish bias-labeled dataset (of 2 million samples), translated from the English version, and train the SoTA mT5 model on it. In addition, we contribute new multi-axes lexica for bias detection in Swedish. We make the code, model, and new dataset publicly available.
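As a hedged illustration of how English samples could be machine-translated into Swedish, the sketch below uses an off-the-shelf MT checkpoint; the model choice and example sentences are assumptions, not necessarily what was used to produce the contributed dataset.

```python
# Minimal sketch: machine-translating English samples to Swedish with an
# off-the-shelf MT model. The checkpoint and examples are assumptions.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-sv")

english_samples = [
    "All nurses are women.",                 # toy biased sample
    "The weather was pleasant yesterday.",   # toy unbiased sample
]
swedish_samples = [out["translation_text"] for out in translator(english_samples)]
print(swedish_samples)
```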
Cerebral palsy (CP) is a common cause of limited motor ability in humans, arising before birth, during infancy, or in early childhood. Poor head control is one of the most important problems in children with level IV and level V CP, and it can affect many aspects of a child's life. The current visual assessment method for measuring head control ability and cervical range of motion (CROM) lacks accuracy and reliability. In this paper, HeadUp, a system based on a low-cost, 9-axis inertial measurement unit (IMU), is proposed to capture and evaluate head control ability in children with CP. The proposed system wirelessly measures CROM in the frontal, sagittal, and transverse planes during ordinary life activities. The system is designed to provide real-time, bidirectional communication with an Euler-based sensor fusion algorithm (SFA) to estimate head orientation and track head control ability. The experimental results for the proposed SFA show high accuracy in noise reduction with a fast system response. The system was clinically tested on five typically developing children and five children with CP (age range: 2-5 years). The proposed HeadUp system can be implemented as a head control trainer that works in an entertaining way to motivate the child with CP to keep their head up.
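The paper's Euler-based SFA is not reproduced here; the following minimal complementary-filter sketch only illustrates the general idea of fusing gyroscope and accelerometer readings into roll and pitch estimates. The gain, units, and axis conventions are assumptions for illustration.

```python
# Illustrative complementary filter (not the paper's exact SFA): fuse gyroscope
# and accelerometer readings into roll and pitch angles, in degrees.
import math

def complementary_filter(roll, pitch, gyro, accel, dt, alpha=0.98):
    """gyro: (gx, gy, gz) in deg/s; accel: (ax, ay, az) in g; dt in seconds."""
    gx, gy, _ = gyro
    ax, ay, az = accel
    # Integrate angular rate (responsive short-term estimate, but drifts).
    roll_gyro = roll + gx * dt
    pitch_gyro = pitch + gy * dt
    # Tilt from gravity (drift-free long-term estimate, but noisy).
    roll_acc = math.degrees(math.atan2(ay, az))
    pitch_acc = math.degrees(math.atan2(-ax, math.sqrt(ay**2 + az**2)))
    # Blend the two estimates.
    roll = alpha * roll_gyro + (1 - alpha) * roll_acc
    pitch = alpha * pitch_gyro + (1 - alpha) * pitch_acc
    return roll, pitch
```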
In this paper, we first describe various synchronous and asynchronous methods for enhancing student engagement in large online courses. We showcase the implementation of these methods in the “Introduction to Artificial Intelligence (AI)” course at Luleå University of Technology, which has attracted around 500 students in each of its iterations (twice yearly, since 2019). We also show that these methods can be applied efficiently in terms of the teaching hours required. With the increase in digitization and student mobility, the demand for improved and personalized content delivery for distance education has also increased. This applies not only in the context of traditional undergraduate education, but also in the context of adult education and lifelong learning. This higher level of demand, however, introduces a challenge, especially as it is typically combined with a shortage of staff and the need for efficient education. The challenge is further amplified by the current pandemic situation, which has led to an even greater risk of student dropout. To mitigate this risk, as well as to meet the increased demand, we applied various methods for creating engaging interaction in our pedagogy based on Moore’s framework: learner-to-learner, learner-to-instructor, and learner-to-content engagement strategies. The main methods of this pedagogy are as follows: short, interactive videos; active discussions in topic-based forums; regular live sessions with group discussions; and the introduction of optional content at many points in the course to address different target groups. In this paper, we show how we originally designed and continuously improved the course without requiring more than 500 teaching hours per iteration (one hour per enrolled student), while we also managed to increase the successful completion rate of the participants by 10% and improved student engagement and feedback for the course by 50%. We intend to share a set of best practices applicable to many other e-learning courses in ICT.
We introduce bipol, a new metric with explainability, for estimating social bias in text data. Harmful bias is prevalent in many online sources of data that are used for training machine learning (ML) models. As a step toward addressing this challenge, we create a novel metric that involves a two-step process: corpus-level evaluation based on model classification and sentence-level evaluation based on (sensitive) term frequency (TF). After creating new models to classify bias using SoTA architectures, we evaluate two popular NLP datasets (COPA and SQuADv2) and the WinoBias dataset. As an additional contribution, we create a large English dataset (of almost 2 million labeled samples) for training models in bias classification and make it publicly available. We also make our code public.
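As a simplified illustration of the two-step idea (not the exact published bipol formula), the sketch below combines a corpus-level score from a bias classifier's flag rate with a sentence-level score based on the imbalance of sensitive-term frequencies along one axis; the lexicon and values are toy placeholders.

```python
# Simplified sketch of the two-step idea (not the exact bipol formula):
# step 1: corpus-level score from a bias classifier's flag rate;
# step 2: sentence-level score from sensitive-term frequency imbalance on one axis.
GENDER_AXIS = {"female": {"she", "her", "woman"}, "male": {"he", "his", "man"}}  # toy lexicon

def corpus_level(predictions):
    """Fraction of samples that a trained classifier flags as biased (0/1 predictions)."""
    return sum(predictions) / len(predictions) if predictions else 0.0

def sentence_level(sentence, axis=GENDER_AXIS):
    """Imbalance of term counts between the two poles of one axis, in [0, 1]."""
    tokens = sentence.lower().split()
    counts = [sum(t in pole for t in tokens) for pole in axis.values()]
    total = sum(counts)
    return (max(counts) - min(counts)) / total if total else 0.0

flagged = [1, 0, 1, 0]                        # toy classifier outputs
sentences = ["he is a man and she is busy"]   # toy flagged sentences
score = corpus_level(flagged) * sum(map(sentence_level, sentences)) / len(sentences)
print(score)
```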
Hateful content is published and spread on social media at an increasing rate, harming the user experience. In addition, hateful content targeting particular marginalized or vulnerable groups (e.g. homophobic/transphobic content) can cause even more harm to members of those groups. Hence, detecting hateful content is crucial, regardless of its origin or the language used. The large variety of (often under-resourced) languages used, however, makes this task daunting, especially as many users code-mix in their messages. To help overcome these difficulties, the approach we present here uses a multi-language framework. To further mitigate the scarcity of labelled data, it also leverages data from the related task of sentiment analysis to improve the detection of homophobic/transphobic content. We evaluated our system by participating in a sentiment analysis and hate speech detection challenge. Results show that our multi-task model outperforms its single-task counterpart (on average, by 24%) on the detection of homophobic/transphobic content. Moreover, the results achieved in detecting homophobic/transphobic content put our system in 1st or 2nd place for three out of the four languages examined.
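As a hedged sketch of the general multi-task idea (a shared encoder with one classification head per task), the code below pairs a hate-speech head with an auxiliary sentiment head; the encoder checkpoint and head sizes are assumptions, not the system described above.

```python
# Minimal sketch of a multi-task setup: shared encoder, one head per task.
# The encoder checkpoint and head sizes are assumptions for illustration.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiTaskClassifier(nn.Module):
    def __init__(self, encoder_name="xlm-roberta-base", n_hate=2, n_sentiment=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.hate_head = nn.Linear(hidden, n_hate)            # homophobia/transphobia detection
        self.sentiment_head = nn.Linear(hidden, n_sentiment)  # auxiliary sentiment task

    def forward(self, input_ids, attention_mask, task):
        pooled = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state[:, 0]
        head = self.hate_head if task == "hate" else self.sentiment_head
        return head(pooled)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = MultiTaskClassifier()
batch = tokenizer(["example message"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"], task="hate")
```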
We investigate the performance of a state-of-the-art (SoTA) architecture, T5 (ranked on the SuperGLUE leaderboard), and compare it with three previous SoTA architectures across five different tasks from two relatively diverse datasets. The datasets are diverse in terms of the number and types of tasks they contain. To improve performance, we augment the training data using a new autoregressive conversational AI model checkpoint. We achieve near-SoTA results on a couple of the tasks: macro F1 scores of 81.66% for task A of the OLID 2019 dataset and 82.54% for task A of the hate speech and offensive content (HASOC) 2021 dataset, where the SoTA results are 82.9% and 83.05%, respectively. We perform error analysis and explain why one of the models (Bi-LSTM) makes the predictions it does by using a publicly available algorithm, Integrated Gradients (IG), because explainable artificial intelligence (XAI) is essential for earning the trust of users. The main contributions of this work are the implementation method of T5, which is discussed; the data augmentation, which brought performance improvements; and the revelation of the shortcomings of the HASOC 2021 dataset. The revelation shows the difficulties of poor data annotation through a small set of examples where the T5 model made the correct predictions even when the ground truth of the test set was, in our opinion, incorrect. We also provide our model checkpoints on the HuggingFace hub: https://huggingface.co/sana-ngu/HaT5_augmentation and https://huggingface.co/sana-ngu/HaT5.
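As an illustration of how Integrated Gradients can be applied to a recurrent text classifier, the sketch below uses Captum's LayerIntegratedGradients over the embedding layer of a toy Bi-LSTM; the architecture, vocabulary size, and token ids are placeholders, not the paper's model.

```python
# Minimal sketch: Integrated Gradients over the embedding layer of a toy Bi-LSTM
# classifier (placeholder model, not the paper's). Requires the captum package.
import torch
import torch.nn as nn
from captum.attr import LayerIntegratedGradients

class ToyBiLSTM(nn.Module):
    def __init__(self, vocab_size=1000, emb=64, hidden=64, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, input_ids):
        out, _ = self.lstm(self.embedding(input_ids))
        return self.fc(out.mean(dim=1))  # mean-pool over time steps

model = ToyBiLSTM().eval()
input_ids = torch.tensor([[5, 42, 7, 99]])   # toy token ids
baseline = torch.zeros_like(input_ids)       # all-pad baseline

lig = LayerIntegratedGradients(model, model.embedding)
attributions = lig.attribute(input_ids, baselines=baseline, target=1)
token_scores = attributions.sum(dim=-1).squeeze(0)  # one attribution score per token
print(token_scores)
```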
Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, the complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and the limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AFRICOMET: COMET evaluation metrics for African languages, built by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).
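As a minimal illustration of the reported evaluation criterion, the sketch below computes the segment-level Spearman-rank correlation between a metric's scores and human DA judgments; the numbers are toy values, not results from the paper.

```python
# Minimal sketch: Spearman-rank correlation between a learned metric's scores and
# human direct-assessment (DA) judgments. The values below are toy numbers.
from scipy.stats import spearmanr

metric_scores = [0.71, 0.42, 0.88, 0.35, 0.60]  # e.g. metric outputs per segment
human_da = [78, 55, 90, 40, 65]                 # human DA ratings for the same segments

rho, p_value = spearmanr(metric_scores, human_da)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```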