This paper introduces a machine learning-based approach for detecting electric poles, an essential part of power grid maintenance. With the increasing popularity of deep learning, several such approaches have been proposed for electric pole detection. However, most of these approaches are supervised and require a large amount of labeled data, which is time-consuming and labor-intensive to produce. Unsupervised deep learning approaches have the potential to overcome the need for huge amounts of training data. This paper presents an unsupervised deep learning framework for utility pole detection. The framework combines a Convolutional Neural Network (CNN) and a clustering algorithm with a selection operation: the CNN extracts meaningful features from aerial imagery, the clustering algorithm generates pseudo labels for the resulting features, and the selection operation filters out reliable samples that are used to fine-tune the CNN further. The fine-tuned version then replaces the initial CNN model, thus improving the framework, and we iteratively repeat this process so that the model learns the prominent patterns in the data progressively. The presented framework is trained and tested on a small dataset of utility poles provided by “Mention Fuvex” (a Spanish company utilizing long-range drones for power line inspection). Our extensive experimentation demonstrates the progressive learning behavior of the proposed method and yields promising classification scores, with a significance test giving a p-value < 0.00005 on the utility pole dataset.
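The pseudo-labeling and selection steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 2-D points stand in for CNN features, plain k-means stands in for the unspecified clustering algorithm, and the distance threshold `max_dist` is a hypothetical selection criterion. The fine-tuning step is omitted.

```python
import math

def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    dists = [math.dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=dists.__getitem__)
    return idx, dists[idx]

def kmeans(points, centroids, iterations=10):
    """Plain k-means; the initial centroids are supplied by the caller."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)[0]].append(p)
        # Recompute each centroid as the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

def select_reliable(points, centroids, max_dist):
    """Keep (point, pseudo_label) pairs that lie within max_dist of their centroid."""
    selected = []
    for p in points:
        label, dist = nearest(p, centroids)
        if dist <= max_dist:
            selected.append((p, label))
    return selected

# Toy 2-D "features" standing in for CNN outputs (made-up values).
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (2.0, 2.4)]
centroids = kmeans(features, centroids=[features[0], features[3]])
reliable = select_reliable(features, centroids, max_dist=1.0)
print(reliable)  # the ambiguous point (2.0, 2.4) is filtered out
```

In the full framework, the selected samples and their pseudo labels would be used to fine-tune the CNN, after which features are re-extracted and the loop repeats.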
Massive wildfires, not only in Australia but worldwide, are burning millions of hectares of forests and green land, affecting the social, ecological, and economic situation. Widely used index-based threshold methods like the Normalized Burn Ratio (NBR) require a huge amount of data preprocessing and are specific to the data capturing source. State-of-the-art deep learning models, on the other hand, are supervised and require domain experts' knowledge for labeling the data in huge quantities. These limitations make the existing models difficult to adapt to new variations in the data and capturing sources. In this work, we propose an unsupervised deep learning based architecture to map the burnt regions of forests by learning features progressively. The model considers small patches of satellite imagery and classifies them into burnt and not burnt. These small patches are concatenated into binary masks to segment out the burnt region of the forests. The proposed system is composed of two modules: 1) a state-of-the-art deep learning architecture for feature extraction and 2) a clustering algorithm for the generation of pseudo labels to train the deep learning architecture. The proposed method is capable of learning the features progressively in an unsupervised fashion from the data with pseudo labels, reducing the exhausting effort of data labeling that requires expert knowledge. We used real-time Sentinel-2 data for training the model and mapping the burnt regions. The obtained F1-score of 0.87 demonstrates the effectiveness of the proposed model.
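As an illustration of how per-patch predictions can be assembled into a binary mask, here is a minimal sketch. The function name `patches_to_mask`, the square non-overlapping patches, and the 1 = burnt / 0 = not burnt encoding are assumptions made for the example, not details taken from the paper.

```python
def patches_to_mask(patch_labels, patch_size):
    """Expand a grid of per-patch labels (1 = burnt, 0 = not burnt) into a pixel mask.

    Assumes square, non-overlapping patches laid out row by row.
    """
    mask = []
    for label_row in patch_labels:
        # Repeat each patch label patch_size times horizontally...
        pixel_row = [label for label in label_row for _ in range(patch_size)]
        # ...and repeat the resulting row patch_size times vertically.
        mask.extend(list(pixel_row) for _ in range(patch_size))
    return mask

# A 2x2 grid of patch predictions expanded with 2x2-pixel patches.
example_mask = patches_to_mask([[1, 0], [0, 1]], patch_size=2)
for row in example_mask:
    print(row)
```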
Seagrass ecosystems are pivotal in marine environments, serving as crucial habitats for diverse marine species and contributing significantly to carbon sequestration. Accurate classification of seagrass species from underwater images is imperative for monitoring and preserving these ecosystems. This paper introduces Unsupervised Curriculum Learning (UCL) to seagrass classification using the DeepSeagrass dataset. UCL progressively learns from simpler to more complex examples, enhancing the model's ability to discern seagrass features in a curriculum-driven manner. Experiments employing state-of-the-art deep learning architectures, convolutional neural networks (CNNs), show that UCL achieved an overall precision of 90.12% and recall of 89%, which significantly improves classification accuracy and robustness, outperforming some traditional supervised learning approaches like SimCLR, and unsupervised approaches like Zero-shot CLIP. The methodology of UCL involves four main steps: high-dimensional feature extraction, pseudo-label generation through clustering, reliable sample selection, and fine-tuning the model. The iterative UCL framework refines the CNN's learning of underwater images, demonstrating superior accuracy, generalization, and adaptability to unseen seagrass and background samples of undersea images. The findings presented in this paper contribute to the advancement of seagrass classification techniques, providing valuable insights into the conservation and management of marine ecosystems. The code and dataset are made publicly available and can be accessed here: https://github.com/nabid69/Unsupervised-Curriculum-Learning—UCL.
Building open-domain conversational systems (or chatbots) that produce convincing responses is a recognized challenge. Recent state-of-the-art (SoTA) transformer-based models for the generation of natural language dialogue have demonstrated impressive performance in simulating human-like, single-turn conversations in English. This work investigates, by an empirical study, the potential for transfer learning of such models to the Swedish language. DialoGPT, an English language pre-trained model, is adapted by training on three different Swedish language conversational datasets obtained from publicly available sources: Reddit, Familjeliv and the GDC. Perplexity score (an automated intrinsic metric) and surveys by human evaluation were used to assess the performances of the fine-tuned models. We also compare the DialoGPT experiments with an attention-mechanism-based seq2seq baseline model, trained on the GDC dataset. The results indicate that the capacity for transfer learning can be exploited with considerable success. Human evaluators asked to score the simulated dialogues judged over 57% of the chatbot responses to be human-like for the model trained on the largest (Swedish) dataset. The work agrees with the hypothesis that deep monolingual models learn some abstractions which generalize across languages. We contribute the codes, datasets and model checkpoints and host the demos on the HuggingFace platform.
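For readers unfamiliar with the perplexity metric mentioned above, the following sketch computes it from per-token probabilities using the standard definition (the exponential of the average negative log-likelihood). The function name and the toy probabilities are illustrative assumptions; the paper's own evaluation code is not reproduced here.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability assigned to each token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token is, on average,
# as uncertain as a uniform choice among four options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower perplexity means the model assigns higher probability to the reference text.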
We reflect on the experiences in organizing and implementing a high-quality Blended Intensive Programme (BIP) as a joint international event. A BIP is a short programme that combines physical mobility with a virtual part. The 6-day event, titled “DigiHealth-AI: Practice, Research, Ethics, and Regulation”, was organized in collaboration with partners from five European nations and support from the EU’s ERASMUS+ programme in November 2023. We introduced a new learning method called ProCoT, involving large language models (LLMs), for preventing cheating by students in writing. We designed an online survey of key questions, which was conducted at the beginning and the end of the BIP. The highlights of the survey are as follows: By the end of the BIP, 84% of the respondents agreed that the intended learning outcomes (ILOs) were fulfilled, 100% strongly agreed that artificial intelligence (AI) benefits the healthcare sector, 62% disagree that they are concerned about AI potentially eliminating jobs in the healthcare sector (compared to 57% initially), 60% were concerned about their privacy when using AI, and 56% could identify, at least, two known sources of bias in AI systems (compared to only 43% prior to the BIP). A total of 541 votes were cast by 40 students, who were the respondents. The minimum and maximum numbers of students who answered any particular survey question at a given period are 25 and 40, respectively.
This essay discusses current research efforts in conversational systems from the philosophy of science point of view and evaluates some conversational systems research activities from the standpoint of naturalism philosophical theory. Conversational systems or chatbots have advanced over the decades and now have become mainstream applications. They are software that users can communicate with, using natural language. Particular attention is given to the Alime Chat conversational system, already in industrial use, and the related research. The competitive nature of systems in production is a result of different researchers and developers trying to produce new conversational systems that can outperform previous or state-of-the-art systems. Different factors affect the quality of the conversational systems produced, and how one system is assessed as being better than another is a function of objectivity and of the relevant experimental results. This essay examines the research practices from, among others, Longino’s view on objectivity and Popper’s stand on falsification. Furthermore, the need for qualitative and large datasets is emphasized. This is in addition to the importance of the peer-review process in scientific publishing, as a means of developing, validating, or rejecting theories, claims, or methodologies in the research community. In conclusion, open data and open scientific discussion fora should become more prominent over the mere publication-focused trend.
In this work, we show that the difference in performance of embeddings from differently sourced data for a given language can be due to other factors besides data size. Natural language processing (NLP) tasks usually perform better with embeddings from bigger corpora. However, broadness of covered domain and noise can play important roles. We evaluate embeddings based on two Swedish corpora: The Gigaword and Wikipedia, in analogy (intrinsic) tests and discover that the embeddings from the Wikipedia corpus generally outperform those from the Gigaword corpus, which is a bigger corpus. Downstream tests will be required to have a definite evaluation.
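An analogy (intrinsic) test of the kind referred to above is commonly scored with vector arithmetic and cosine similarity. The sketch below uses that common formulation with hand-made 3-D toy vectors; the words, the vector values, and the function names are illustrative assumptions, not data from the Swedish test set.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' by finding the word closest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates, as is customary.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Made-up toy embeddings: dimensions loosely mean (royalty, male, female).
toy_vectors = {
    "king": (1.0, 1.0, 0.0),
    "queen": (1.0, 0.0, 1.0),
    "man": (0.0, 1.0, 0.0),
    "woman": (0.0, 0.0, 1.0),
    "apple": (0.2, 0.9, 0.1),
}
print(solve_analogy(toy_vectors, "man", "king", "woman"))
```

An analogy test set score is then the fraction of such questions for which the returned word matches the expected answer.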
In this paper, we show that embeddings from relatively smaller corpora sometimes outperform those from larger corpora and we introduce a new Swedish analogy test set and make it publicly available. To achieve good performance in Natural Language Processing (NLP) downstream tasks, several factors play important roles: dataset size, the right hyper-parameters, and well-trained embeddings. We utilize the fastText tool for our experiments. We evaluate both the Swedish and English embeddings that we created using intrinsic evaluation (including analogy & Spearman correlation) and compare them with 2 common, publicly available embeddings. Our English continuous Bag-of-Words (CBoW)-negative sampling embedding shows better performance compared to the publicly available GoogleNews version. We also describe the relationship between NLP and cognitive science. We contribute the embeddings for research or other useful purposes by publicly releasing them.
In this paper, our main contributions are that embeddings from relatively smaller corpora can outperform ones from far larger corpora and we present the new Swedish analogy test set. To achieve a good network performance in natural language processing (NLP) downstream tasks, several factors play important roles: dataset size, the right hyper-parameters, and well-trained embeddings. We show that, with the right set of hyper-parameters, good network performance can be reached even on smaller datasets. We evaluate the embeddings at the intrinsic level and extrinsic level, by deploying them on the Transformer in named entity recognition (NER) task and conduct significance tests. This is done for both Swedish and English. We obtain better performance in both languages on the downstream task with far smaller training data, compared to recently released, common crawl versions; and character n-grams appear useful for Swedish, a morphologically rich language.
In this study, we demonstrate that an open-domain conversational system trained on idioms or figurative language generates more fitting responses to prompts containing idioms. Idioms are a part of everyday speech in many languages and across many cultures, but they pose a great challenge for many natural language processing (NLP) systems that involve tasks such as information retrieval (IR), machine translation (MT), and conversational artificial intelligence (AI). We utilized the Potential Idiomatic Expression (PIE)-English idiom corpus for the two tasks that we investigated: classification and conversation generation. We achieved a state-of-the-art (SoTA) result of a 98% macro F1 score on the classification task by using the SoTA T5 model. We experimented with three instances of the SoTA dialogue model—the Dialogue Generative Pre-trained Transformer (DialoGPT)—for conversation generation. Their performances were evaluated by using the automatic metric, perplexity, and a human evaluation. The results showed that the model trained on the idiom corpus generated more fitting responses to prompts containing idioms 71.9% of the time in comparison with a similar model that was not trained on the idiom corpus. We have contributed the model checkpoint/demo/code to the HuggingFace hub for public access.
Word2Vec is a prominent model for natural language processing tasks. Similar inspiration is found in distributed embeddings (word-vectors) in recent state-of-the-art deep neural networks. However, wrong combination of hyperparameters can produce embeddings with poor quality. The objective of this work is to empirically show that Word2Vec optimal combination of hyper-parameters exists and evaluate various combinations. We compare them with the publicly released, original Word2Vec embedding. Both intrinsic and extrinsic (downstream) evaluations are carried out, including named entity recognition and sentiment analysis. Our main contributions include showing that the best model is usually task-specific, high analogy scores do not necessarily correlate positively with F1 scores, and performance is not dependent on data size alone. If ethical considerations to save time, energy, and the environment are made, then relatively smaller corpora may do just as well or even better in some cases. Increasing the dimension size of embeddings after a point leads to poor quality or performance. In addition, using a relatively small corpus, we obtain better WordSim scores, corresponding Spearman correlation, and better downstream performances (with significance tests) compared to the original model, which is trained on a 100 billion-word corpus.
Word2Vec is a prominent model for natural language processing (NLP) tasks. Similar inspiration is found in distributed embeddings for new state-of-the-art (SotA) deep neural networks. However, a wrong combination of hyper-parameters can produce poor quality vectors. The objective of this work is to empirically show that an optimal combination of hyper-parameters exists and to evaluate various combinations. We compare them with the released, pre-trained original word2vec model. Both intrinsic and extrinsic (downstream) evaluations, including named entity recognition (NER) and sentiment analysis (SA), were carried out. The downstream tasks reveal that the best model is usually task-specific, that high analogy scores do not necessarily correlate positively with F1 scores, and that the same applies to a focus on data alone. Increasing vector dimension size after a point leads to poor quality or performance. If ethical considerations to save time, energy and the environment are made, then reasonably smaller corpora may do just as well or even better in some cases. Besides, using a small corpus, we obtain better human-assigned WordSim scores, corresponding Spearman correlation and better downstream performances (with significance tests) compared to the original model, trained on a 100 billion-word corpus.
In this paper, the authors propose to increase the efficiency of blockchain mining by using a population-based approach. Blockchain relies on solving difficult mathematical problems as proof-of-work within a network before blocks are added to the chain. The brute force approach, advocated by some as the fastest algorithm for solving partial hash collisions and implemented in the Bitcoin blockchain, implies exhaustive, sequential search. It involves incrementing the nonce (number) of the header by one, then taking a double SHA-256 hash at each instance and comparing it with a target value to ascertain whether it is lower than that target. It consumes excessive time and power. In this paper, the authors, therefore, suggest using an inner for-loop for the population-based approach. Comparison shows that it is a slightly faster approach than brute force, with an average speed advantage of about 1.67% or 3,420 iterations per second, and that it performs better 73% of the time. Also, we observed that the more particles deployed in total, the better the performance, up to a pivotal point. Furthermore, we recommend taming the excessive use of power by networks, like Bitcoin's, through a penalty by consensus.
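The brute force baseline described above (increment the nonce, take a double SHA-256 hash, compare against a target) can be sketched as follows. This is a simplified illustration, not Bitcoin's exact block-header format: the header prefix is an arbitrary byte string, the nonce is appended as 4 little-endian bytes, and the target is deliberately easy so that the loop finishes quickly. The population-based variant proposed by the authors is not shown.

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    """Apply SHA-256 twice, as in Bitcoin's proof-of-work."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def brute_force_nonce(header_prefix: bytes, target: int, max_nonce: int = 2**32):
    """Sequentially try nonces until the hash, read as an integer, is below target."""
    for nonce in range(max_nonce):
        candidate = header_prefix + nonce.to_bytes(4, "little")
        # Bitcoin interprets the digest as a little-endian 256-bit integer.
        value = int.from_bytes(double_sha256(candidate), "little")
        if value < target:
            return nonce, value
    return None  # no nonce in the range satisfies the target

# Hypothetical header bytes and an easy target (roughly 1 in 256 hashes qualifies).
easy_target = 2**248
result = brute_force_nonce(b"example-block-header", easy_target)
print(result)
```

Lowering the target makes qualifying hashes rarer, which is why real mining requires an enormous number of iterations.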
We conduct relatively extensive investigations of automatic hate speech (HS) detection using different State-of-The-Art (SoTA) baselines across 11 subtasks spanning six different datasets. Our motivation is to determine which of the recent SoTA models is best for automatic hate speech detection and what advantage methods, such as data augmentation and ensemble, may have on the best model, if any. We carry out six cross-task investigations. We achieve new SoTA results on two subtasks—macro F1 scores of 91.73% and 53.21% for subtasks A and B of the HASOC 2020 dataset, surpassing previous SoTA scores of 51.52% and 26.52%, respectively. We achieve near-SoTA results on two others—macro F1 scores of 81.66% for subtask A of the OLID 2019 and 82.54% for subtask A of the HASOC 2021, in comparison to SoTA results of 82.9% and 83.05%, respectively. We perform error analysis and use two eXplainable Artificial Intelligence (XAI) algorithms (Integrated Gradient (IG) and SHapley Additive exPlanations (SHAP)) to reveal how two of the models (Bi-Directional Long Short-Term Memory Network (Bi-LSTM) and Text-to-Text-Transfer Transformer (T5)) make the predictions they do by using examples. Other contributions of this work are: (1) the introduction of a simple, novel mechanism for correcting Out-of-Class (OoC) predictions in T5, (2) a detailed description of the data augmentation methods, and (3) the revelation of the poor data annotations in the HASOC 2021 dataset by using several examples and XAI (buttressing the need for better quality control). We publicly release our model checkpoints and codes to foster transparency.
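Since the results above are reported as macro F1 scores, a short sketch of that metric may help: F1 is computed per class and then averaged with equal weight, so minority classes count as much as majority ones. The labels and predictions below are made-up toy values, and the function is an illustrative stand-in rather than the evaluation code used in the study.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

toy_true = ["hate", "hate", "not", "not", "not", "not"]
toy_pred = ["hate", "not", "not", "not", "not", "hate"]
print(macro_f1(toy_true, toy_pred))  # per-class F1: hate 0.5, not 0.75
```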
Dialogue generation is an important NLP task fraught with many challenges. The challenges become more daunting for low-resource African languages. To enable the creation of dialogue agents for African languages, we contribute the first high-quality dialogue datasets for 6 African languages: Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda & Yorùbá. There are a total of 9,000 turns, each language having 1,500 turns, which we translate from a portion of the English multi-domain MultiWOZ dataset. Subsequently, we benchmark by investigating & analyzing the effectiveness of modelling through transfer learning by utilizing state-of-the-art (SoTA) deep monolingual models: DialoGPT and BlenderBot. We compare the models with a simple seq2seq baseline using perplexity. Besides this, we conduct human evaluation of single-turn conversations by using majority votes and measure inter-annotator agreement (IAA). We find that the hypothesis that deep monolingual models learn some abstractions that generalize across languages holds. We observe human-like conversations, to different degrees, in 5 out of the 6 languages. The language with the most transferable properties is Nigerian Pidgin English, with a human-likeness score of 78.1%, of which 34.4% are unanimous. We freely provide the datasets and host the model checkpoints/demos on the HuggingFace hub for public access.
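The majority-vote scoring mentioned above can be sketched as follows. The number of annotators per response, the 1 = human-like / 0 = not human-like encoding, and the choice to report the unanimous share as a fraction of all responses are assumptions made for this illustration; the abstract does not specify them.

```python
def human_likeness(votes_per_response):
    """Return (majority share, unanimous share) of responses judged human-like.

    Each inner list holds one annotator vote per entry: 1 = human-like, 0 = not.
    Both shares are fractions of all responses (an assumption of this sketch).
    """
    total = len(votes_per_response)
    majority = sum(1 for votes in votes_per_response if sum(votes) * 2 > len(votes))
    unanimous = sum(1 for votes in votes_per_response if all(votes))
    return majority / total, unanimous / total

# Made-up votes from three hypothetical annotators on four responses.
toy_votes = [[1, 1, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
print(human_likeness(toy_votes))
```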
This paper describes the system used by the Machine Learning Group of LTU in subtask 1 of the SemEval-2022 Task 4: Patronizing and Condescending Language (PCL) Detection. Our system consists of finetuning a pretrained text-to-text transfer transformer (T5) and innovatively reducing its out-of-class predictions. The main contributions of this paper are 1) the description of the implementation details of the T5 model we used, 2) analysis of the successes & struggles of the model in this task, and 3) ablation studies beyond the official submission to ascertain the relative importance of data split. Our model achieves an F1 score of 0.5452 on the official test set.
We survey SoTA open-domain conversational AI models with the objective of presenting the prevailing challenges that still exist to spur future research. In addition, we provide statistics on the gender of conversational AI in order to guide the ethics discussion surrounding the issue. Open-domain conversational AI models are known to have several challenges, including bland, repetitive responses and performance degradation when prompted with figurative language, among others. First, we provide some background by discussing some topics of interest in conversational AI. We then discuss the method applied to the two investigations carried out that make up this study. The first investigation involves a search for recent SoTA open-domain conversational AI models, while the second involves the search for 100 conversational AI to assess their gender. Results of the survey show that progress has been made with recent SoTA conversational AI, but there are still persistent challenges that need to be solved, and the female gender is more common than the male for conversational AI. One main takeaway is that hybrid models of conversational AI offer more advantages than any single architecture. The key contributions of this survey are (1) the identification of prevailing challenges in SoTA open-domain conversational AI, (2) the rarely held discussion on open-domain conversational AI for low-resource languages, and (3) the discussion about the ethics surrounding the gender of conversational AI.
The major contributions of this work include the empirical establishment of a better performance for Yoruba embeddings from an undiacritized (normalized) dataset and the provision of new analogy sets for evaluation. The Yoruba language, being a tonal language, utilizes diacritics (tonal marks) in written form. We show that this affects embedding performance by creating embeddings from exactly the same Wikipedia dataset but with the second one normalized to be undiacritized. We further compare average intrinsic performance with two other works (using analogy test set & WordSim) and we obtain the best performance in WordSim and corresponding Spearman correlation.
Open-domain chatbots have advanced but still have many gaps. My PhD aims to solve a few of those gaps by creating vector representations of idioms (figures of speech) that will be beneficial to chatbots and natural language processing (NLP), generally. In the process, new, optimal fastText embeddings in Swedish and English have been created and the first Swedish analogy test set, larger than the Google original, for intrinsic evaluation of Swedish embeddings has also been produced. Major milestones have been attained and others are soon to follow. The deliverables of this project will give NLP researchers the opportunity to measure the quality of Swedish embeddings easily and advance state-of-the-art (SotA) in NLP.
We investigate five English NLP benchmark datasets (on the superGLUE leaderboard) and two Swedish datasets for bias, along multiple axes. The datasets are the following: Boolean Question (Boolq), CommitmentBank (CB), Winograd Schema Challenge (WSC), Winogender diagnostic (AXg), Recognising Textual Entailment (RTE), Swedish CB, and SWEDN. Bias can be harmful and it is known to be common in data, which ML models learn from. In order to mitigate bias in data, it is crucial to be able to estimate it objectively. We use bipol, a novel multi-axes bias metric with explainability, to estimate and explain how much bias exists in these datasets. Multilingual, multi-axes bias evaluation is not very common. Hence, we also contribute a new, large Swedish bias-labeled dataset (of 2 million samples), translated from the English version and train the SotA mT5 model on it. In addition, we contribute new multi-axes lexica for bias detection in Swedish. We make the codes, model, and new dataset publicly available.
We present a fairly large Potential Idiomatic Expression (PIE) dataset for Natural Language Processing (NLP) in English. The challenges with NLP systems with regard to tasks such as Machine Translation (MT), word sense disambiguation (WSD) and information retrieval make it imperative to have a labelled idioms dataset with classes, such as the one in this work. To the best of the authors’ knowledge, this is the first idioms corpus with classes of idioms beyond the literal and the general idioms classification. In particular, the following classes are labelled in the dataset: metaphor, simile, euphemism, parallelism, personification, oxymoron, paradox, hyperbole, irony and literal. We obtain an overall inter-annotator agreement (IAA) score, between two independent annotators, of 88.89%. Many past efforts have been limited in the corpus size and classes of samples but this dataset contains over 20,100 samples with almost 1,200 cases of idioms (with their meanings) from 10 classes (or senses). The corpus may also be extended by researchers to meet specific needs. The corpus has part of speech (PoS) tagging from the NLTK library. Classification experiments performed on the corpus to obtain a baseline and comparison among three common models, including the state-of-the-art (SoTA) BERT model, give good results. We also make publicly available the corpus and the relevant codes for working with it for NLP tasks.
This article describes analytical work carried out in a pilot project for the Swedish Space Data Lab (SSDL), which focused on monitoring drought in the Mälardalen region in central Sweden. The Normalized Difference Vegetation Index (NDVI) and the Moisture Stress Index (MSI), both commonly used to analyse drought, are estimated from Sentinel-2 satellite data and averaged over a selection of seven grassland areas of interest. To derive a complete time-series over a season that interpolates over days with missing data, we use Gaussian Process Regression, a technique from multivariate Bayesian analysis. The analysis shows significant differences at 95% confidence for five out of seven areas when comparing the peak drought period in the dry year 2018 with the corresponding period in 2019. A cross-validation analysis indicates that the model parameter estimates are robust for temporal covariance structure (while inconclusive for the spatial dimensions). There were no signs of over-fitting when comparing in-sample and out-of-sample RMSE.
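The two indices named above have standard definitions: NDVI = (NIR - Red) / (NIR + Red) and MSI = SWIR / NIR. The sketch below computes them per pixel and averages over an area of interest. The reflectance values are made up, and the pairing of reflectances into tuples is an assumption of this example; the Gaussian Process Regression step is not shown.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)

def msi(swir, nir):
    """Moisture Stress Index: SWIR / NIR (higher values indicate more water stress)."""
    return swir / nir

def area_mean(values):
    """Average an index over all pixels of one area of interest."""
    return sum(values) / len(values)

# Made-up surface reflectances for three pixels: (red, nir, swir).
pixels = [(0.10, 0.50, 0.20), (0.08, 0.40, 0.20), (0.12, 0.36, 0.27)]
mean_ndvi = area_mean([ndvi(nir, red) for red, nir, _ in pixels])
mean_msi = area_mean([msi(swir, nir) for _, nir, swir in pixels])
print(round(mean_ndvi, 3), round(mean_msi, 3))
```

One such pair of area means per acquisition date gives the time-series that is then interpolated over days with missing data.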
Optical Character Recognition (OCR) of cursive scripts like Pashto and Urdu is difficult due to the presence of complex ligatures and connected writing styles. In this paper, we evaluate and compare different approaches for the recognition of such complex ligatures. The approaches include the Hidden Markov Model (HMM), the Long Short-Term Memory (LSTM) network and the Scale Invariant Feature Transform (SIFT). The current state of the art in cursive script recognition assumes constant scale without any rotation, while real-world data contain rotation and scale variations. This research aims to evaluate the performance of sequence classifiers like HMM and LSTM and compare their performance with a descriptor-based classifier like SIFT. In addition, we also assess the performance of these methods against the scale and rotation variations in cursive script ligatures. Moreover, we introduce a database of 480,000 images containing 1000 unique ligatures or sub-words of Pashto. In this database, each ligature has 40 scale and 12 rotation variations. The evaluation results show a significantly improved performance of LSTM over HMM and a traditional feature extraction technique such as SIFT.
Atomic segmentation of cursive scripts into constituent characters is one of the most challenging problems in pattern recognition. To avoid segmentation in cursive script, concrete shapes are considered as recognizable units. Therefore, the objective of this work is to find the alternative recognizable units in Pashto cursive script. These alternatives are ligatures and primary ligatures. However, we need sound statistical analysis to find the appropriate numbers of ligatures and primary ligatures in Pashto script. In this work, a corpus of 2,313,736 Pashto words is extracted from large-scale, diversified web sources, and a total of 19,268 unique ligatures have been identified in Pashto cursive script. Analysis shows that only 7,000 ligatures represent 91% of the overall corpus of unique Pashto words. Similarly, about 7,681 primary ligatures are also identified, which represent the basic shapes of all the ligatures.
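The kind of coverage statistic reported above (how many of the most frequent ligatures are needed to cover a given share of the corpus) can be computed with a cumulative frequency count. The sketch below is illustrative only: the function name is hypothetical, and single letters stand in for Pashto ligatures.

```python
from collections import Counter

def ligatures_needed(ligature_tokens, coverage):
    """Smallest number of most-frequent ligatures whose occurrences reach the coverage share."""
    counts = Counter(ligature_tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= coverage:
            return rank
    return len(counts)

# Toy corpus: "a" occurs 6 times, "b" 3 times, "c" once.
toy_tokens = ["a"] * 6 + ["b"] * 3 + ["c"]
print(ligatures_needed(toy_tokens, 0.9))  # "a" and "b" together cover 9 of 10 tokens
```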
This paper presents a deep learning benchmark on a complex dataset known as KFUPM Handwritten Arabic TexT (KHATT). The KHATT dataset consists of complex patterns of handwritten Arabic text-lines. This paper contributes mainly in three aspects, i.e., (1) pre-processing, (2) a deep learning based approach, and (3) data augmentation. The pre-processing step includes pruning of extra white spaces and de-skewing the skewed text-lines. We deploy a deep learning approach based on Multi-Dimensional Long Short-Term Memory (MDLSTM) networks and Connectionist Temporal Classification (CTC). The MDLSTM has the advantage of scanning the Arabic text-lines in all directions (horizontal and vertical) to cover dots, diacritics, strokes and fine inflammation. Data augmentation combined with the deep learning approach improves the results, achieving 80.02% Character Recognition (CR) against a baseline of 75.08%.
Recent progress in deep learning has led to accurate and efficient generic object detection networks. Training of highly reliable models depends on large datasets with highly textured and rich images. However, in real-world scenarios, the performance of the generic object detection system decreases when (i) occlusions hide the objects, (ii) objects are present in low-light images, or (iii) they are merged with background information. In this paper, we refer to all these situations as challenging environments. With the recent rapid development in generic object detection algorithms, notable progress has been observed in the field of deep learning-based object detection in challenging environments. However, there is no consolidated reference to cover the state of the art in this domain. To the best of our knowledge, this paper presents the first comprehensive overview, covering recent approaches that have tackled the problem of object detection in challenging environments. Furthermore, we present a quantitative and qualitative performance analysis of these approaches and discuss the currently available challenging datasets. Moreover, this paper investigates the performance of current state-of-the-art generic object detection algorithms by benchmarking results on the three well-known challenging datasets. Finally, we highlight several current shortcomings and outline future directions.
In this paper, we first describe various synchronous and asynchronous methods for enhancing student engagement in big online courses. We showcase the implementation of these methods in the “Introduction to Artificial Intelligence (AI)” course at Luleå University of Technology, which has attracted around 500 students in each of its iterations (twice yearly, since 2019). We also show that these methods can be applied efficiently, in terms of the teaching hours required. With the increase in digitization and student mobility, the demand for improved and personalized content delivery for distance education has also increased. This applies not only in the context of traditional undergraduate education, but also in the context of adult education and lifelong learning. This higher level of demand, however, introduces a challenge, especially as it is typically combined with a shortage of staff and a need for efficient education. This challenge is further amplified by the current pandemic situation, which led to an even bigger risk of student dropout. To mitigate this risk, as well as to meet the increased demand, we applied various methods for creating engaging interaction in our pedagogy based on Moore’s framework: learner-to-learner, learner-to-instructor, and learner-to-content engagement strategies. The main methods of this pedagogy are as follows: short and interactive videos, active discussions in topic-based forums, regular live sessions with group discussions, and the introduction of optional content at many points in the course, to address different target groups. In this paper, we show how we originally designed and continuously improved the course, without requiring more than 500 teaching hours per iteration (one hour per enrolled student), while we also managed to increase the successful completion rate of the participants by 10%, and improved student engagement and feedback for the course by 50%.
We intend to share a set of best-practices applicable to many other e-learning courses in ICT.
We introduce DeepDIVA: an infrastructure designed to enable quick and intuitive setup of reproducible experiments with a wide range of useful analysis functionality. Reproducing scientific results can be a frustrating experience, not only in document image analysis but in machine learning in general. Using DeepDIVA, a researcher can either reproduce a given experiment or share their own experiments with others. Moreover, the framework offers a wide range of functions, such as boilerplate code, experiment tracking, hyper-parameter optimization, and visualization of data and results. To demonstrate the effectiveness of this framework, this paper presents case studies in the area of handwritten document analysis where researchers benefit from the integrated functionality. DeepDIVA is implemented in Python and uses the deep learning framework PyTorch. It is completely open source and accessible as a Web Service through DIVAServices.
In this work, we introduce a new architectural component for neural networks (NNs): trainable and spectrally initializable matrix transformations on feature maps. While previous literature has already demonstrated the possibility of adding static spectral transformations as feature processors, our focus is on more general trainable transforms. We study the transforms in various architectural configurations on four datasets of different nature: from medical (ColorectalHist, HAM10000) and natural (Flowers) images to historical documents (CB55). With rigorous experiments that control for the number of parameters and randomness, we show that networks utilizing the introduced matrix transformations outperform vanilla neural networks. The observed accuracy increases appreciably across all datasets. In addition, we show that spectral initialization leads to significantly faster convergence than random initialization of the matrix transformations. The transformations are implemented as auto-differentiable PyTorch modules that can be incorporated into any neural network architecture. The entire code base is open source.
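To make the idea of spectral initialization concrete, the following is a minimal, framework-free sketch and not the authors' implementation. It assumes the discrete cosine transform (DCT-II) as the spectral basis and a square, single-channel feature map; the helper names `dct_matrix`, `matmul`, `transpose`, and `transform` are hypothetical. In a trainable setting, the matrix returned by `dct_matrix` would only serve as the initial value of a learnable weight.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis of size n x n (assumed spectral initialization)."""
    rows = []
    for k in range(n):
        # The first row gets a different scale so that the matrix is orthonormal.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append(
            [scale * math.cos(math.pi * (2 * j + 1) * k / (2 * n)) for j in range(n)]
        )
    return rows


def transpose(m):
    return [list(r) for r in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transform(feature_map, weight):
    """Apply the 2D matrix transformation W X W^T to a square feature map X."""
    return matmul(matmul(weight, feature_map), transpose(weight))


# Hypothetical usage: a constant 4x4 feature map keeps only its DC coefficient.
weight = dct_matrix(4)
constant_map = [[1.0] * 4 for _ in range(4)]
spectrum = transform(constant_map, weight)
```

Because the initial matrix is orthonormal, the transformation is invertible at the start of training, which is one plausible reason to prefer it over a random starting point; during training the weight would be free to move away from the exact spectral basis.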
The field of deep learning is experiencing a trend towards producing reproducible research. Nevertheless, it is still often a frustrating experience to reproduce scientific results. This is especially true in the machine learning community, where it is considered acceptable to have black boxes in experiments. We present DeepDIVA, a framework designed to facilitate easy experimentation and reproduction of experiments. This framework allows researchers to share their experiments with others, while providing functionality that allows for easy experimentation, such as boilerplate code, experiment management, hyper-parameter optimization, verification of data integrity, and visualization of data and results. Additionally, the code of DeepDIVA is well documented and supported by several tutorials that allow a new user to quickly become familiar with the framework.
We propose a novel approach to adversarial attacks on neural networks (NNs), focusing on tampering with the data used for training instead of generating attacks on trained models. Our network-agnostic method creates a backdoor during training which can be exploited at test time to force a neural network to exhibit abnormal behaviour. We demonstrate on two widely used datasets (CIFAR-10 and SVHN) that a universal modification of just one pixel per image for all the images of a class in the training set is enough to corrupt the training procedure of several state-of-the-art deep neural networks, causing the networks to misclassify any image to which the modification is applied. Our aim is to bring to the attention of the machine learning community the possibility that even learning-based methods that are personally trained on public datasets can be subject to attacks by a skillful adversary.
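As an illustration only, the sketch below shows what a universal one-pixel modification of all training images of one class could look like. The grayscale nested-list image format, the pixel position, the pixel value, and the function names `poison_class` and `apply_trigger` are assumptions made for this example and are not taken from the paper.

```python
import copy


def apply_trigger(image, pixel=(0, 0), value=255):
    """Return a copy of one image with a single pixel overwritten (assumed trigger)."""
    row, col = pixel
    modified = copy.deepcopy(image)
    modified[row][col] = value
    return modified


def poison_class(images, labels, target_class, pixel=(0, 0), value=255):
    """Apply the same one-pixel trigger to every training image of `target_class`.

    Images of other classes are copied unchanged, and the input list is not mutated.
    """
    return [
        apply_trigger(img, pixel, value) if label == target_class else copy.deepcopy(img)
        for img, label in zip(images, labels)
    ]


# Hypothetical toy data: three 2x2 grayscale images with class labels.
images = [[[0, 0], [0, 0]], [[10, 10], [10, 10]], [[20, 20], [20, 20]]]
labels = [0, 1, 0]
poisoned = poison_class(images, labels, target_class=0, pixel=(1, 1), value=255)
```

At test time, the same `apply_trigger` call on an arbitrary image would reproduce the pattern that a network trained on the poisoned set has learned to associate with the target class.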
This paper introduces a new method for text-line extraction that integrates deep-learning-based pre-classification with state-of-the-art segmentation methods. Text-line extraction in complex handwritten documents poses a significant challenge, even to the most modern computer vision algorithms. Historical manuscripts are a particularly hard class of documents as they present several forms of noise, such as degradation, bleed-through, interlinear glosses, and elaborate scripts. In this work, we propose a novel method which uses semantic segmentation at the pixel level as an intermediate task, followed by a text-line extraction step. We measured the performance of our method on a recent dataset of challenging medieval manuscripts and surpassed state-of-the-art results by reducing the error by 80.7%. Furthermore, we demonstrate the effectiveness of our approach on various other datasets written in different scripts. Hence, our contribution is two-fold. First, we demonstrate that semantic pixel segmentation can be used as a strong denoising pre-processing step before performing text-line extraction. Second, we introduce a novel, simple, and robust algorithm that leverages the high-quality semantic segmentation to achieve a text-line extraction performance of 99.42% line IU on a challenging dataset.
Recent advances in Deep Learning have led to a significant performance increase on several NLP tasks; however, the models are becoming more and more computationally demanding. Therefore, this paper tackles the domain of computationally efficient algorithms for NLP tasks. In particular, it investigates distributed representations of n-gram statistics of texts. The representations are formed using an embedding enabled by hyperdimensional computing. These representations then serve as features, which are used as input to standard classifiers. We investigate the applicability of the embedding on one large and three small standard datasets for classification tasks using nine classifiers. The embedding achieved on-par F1 scores while decreasing the time and memory requirements severalfold compared to conventional n-gram statistics; e.g., for one of the classifiers on a small dataset, the memory reduction was 6.18 times, while train and test speed-ups were 4.62 and 3.84 times, respectively. For many classifiers on the large dataset, memory reduction was ca. 100 times and train and test speed-ups were over 100 times. Importantly, the use of distributed representations formed via hyperdimensional computing removes the strict dependency between the dimensionality of the representation and the n-gram size, thus opening room for tradeoffs.
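The following sketch illustrates one common way to embed n-gram statistics with hyperdimensional computing; it is a simplified example under stated assumptions, not the paper's exact procedure. It assumes random bipolar symbol vectors, cyclic shifts to encode position, element-wise multiplication to bind the symbols of an n-gram, and summation to bundle all n-grams of a text. The names `make_item_memory`, `rotate`, `embed_ngrams`, and the dimensionality `DIM` are hypothetical.

```python
import random

DIM = 1000  # assumed dimensionality of the distributed representation


def make_item_memory(alphabet, dim=DIM, seed=0):
    """Assign a random bipolar (+1/-1) vector to every symbol of the alphabet."""
    rng = random.Random(seed)
    return {ch: [rng.choice((-1, 1)) for _ in range(dim)] for ch in alphabet}


def rotate(vec, shift):
    """Cyclic shift, used here to encode a symbol's position inside an n-gram."""
    shift %= len(vec)
    return vec[-shift:] + vec[:-shift] if shift else list(vec)


def embed_ngrams(text, n, item_memory):
    """Bundle (sum) the bound vectors of all character n-grams in `text`."""
    dim = len(next(iter(item_memory.values())))
    total = [0] * dim
    for i in range(len(text) - n + 1):
        bound = [1] * dim
        for pos, ch in enumerate(text[i:i + n]):
            shifted = rotate(item_memory[ch], pos)
            bound = [b * s for b, s in zip(bound, shifted)]  # binding
        total = [t + b for t, b in zip(total, bound)]  # bundling
    return total


# Hypothetical usage: the embedding size stays DIM for bigrams and trigrams alike.
memory = make_item_memory("abcdefghijklmnopqrstuvwxyz ")
bigram_features = embed_ngrams("hello world", 2, memory)
trigram_features = embed_ngrams("hello world", 3, memory)
```

With this 27-symbol alphabet, an explicit trigram count vector would need 27^3 = 19,683 entries, whereas the embedding above keeps `DIM` entries for any n, which is the decoupling between representation size and n-gram size that the abstract refers to.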
The contribution of this paper is a new strategy for integrating the recognition outputs of multiple, diverse recognizers. Such an integration can yield higher performance and more accurate outputs than a single recognition system. The difficulty in aligning the results of various Optical Character Recognition (OCR) systems lies in finding the correspondence at the character, word, line, and page levels. These difficulties arise from the segmentation and recognition errors produced by the OCR systems. Therefore, alignment techniques are required to synchronize the outputs so that they can be compared. Most existing approaches fail when the same error occurs in multiple OCR systems. If the correction does not appear in at least one of the OCR outputs, these approaches are unable to improve the results. We design a Line-to-Page alignment with edit rules using Weighted Finite-State Transducers (WFST). These edit rules are based on the edit operations insertion, deletion, and substitution. In addition, we design an approach that uses Recurrent Neural Networks with Long Short-Term Memory (LSTM) to predict these types of errors. A Character-Epsilon alignment is designed to normalize the size of the strings for the LSTM alignment. The LSTM provides the best voting result, especially when the heuristic approaches are unable to vote among the various OCR engines. The LSTM predicts the correct characters even if the OCR engines could not produce them in their outputs. The approaches are evaluated on OCR output for the UWIII and historical German Fraktur datasets, obtained from state-of-the-art OCR systems. The experiments show that the LSTM approach performs best, with an error rate of around 0.40%, while the other approaches range between 1.26% and 2.31%.
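The idea of padding two OCR outputs with an epsilon symbol so that they have equal length can be sketched with plain dynamic programming over the three edit operations. This is a simplified stand-in that does not use WFSTs or an LSTM and is not the paper's implementation; the function name `align` and the symbol `EPS` are hypothetical.

```python
EPS = "ε"  # assumed placeholder symbol for "no character"


def align(a, b):
    """Align two strings with insertion, deletion, and substitution.

    Returns both strings padded with EPS to equal length, plus the edit distance.
    """
    n, m = len(a), len(b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion from a
                dist[i][j - 1] + 1,         # insertion into a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            out_a.append(a[i - 1])
            out_b.append(EPS)
            i -= 1
        else:
            out_a.append(EPS)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), dist[n][m]


# Hypothetical usage: two OCR readings of the same word.
aligned_a, aligned_b, distance = align("recognition", "recogniton")
```

Once the strings have equal length, position k in every aligned output refers to the same character slot, which is what allows a voting scheme or a sequence model to compare the engines character by character.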
This six-volume set LNCS 14804-14809 constitutes the proceedings of the 18th International Conference on Document Analysis and Recognition, ICDAR 2024, held in Athens, Greece, during August 30–September 4, 2024. The 144 full papers presented in these proceedings were carefully selected from 263 submissions. The papers cover topics such as document image processing; physical and logical layout analysis; text and symbol recognition; handwriting recognition; document analysis systems; document classification; indexing and retrieval of documents; document synthesis; extracting document semantics; NLP for document understanding; office automation; graphics recognition; human document interaction; document representation modeling; and much more.
In this paper, we propose a blended Attention-Connectionist Temporal Classification (CTC) network architecture for text-image recognition of a unique script, Amharic. Amharic is an indigenous Ethiopic script that uses 34 consonant characters, each with 7 vowel variants, and 50 labialized characters which are derived, with small changes, from the 34 consonant characters. The change involves modifying the structure of these characters by adding a straight line, or by shortening and/or elongating one of the main legs, including the addition of small diacritics to the right, left, top, or bottom of the character. Such small changes affect the orthographic identities of characters and result in shape similarity among characters, which makes recognition an interesting but challenging task for OCR research. Motivated by the recent success of the attention mechanism on neural machine translation tasks, we propose an attention-based CTC approach which is designed by blending the attention mechanism directly within the CTC network. The proposed model consists of an encoder module, an attention module, and a transcription module in a unified framework. The efficacy of the proposed model on the Amharic language shows that the attention mechanism allows learning powerful representations by integrating information from different time steps. Our method outperforms state-of-the-art methods and achieves character error rates of 1.04% and 0.93% on the ADOCR test datasets.
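For readers unfamiliar with CTC, the sketch below shows the standard best-path (greedy) decoding rule that a CTC transcription step can use: take the most likely label per time step, merge consecutive repeats, and drop the blank label. It is a generic illustration rather than the proposed model; the blank index, the toy alphabet, and the name `ctc_greedy_decode` are assumptions.

```python
BLANK = 0  # assumed index of the CTC blank label


def ctc_greedy_decode(frame_scores, alphabet):
    """Best-path CTC decoding over per-frame label scores."""
    # Pick the highest-scoring label index for every time step.
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    decoded = []
    previous = None
    for idx in best_path:
        # Emit a character only when the label changes and is not the blank.
        if idx != previous and idx != BLANK:
            decoded.append(alphabet[idx])
        previous = idx
    return "".join(decoded)


# Hypothetical usage: index 0 is the blank, followed by two placeholder characters.
alphabet = ["", "a", "b"]
frames = [
    [0.1, 0.8, 0.1],  # a
    [0.1, 0.8, 0.1],  # a (repeat, merged with the previous frame)
    [0.8, 0.1, 0.1],  # blank (separates the two a's)
    [0.1, 0.8, 0.1],  # a
    [0.1, 0.1, 0.8],  # b
    [0.1, 0.1, 0.8],  # b (repeat, merged)
    [0.8, 0.1, 0.1],  # blank
]
text = ctc_greedy_decode(frames, alphabet)
```

The blank between the second and fourth frames is what lets the decoder output a doubled character; without it, the repeated label would collapse into a single one.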