Digitala Vetenskapliga Arkivet

Found speech and humans in the loop: Ways to gain insight into large quantities of speech
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology. ORCID iD: 0000-0003-1262-4876
2022 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Found data - data used for something other than the purpose for which it was originally collected - holds great value in many regards. It typically has high ecological validity and strong cultural worth, and it exists in significant quantities. However, it is noisy, hard to search through, and its contents are often largely unknown. This thesis explores ways to gain insight into such data collections, specifically with regard to speech and audio data.

In recent years, deep learning approaches have shown unrivaled performance in many speech and language technology tasks. However, in addition to large datasets, many of these methods require vast quantities of high-quality labels, which are costly to produce. Moreover, while there are exceptions, machine learning models are typically trained for solving well-defined, narrow problems and perform inadequately in tasks of more general nature - such as providing a high-level description of the contents in a large audio file. This observation reveals a methodological gap that this thesis aims to fill.

An ideal system for tackling these matters would combine humans' flexibility and general intelligence with machines' processing power and pattern-finding capabilities. With this idea in mind, the thesis explores the value of including the human-in-the-loop, specifically in the context of gaining insight into collections of found speech. The aim is to combine techniques from speech technology, machine learning paradigms, and human-in-the-loop approaches, with the overall goal of developing and evaluating novel methods for efficiently exploring large quantities of found speech data.

One of the main contributions is Edyson, a tool for fast browsing, exploring, and annotating audio. It uses temporally disassembled audio, a technique that decouples the audio from the temporal dimension, in combination with feature extraction methods, dimensionality reduction algorithms, and a flexible listening function, which allows a user to get an informative overview of the contents.
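As a rough illustration of the idea, temporal disassembly amounts to slicing a recording into short snippets and describing each snippet by a few acoustic features, so that snippets with similar content end up close together on a map. The sketch below is a toy example, not Edyson's actual implementation: the function names, the 500 ms snippet length, and the two hand-picked features (RMS energy and zero-crossing rate) are invented for illustration; Edyson uses richer features and dedicated dimensionality reduction algorithms.

```python
import math

def disassemble(signal, sr, snippet_ms=500):
    """Split a waveform into fixed-length snippets (temporal disassembly)."""
    n = int(sr * snippet_ms / 1000)
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]

def features(snippet):
    """Two toy features per snippet: RMS energy and zero-crossing rate.
    Each snippet thus becomes a point in a 2-D feature space."""
    rms = math.sqrt(sum(x * x for x in snippet) / len(snippet))
    zcr = sum(1 for a, b in zip(snippet, snippet[1:]) if a * b < 0) / len(snippet)
    return (rms, zcr)

# A 2-second toy signal: 1 s of a 440 Hz tone followed by 1 s of near-silence.
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
silence = [0.001 * math.sin(2 * math.pi * 50 * t / sr) for t in range(sr)]
points = [features(s) for s in disassemble(tone + silence, sr)]
# Tone snippets land at high energy, silence snippets at low energy, so the
# two kinds of content separate on the map without any labels.
```

Listening to the snippets behind a region of such a map is what turns the visualisation into a browsing tool.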

Furthermore, crowdsourcing is explored in the context of large-scale perception studies and speech and language data collection. Prior reports on the usefulness of crowd workers for such tasks are promising and are corroborated here.

The thesis contributions suggest that the explored approaches are promising options for utilizing large quantities of found audio data and deserve further consideration in research and applied settings.

Abstract [sv]

Found data - data used for something other than the purpose for which it was first collected - is valuable in many respects. It typically reflects high ecological validity, it carries strong cultural worth, and there are large quantities available. However, it is full of noise, hard to get an overview of, and its contents are often unclear. This thesis explores methods that provide insight into such data collections, specifically with regard to speech and audio.

In recent years, deep learning has produced unrivalled results in speech and language technology. Many of these methods, however, require vast quantities of annotated data, which is costly to create. Moreover, machine learning models are typically trained with well-defined problems in mind and perform worse on more general tasks - such as providing an overall description of the contents of a large audio file. This observation points to a gap in existing methodologies, and thus to a need for further techniques, which this thesis aims to address.

An ideal approach to these problems combines the flexibility and general intelligence of a human with the computational power and pattern-recognition ability of a machine. Based on these ideas, the thesis explores the value of including the human in the loop, specifically for gaining insight into large collections of found speech. The main idea is thus to combine techniques from speech technology, machine learning paradigms, and human-in-the-loop methods, with the overall goal of developing and evaluating new methods for exploring large quantities of found speech data.

A primary contribution is Edyson, a tool for rapid listening and annotation of audio. It builds on temporal disassembly of audio in combination with feature extraction, dimensionality reduction algorithms, and a flexible listening function, which gives a user an informative overview of the contents.

Furthermore, crowdsourcing is examined in the context of large-scale perception studies and the collection of speech and language data. Earlier reports on the usefulness of crowd workers are corroborated by the contributions of the thesis.

The thesis contributions show that the investigated methods are promising options for exploring large quantities of found audio data and deserve further attention.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2022, p. 83
Series
TRITA-EECS-AVL ; 2022:13
Keywords [en]
Found data, found speech, human-in-the-loop, sound browsing, dimensionality reduction, visualization, crowdsourcing
National Category
Language Technology (Computational Linguistics)
Research subject
Speech and Music Communication
Identifiers
URN: urn:nbn:se:kth:diva-309031
ISBN: 978-91-8040-143-2 (print)
OAI: oai:DiVA.org:kth-309031
DiVA id: diva2:1639119
Public defence
2022-03-18, Kollegiesalen, https://kth-se.zoom.us/j/62813774919, Brinellvägen 8, Stockholm, 14:00 (English)
Opponent
Supervisors
Note

QC 20220222

Available from: 2022-02-22 Created: 2022-02-18 Last updated: 2022-06-25. Bibliographically approved
List of papers
1. Bringing order to chaos: A non-sequential approach for browsing large sets of found audio data
2019 (English) In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), 2019, p. 4307-4311. Conference paper, Published paper (Refereed)
Abstract [en]

We present a novel and general approach for fast and efficient non-sequential browsing of sound in large archives that we know little or nothing about, e.g. so-called found data - data not recorded with the specific purpose of being analysed or used as training data. Our main motivation is to address some of the problems speech and speech technology researchers face when they try to capitalise on the huge quantities of speech data that reside in public archives. Our method combines audio browsing through massively multi-object sound environments with a well-known unsupervised dimensionality reduction algorithm, the self-organising map (SOM). We test the process chain on four data sets of different nature (speech, speech and music, farm animals, and farm animals mixed with farm sounds). The methods are shown to combine well, resulting in rapid and readily interpretable observations. Finally, our initial results are demonstrated in prototype software which is freely available.
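A minimal sketch of the dimensionality-reduction step, assuming a 1-D self-organising map trained on two-dimensional snippet features. The node count, learning rate, toy data, and function names here are invented for illustration; the paper's actual feature set and SOM configuration differ.

```python
import math
import random

def train_som(data, n_nodes=4, epochs=200, lr=0.5, seed=1):
    """Train a tiny 1-D self-organizing map: each node holds a weight vector;
    the best-matching node and its neighbours move toward each input, so
    nearby nodes end up representing similar sounds."""
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        radius = max(1.0 - epoch / epochs, 0.01)  # shrinking neighbourhood
        for x in data:
            bmu = min(range(n_nodes),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(nodes[i], x)))
            for i in range(n_nodes):
                h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                nodes[i] = [w + lr * h * (a - w) for w, a in zip(nodes[i], x)]
    return nodes

def project(x, nodes):
    """Map a feature vector to the index of its best-matching node."""
    return min(range(len(nodes)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(nodes[i], x)))

# Two well-separated feature clusters (e.g. speech-like vs. music-like frames).
data = [(0.1, 0.1), (0.12, 0.08), (0.9, 0.9), (0.88, 0.92)]
nodes = train_som(data)
# Points from the same cluster land on the same (or an adjacent) node,
# while the two clusters land on different nodes.
```

Colouring the map by node index is one simple way to give a browser an at-a-glance overview of where the different kinds of sound live.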

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2019
Keywords
Data visualisation, Found data, Speech archives
National Category
Media Engineering
Identifiers
urn:nbn:se:kth:diva-241799 (URN)
000725545004063
2-s2.0-85059880464 (Scopus ID)
Conference
11th International Conference on Language Resources and Evaluation, LREC 2018, Phoenix Seagaia Conference Center, Miyazaki, Japan, 7 May 2018 through 12 May 2018
Note

Part of proceedings: ISBN 979-10-95546-00-9

QC 20230206

Available from: 2019-01-25 Created: 2019-01-25 Last updated: 2023-02-06. Bibliographically approved
2. Towards fast browsing of found audio data: 11 presidents
2019 (English)In: CEUR Workshop Proceedings, CEUR-WS , 2019, p. 133-142Conference paper, Published paper (Refereed)
Abstract [en]

Our aim is to rapidly explore prohibitively large audio collections by exploiting the insight that people are able to make fast judgments about lengthy recordings by listening to temporally disassembled audio (TDA) segments played simultaneously. We have previously shown the proof of concept; here we develop the method and corroborate its usefulness. We conduct an experiment with untrained human annotators and show that they are able to place meaningful annotations on a completely unknown 8-hour corpus in a matter of minutes. The audio is temporally disassembled and spread out over a 2-dimensional map. Participants explore the resulting soundscape by hovering over different regions with a mouse. As a corpus, we used a collection of 11 State of the Union addresses given by 11 different US presidents, spread over half a century in time. The results confirm that (a) participants can distinguish between different regions and are able to describe the general contents of these regions; (b) the regions identified serve as labels describing the contents of the original audio collection; and (c) the regions and labels can be used to segment the temporally reassembled audio into categories. We include an evaluation of the last step for completeness.

Place, publisher, year, edition, pages
CEUR-WS, 2019
Keywords
Dimensionality reduction, Found data, Self-organizing maps, Speech processing, Visualisation, Conformal mapping, Audio data, Proof of concept, Soundscapes
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-280816 (URN)2-s2.0-85066015671 (Scopus ID)
Conference
4th Conference on Digital Humanities in the Nordic Countries, DHN 2019, 5-8 March 2019, Copenhagen, Denmark
Note

QC 20200914

Available from: 2020-09-14 Created: 2020-09-14 Last updated: 2022-06-25. Bibliographically approved
3. How to annotate 100 hours in 45 minutes
2019 (English) In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISCA, 2019, p. 341-345. Conference paper, Published paper (Refereed)
Abstract [en]

Speech data found in the wild hold many advantages over artificially constructed speech corpora in terms of ecological validity and cultural worth. Perhaps most importantly, there is a lot of it. However, the combination of great quantity, noisiness and variation poses a challenge for its access and processing. Generally speaking, automatic approaches to tackle the problem require good labels for training, while manual approaches require time. In this study, we provide further evidence for a semi-supervised, human-in-the-loop framework that has previously shown promising results for browsing and annotating large quantities of found audio data quickly. The findings of this study show that a 100-hour subset of the Fearless Steps corpus can be annotated for speech activity in less than 45 minutes, a fraction of the time traditional annotation methods would require, without a loss in performance.
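One way to picture how a few minutes of human effort can cover many hours of audio: the human labels a handful of representative snippets, and those judgements are spread to every remaining snippet by feature similarity. The sketch below is a deliberately simplified stand-in (1-nearest-neighbour propagation on invented toy features), not the actual workflow of the paper:

```python
def propagate_labels(snippet_features, seeds):
    """Spread a handful of human labels to every snippet by nearest
    labelled example (1-NN): one human judgement per region of the
    feature space stands in for exhaustive annotation."""
    out = []
    for f in snippet_features:
        nearest = min(seeds,
                      key=lambda s: sum((a - b) ** 2 for a, b in zip(f, s[0])))
        out.append(nearest[1])  # inherit the label of the closest seed
    return out

# Six snippet feature vectors; a human labels only two of them.
snippets = [(0.9, 0.2), (0.85, 0.25), (0.8, 0.3),   # speech-like region
            (0.1, 0.9), (0.15, 0.85), (0.2, 0.8)]   # noise-like region
seeds = [((0.9, 0.2), "speech"), ((0.1, 0.9), "non-speech")]
labels = propagate_labels(snippets, seeds)
# → ['speech', 'speech', 'speech', 'non-speech', 'non-speech', 'non-speech']
```

The same two human judgements would cover the region no matter how many snippets fall into it, which is why the human time cost grows with the number of distinct regions rather than the length of the audio.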

Place, publisher, year, edition, pages
ISCA, 2019
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-268304 (URN)
10.21437/Interspeech.2019-1648 (DOI)
000831796400069
2-s2.0-85074718085 (Scopus ID)
Conference
Interspeech 2019, 15-19 September 2019, Graz, Austria
Note

QC 20200310

Available from: 2020-03-10 Created: 2020-03-10 Last updated: 2022-09-23. Bibliographically approved
4. Human-in-the-Loop Efficiency Analysis for Binary Classification in Edyson
2021 (English) In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, 2021, p. 3685-3689. Conference paper, Published paper (Refereed)
Abstract [en]

Edyson is a human-in-the-loop (HITL) tool for browsing and annotating large amounts of audio data quickly. It builds on temporally disassembled audio and massively multi-component audio environments to overcome the cumbersome time constraints that come with linear exploration of large audio data. This study adds the following contributions to Edyson: 1) we add the new use case of HITL binary classification by sample; 2) we explore the new domain of oceanic hydrophone recordings with whale song, along with speech activity detection in noisy audio; 3) we propose a repeatable method of analysing the efficiency of HITL in Edyson for binary classification, specifically designed to measure the return on human time spent in a given domain. We exemplify this method on two domains and show that, for a manageable initial cost in terms of HITL, it does differentiate between suitable and unsuitable domains for our new use case - a valuable insight when working with large collections of audio.
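The "return on human time" idea can be illustrated with a toy measure. This is an invented example, not the metric used in the paper: it simply counts correct binary decisions gained per minute of human effort, which is the kind of quantity such an efficiency analysis compares across domains.

```python
def hitl_return(decisions, truth, seconds_spent):
    """Toy 'return on human time' measure for HITL binary classification:
    correct decisions obtained per minute of human listening."""
    correct = sum(d == t for d, t in zip(decisions, truth))
    return correct / (seconds_spent / 60.0)

# A hypothetical session: 90 of 100 snippets classified correctly in 5 minutes.
decisions = [1] * 90 + [0] * 10
truth = [1] * 100
rate = hitl_return(decisions, truth, seconds_spent=300)
# → 18.0 correct decisions per human-minute
```

A domain where this rate stays flat as effort grows would be a poor fit for the use case; a domain where it climbs quickly after a small initial cost would be a good one.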

Place, publisher, year, edition, pages
International Speech Communication Association, 2021
Series
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISSN 2308-457X; 6
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-309024 (URN)
10.21437/interspeech.2021-45 (DOI)
000841879503159
2-s2.0-85119260766 (Scopus ID)
Conference
22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, Brno, 30 August through 3 September 2021
Note

QC 20221108

Part of proceedings: ISBN 978-171383690-2

Available from: 2022-02-18 Created: 2022-02-18 Last updated: 2022-11-08. Bibliographically approved
5. Edyson: rapid human-in-the-loop browsing, exploration and annotation of large speech and audio data
(English) Manuscript (preprint) (Other academic)
Abstract [en]

The audio exploration tool Edyson integrates a variety of techniques to enable efficient exploration of otherwise prohibitively large collections of speech and other sounds. A main strength is that this combination of techniques allows us to place a human in the loop in a coherent and operationalised manner.

The two most prominent techniques that we incorporate are temporally disassembled audio (TDA) and massively multi-component audio environments (MMAE). The first allows us to decouple input audio from the temporal dimension by segmenting it into sound snippets of short duration, akin to the frames used in signal processing. These snippets are organised and visualised in an interactive interface where the investigator can navigate through the snippets freely while providing labels and judgements that are not tied to the temporal context of the original audio. This, in turn, removes the real-time or near real-time requirement associated with temporally linear audio browsing.

We further argue that a human-in-the-loop inclusion, as opposed to fully automated black-box approaches, is valuable and perhaps necessary to understand and fully exploit larger quantities of found speech. 

We describe in this paper the details of the tool and its underlying methodologies, and provide a summary of results and findings that have come out of our efforts to validate and quantify the characteristics of this new type of audio browsing to date.

National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-309026 (URN)
Note

QC 20220314

Available from: 2022-02-18 Created: 2022-02-18 Last updated: 2022-06-25. Bibliographically approved
6. The audio cocktail as a sound browsing tool - a crowdsourcing based validation
(English) Manuscript (preprint) (Other academic)
Abstract [en]

We conduct two crowdsourcing experiments designed to examine the usefulness of audio cocktails for quickly finding out information about the contents of large audio data. Several thousand crowd workers were engaged to listen to audio cocktails with systematically varied composition. They were then asked to state either which sound out of four categories (Children, Women, Men, Orchestra) they heard the most of, or whether they heard anything of a specific category at all. The results show that their responses have high reliability and provide information as to whether a specific task can be performed using audio cocktails. We also propose that the combination of crowd workers and audio cocktails can be used directly as a tool to investigate the contents of large audio data.
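The core mechanism of an audio cocktail is simply playing many snippets simultaneously, which can be sketched as a sample-wise sum with peak normalisation. This is an illustrative toy mix, not the exact stimulus-generation procedure used in the experiments:

```python
def audio_cocktail(snippets):
    """Mix many snippets into one simultaneous 'audio cocktail': sum the
    waveforms sample-wise, then normalise so the mix stays in [-1, 1]."""
    length = max(len(s) for s in snippets)
    mix = [0.0] * length
    for s in snippets:
        for i, x in enumerate(s):
            mix[i] += x
    peak = max(abs(x) for x in mix) or 1.0  # avoid dividing by zero on silence
    return [x / peak for x in mix]

# Three toy snippets played at once; shorter ones simply end early.
a = [0.5, -0.5, 0.5, -0.5]
b = [0.2, 0.2, 0.2, 0.2]
c = [-0.1, 0.3]
mix = audio_cocktail([a, b, c])
```

Varying which categories contribute snippets to the mix, and in what proportion, is what lets a listener judge "what there is most of" from a single short stimulus.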

National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-309029 (URN)
Note

QC 20220310

Available from: 2022-02-18 Created: 2022-02-18 Last updated: 2022-06-25. Bibliographically approved
7. Crowdsourcing a self-evolving dialog graph
2019 (English) In: CUI '19: Proceedings of the 1st International Conference on Conversational User Interfaces, Association for Computing Machinery (ACM), 2019, article id 14. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper we present a crowdsourcing-based approach for collecting dialog data for a social chat dialog system, which gradually builds a dialog graph from actual user responses and crowd-sourced system answers, conditioned by a given persona and other instructions. This approach was tested during the second instalment of the Amazon Alexa Prize 2018 (AP2018), both for the data collection and to feed a simple dialog system which would use the graph to provide answers. As users interacted with the system, a graph which maintained the structure of the dialogs was built, identifying parts where more coverage was needed. In an offline evaluation, we compared the corpus collected during the competition with other potential corpora for training chatbots, including movie subtitles, online chat forums and conversational data. The results show that the proposed methodology creates data that is more representative of actual user utterances, and leads to more coherent and engaging answers from the agent. An implementation of the proposed method is available as open-source code.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2019
Series
ACM International Conference Proceeding Series
Keywords
Crowdsourcing, Datasets, Dialog systems, Human-computer interaction
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-266061 (URN)
10.1145/3342775.3342790 (DOI)
000525446900014
2-s2.0-85075882531 (Scopus ID)
9781450371872 (ISBN)
Conference
1st International Conference on Conversational User Interfaces, CUI 2019, Dublin, Ireland, 22-23 August 2019
Note

QC 20200114

Available from: 2020-01-14 Created: 2020-01-14 Last updated: 2024-03-15. Bibliographically approved

Open Access in DiVA

Kappa (11582 kB), 364 downloads
File information
File name: FULLTEXT03.pdf
File size: 11582 kB
Checksum (SHA-512): 7bbefc749d712bc5b0fc65c7aaf70f360b10505d3b46db958b693e331121f801ae737e4712b1108f9ad55a900b1d29e85a458c6338c48dc8592703f2f4b9e177
Type: fulltext; Mimetype: application/pdf

Search in DiVA

By author/editor
Fallgren, Per
By organisation
Speech Communication and Technology
Language Technology (Computational Linguistics)

