Digitala Vetenskapliga Arkivet

Mapping voice quality in normal, pathological and synthetic voices
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-0700-7216
2025 (English) Doctoral thesis, comprehensive summary (Other academic)
Sustainable development
SDG 3: Good Health and Well-Being
Abstract [en]

Voice quality evaluation is an integral aspect of both clinical and technological applications, encompassing areas such as speech therapy, phonation disorder diagnosis, and text-to-speech (TTS) synthesis. Traditional methods of assessing voice quality are often subjective, relying on auditory-perceptual evaluation scales, which introduce variability and bias. This thesis explores several novel applications for objective voice quality assessment, utilizing voice mapping—a visualization technique that integrates voice range and quality metrics. By plotting acoustic and electroglottographic (EGG) metrics across a plane defined by fundamental frequency (fo) and sound pressure level (SPL), voice mapping enables a comprehensive understanding of vocal characteristics.

This thesis is based on a compilation of five studies, three of which have been published in archival journals and two of which are in revision at the time of writing. Paper I establishes the foundational relationship between voice metrics and both fo and SPL, using data from individuals with vocal disorders. Paper II extends the methodology by employing clustering techniques to classify phonation types based on a diverse dataset of normophonic adults and children. Paper III applies voice mapping to pre- and post-thyroidectomy recordings, revealing surgery-induced changes in voice quality and range. Paper IV develops a deep learning-based model for predicting EGG signals from acoustic recordings. Paper V demonstrates the utility of voice mapping in evaluating the performance of synthetic TTS voices, indicating its potential for objective, metric-based TTS quality assessment.

This thesis further underscores the importance of integrating acoustic and EGG metrics to achieve an objective assessment of voice quality. The metrics used in this approach, including acoustic and EGG-based measures, capture aspects of phonation in both the time and frequency domains, enabling detailed characterization of vocal dynamics. The findings demonstrate that voice mapping is not only effective in clinical settings for understanding voice disorders but also offers a robust framework for evaluating synthetic voices, helping to bridge the gap between perceptual evaluation and quantitative analysis. Future directions include refining clustering methodologies, enhancing EGG prediction accuracy, and extending voice mapping to broader clinical and technological applications.
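As a rough illustration of the voice-mapping idea described in the abstract, the sketch below bins per-cycle metric values into cells of an fo × SPL plane and averages them per cell. The cell size (1 semitone by 1 dB), the MIDI-style semitone scale, and the names `hz_to_semitone` and `build_voice_map` are assumptions made for this example; they are not taken from the thesis.

```python
import math
from collections import defaultdict


def hz_to_semitone(fo_hz):
    """Convert fo in Hz to a MIDI-style semitone number (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(fo_hz / 440.0)


def build_voice_map(cycles):
    """Average a metric over cells of the fo x SPL plane.

    cycles: iterable of (fo_hz, spl_db, metric_value) tuples, one per
            analysed phonatory cycle (hypothetical input layout).
    Returns {(semitone, spl_db): (mean_metric, cycle_count)}.
    Cell size is an assumption: 1 semitone x 1 dB.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for fo_hz, spl_db, value in cycles:
        cell = (round(hz_to_semitone(fo_hz)), round(spl_db))
        sums[cell] += value
        counts[cell] += 1
    return {cell: (sums[cell] / counts[cell], counts[cell]) for cell in sums}
```

Each cell of the returned dictionary can then be drawn as one coloured square, with the colour encoding the mean metric value and the count indicating how much data supports that cell.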

Abstract [sv]

Voice quality evaluation is an important aspect of both clinical and technological applications, covering areas such as speech therapy, diagnosis of phonation disorders, and text-to-speech (TTS) synthesis. Traditional methods of assessing voice quality are often subjective and rely on auditory-perceptual rating scales, which can lead to variability and bias. This thesis explores several new applications for objective voice quality assessment using voice mapping, a visualization technique that integrates voice range and quality metrics. By plotting acoustic and electroglottographic (EGG) metrics on a plane defined by fundamental frequency (fo) and sound pressure level (SPL), voice mapping enables a comprehensive understanding of voice characteristics.

This thesis is based on a compilation of five studies, three of which have been published in archival journals and two of which are under review at the time of writing. Paper I demonstrates the fundamental dependence of voice metrics on fundamental frequency and sound pressure level, through analysis of data from individuals with voice disorders. Paper II extends the methodology by using clustering techniques to classify phonation types based on a corpus of recordings of normophonic adults and children. Paper III applies voice mapping to recordings made before and after thyroidectomy and demonstrates surgery-related changes in voice quality and range. Paper IV describes a deep learning-based model for predicting EGG signals from acoustic recordings. Paper V shows the usefulness of voice mapping in evaluating synthetic TTS voices and highlights its potential for objective, metric-based assessment of TTS quality.

This thesis highlights the importance of integrating acoustic and EGG metrics to achieve an objective assessment of voice quality. The metrics used in this methodology, including acoustic and EGG-based metrics, capture aspects of phonation in both the time and frequency domains, which enables a detailed characterization of voice dynamics. The results show that voice mapping is not only effective in clinical settings for understanding voice disorders but also offers a robust framework for evaluating synthetic voices. Voice maps contribute to the synergy between perceptual and quantitative voice analysis. Future research directions include improving clustering methods, increasing the accuracy of EGG prediction, and extending voice mapping to broader clinical and technological applications.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2025, p. 72
Series
TRITA-EECS-AVL ; 2025:25
Keywords [en]
Voice quality, voice mapping, text-to-speech (TTS), electroglottography (EGG), acoustic metrics
Keywords [sv]
Voice quality, voice mapping, text-to-speech (TTS), electroglottography (EGG), acoustic metrics
National Category
Signal Processing
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-360211
ISBN: 978-91-8106-206-9 (print)
OAI: oai:DiVA.org:kth-360211
DiVA, id: diva2:1939381
Public defence
2025-03-14, https://kth-se.zoom.us/j/61856204062?pwd=0aLP1ptM9OMUaaXUFuSBxV6bbu74iO.1, Rum B:218, Q2, Malvinas Väg 10, KTH Campus, Stockholm, 14:00 (English)
Opponent
Supervisors
Note

QC 20250224

Available from: 2025-02-24 Created: 2025-02-21 Last updated: 2025-03-05 Bibliographically approved
List of papers
1. Effects of Speech Characteristics on Electroglottographic and Instrumental Acoustic Voice Analysis Metrics in Women With Structural Dysphonia Before and After Treatment
2024 (English) In: Journal of Speech, Language and Hearing Research, ISSN 1092-4388, E-ISSN 1558-9102, Vol. 67, no 6, p. 1660-1681. Article in journal (Refereed) Published
Abstract [en]

Purpose: The literature suggests that the acoustic metrics smoothed cepstral peak prominence (CPPS) and harmonics-to-noise ratio (HNR) depend on voice loudness and fundamental frequency (fo). Although this dependency has been attributed to different oscillatory patterns of the vocal folds, it has not yet been specifically investigated. The present work examined the influence of three elicitation levels, calibrated sound pressure level (SPL), fo, and vowel on the electroglottographic (EGG) and time-differentiated EGG (dEGG) metrics hybrid open quotient (OQ), dEGG OQ, and peak dEGG, as well as on the acoustic metrics CPPS and HNR, and evaluated their suitability for voice assessment.

Method: In a retrospective study, 29 women with a mean age of 25 years (± 8.9, range: 18–53) diagnosed with structural vocal fold pathologies were examined before and after voice therapy or phonosurgery. Acoustic and EGG signals were recorded simultaneously during phonation of the sustained vowels /ɑ/, /i/, and /u/ at three elicited levels of loudness (soft/comfortable/loud) and unconstrained fo conditions.

Results: A linear mixed-model analysis showed a significant effect of elicitation effort level on peak dEGG, HNR, and CPPS (all p < .01). Calibrated SPL significantly influenced HNR and CPPS (both p < .01). Furthermore, fo had a significant effect on peak dEGG and CPPS (p < .0001). All metrics showed significant changes with regard to vowel (all p < .05). However, the treatment had no effect on the examined metrics, regardless of the treatment type (surgery vs. voice therapy).

Conclusions: The value of the investigated metrics for voice assessment is limited when they are sampled without sufficient control of SPL and fo, because they are significantly influenced by the phonatory context, be it speech or elicited sustained vowels. Future studies should explore the diagnostic value of new data collation approaches such as voice mapping, which take SPL and fo effects into account.

Place, publisher, year, edition, pages
American Speech Language Hearing Association, 2024
National Category
Otorhinolaryngology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-346605 (URN)
10.1044/2024_JSLHR-23-00253 (DOI)
001245110000002 ()
38758676 (PubMedID)
2-s2.0-85192238446 (Scopus ID)
Note

QC 20240703

Available from: 2024-05-20 Created: 2024-05-20 Last updated: 2025-02-21 Bibliographically approved
2. Mapping Phonation Types by Clustering of Multiple Metrics
2022 (English) In: Applied Sciences, ISSN 2076-3417, Vol. 12, no 23, p. 12092. Article in journal (Refereed) Published
Abstract [en]

For voice analysis, much work has been undertaken with a multitude of acoustic and electroglottographic metrics. However, few of these have proven to be robustly correlated with physical and physiological phenomena. In particular, all metrics are affected by the fundamental frequency and sound level, making voice assessment sensitive to the recording protocol. It was investigated whether combinations of metrics, acquired over voice maps rather than with individual sustained vowels, can offer a more functional and comprehensive interpretation. For this descriptive, retrospective study, 13 men, 13 women, and 22 children were instructed to phonate on /a/ over their full voice range. Six acoustic and EGG signal features were obtained for every phonatory cycle. An unsupervised voice classification model created feature clusters, which were then displayed on voice maps. It was found that the feature clusters may be readily interpreted in terms of phonation types. For example, the typical intense voice has a high peak EGG derivative, a relatively high contact quotient, low EGG cycle-rate entropy, and a high cepstral peak prominence in the voice signal, all represented by one cluster centroid that is mapped to a given color. In a transition region between the non-contacting and contacting of the vocal folds, the combination of metrics shows a low contact quotient and relatively high entropy, which can be mapped to a different color. Based on this data set, male phonation types could be clustered into up to six categories and female and child types into four. Combining acoustic and EGG metrics resolved more categories than either kind on their own. The inter- and intra-participant distributional features are discussed.
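The abstract above says only that an unsupervised classification model grouped per-cycle feature vectors into clusters. As one possible illustration, the sketch below uses plain k-means; this choice, the function name `kmeans`, and the caller-supplied initial centroids are assumptions for the example and should not be read as the paper's actual method. In practice the features would normally be scaled to comparable ranges before clustering.

```python
import math


def kmeans(points, init_centroids, iterations=50):
    """Plain k-means on equal-length feature tuples (illustrative only).

    points:         list of feature tuples, e.g. one per phonatory cycle
    init_centroids: list of k starting centroids chosen by the caller
    Returns (centroids, labels), where labels[i] is the index of the
    centroid nearest to points[i].
    """
    centroids = [tuple(c) for c in init_centroids]
    k = len(centroids)

    def nearest(p):
        # Index of the centroid with the smallest Euclidean distance to p.
        return min(range(k), key=lambda c: math.dist(p, centroids[c]))

    for _ in range(iterations):
        labels = [nearest(p) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # New centroid is the per-dimension mean of its members.
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Keep an empty cluster's centroid where it was.
                updated.append(centroids[c])
        if updated == centroids:
            break
        centroids = updated

    return centroids, [nearest(p) for p in points]
```

Once every cycle has a cluster label, the most frequent label in each fo × SPL cell can be mapped to a colour, which gives the kind of phonation-type map the abstract describes.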

Place, publisher, year, edition, pages
MDPI AG, 2022
Keywords
voice analysis, voice range profile, clustering, phonation, phonation type
National Category
Medical Laboratory Technologies; Otorhinolaryngology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-322053 (URN)
10.3390/app122312092 (DOI)
000910824700001 ()
2-s2.0-85142534999 (Scopus ID)
Funder
KTH Royal Institute of Technology, CSC-2020-2009
Note

QC 20230214

Available from: 2022-11-29 Created: 2022-11-29 Last updated: 2025-02-21 Bibliographically approved
3. Effects on Voice Quality of Thyroidectomy: A Qualitative and Quantitative Study Using Voice Maps
2024 (English) In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588. Article in journal (Refereed) Epub ahead of print
Abstract [en]

Objectives: This study aims to explore the effects of thyroidectomy, a surgical intervention involving the removal of the thyroid gland, on voice quality, as represented by acoustic and electroglottographic measures. Given the thyroid gland's proximity to the inferior and superior laryngeal nerves, thyroidectomy carries a potential risk of affecting vocal function. While earlier studies have documented effects on the voice range, few have looked at voice quality after thyroidectomy. Since voice quality effects could manifest in many ways that are unknown a priori, we apply an exploratory approach that collects many data points from several metrics.

Methods: A voice-mapping analysis paradigm was applied retrospectively to a corpus of spoken and sung sentences produced by patients who had thyroid surgery. Voice quality changes were assessed objectively for 57 patients prior to surgery and 2 months after surgery, by making comparative voice maps, pre- and post-intervention, of six acoustic and electroglottographic (EGG) metrics.

Results: After thyroidectomy, statistically significant changes consistent with a worsening of voice quality were observed in most metrics. For all individual metrics, however, the effect sizes were too small to be clinically relevant. Statistical clustering of the metrics helped to clarify the nature of these changes. While partial thyroidectomy demonstrated greater uniformity than did total thyroidectomy, the type of perioperative damage had no discernible impact on voice quality.

Conclusions: Changes in voice quality after thyroidectomy were related mostly to increased phonatory instability in both the acoustic and EGG metrics. Clustered voice metrics exhibited a higher correlation to voice complaints than did individual voice metrics.

Place, publisher, year, edition, pages
Elsevier, 2024
Keywords
thyroidectomy, voice quality, electroglottography, voice classification, voice mapping
National Category
Otorhinolaryngology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-346224 (URN)
10.1016/j.jvoice.2024.03.012 (DOI)
2-s2.0-85192255370 (Scopus ID)
Funder
KTH Royal Institute of Technology, 6308
Note

QC 20240508

Available from: 2024-05-07 Created: 2024-05-07 Last updated: 2025-02-21 Bibliographically approved
4. A WaveNet-Based Model for Predicting the Electroglottographic Signal from the Acoustic Voice Signal
(English) Manuscript (preprint) (Other academic)
National Category
Signal Processing
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-360206 (URN)
Note

QC 20250220

Available from: 2025-02-20 Created: 2025-02-20 Last updated: 2025-02-21 Bibliographically approved
5. Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment
(English) Manuscript (preprint) (Other academic)
Abstract [en]

This study investigates voice mapping as an evaluation framework for text-to-speech (TTS) synthesis quality. Six influential TTS models, both historical and contemporary, were analyzed: Merlin, Tacotron 2, Transformer TTS, FastSpeech 2, Glow-TTS, and VITS. The metrics were crest factor, spectrum balance, and cepstral peak prominence (CPPs). The results demonstrate that voice range serves as a primary indicator of model capability, with VITS showing the largest range among the tested models. Glow-TTS exhibited superior performance in soft phonation, indicated by higher spectrum balance, despite its limited voice range. CPPs values of 7-8 dB indicated natural voice quality, whereas speech with CPPs exceeding 10 dB tended to sound robotic. Voice mapping visualization effectively captures subtle quality differences across the voice range.
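Of the three metrics named in the abstract above, crest factor is the simplest to state: the ratio of a signal's peak absolute amplitude to its RMS amplitude, usually expressed in dB. The sketch below computes it for one frame of samples. The function name `crest_factor_db` and the choice to analyse a single frame at a time are assumptions for illustration; the manuscript's own analysis settings are not given in this record.

```python
import math


def crest_factor_db(samples):
    """Crest factor of one frame: 20 * log10(peak / RMS), in dB.

    samples: non-empty sequence of floats (one analysis frame).
    Raises ValueError for an all-zero frame, where the ratio is undefined.
    """
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("crest factor is undefined for a silent frame")
    return 20.0 * math.log10(peak / rms)
```

A pure sine wave has a crest factor of about 3 dB and a square wave 0 dB, which makes both convenient sanity checks before applying the function to speech frames.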

Keywords
TTS, voice mapping, voice quality, CPPs, objective evaluation
National Category
Signal Processing
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-360209 (URN)
Note

This is a manuscript under review in EURASIP Journal on Audio, Speech, and Music Processing.

QC 20250220

Available from: 2025-02-20 Created: 2025-02-20 Last updated: 2025-02-26 Bibliographically approved

Open Access in DiVA

Kappa (2642 kB), 258 downloads
File information
File name: FULLTEXT01.pdf
File size: 2642 kB
Checksum (SHA-512): e3c99bd6023daac59d598e6b82d1f37ab82d8122f61fe564d0bf098952815a172c4a5032c54192698a265d0e38b9d0b6b2da75158b29c6b84247d0f0e9e7d9cf
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Cai, Huanchen
By organisation
Speech, Music and Hearing, TMH
Signal Processing
