Improving Quality of Service in Baseband Speech Communication
KTH, School of Electrical Engineering (EES), Communication Theory.
2014 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Speech is the most important communication modality for human interaction. Automatic speech recognition and speech synthesis have further extended the relevance of speech to man-machine interaction. Environment noise and various distortions, such as reverberation and speech processing artifacts, reduce the mutual information between the message modulated in the clean speech and the message decoded from the observed signal. This degrades intelligibility and perceived quality, which are the two attributes associated with quality of service. An estimate of the state of these attributes provides important diagnostic information about the communication equipment and the environment. When the adverse effects occur at the presentation side, an objective measure of intelligibility facilitates speech signal modification for improved communication.

The contributions of this thesis come from non-intrusive quality assessment and intelligibility-enhancing modification of speech. On the part of quality, the focus is on predictor design for limited training data. Paper A proposes a quality assessment model for bounded-support ratings that learns efficiently from a limited amount of training data, scales easily with the sampling frequency, and provides a platform for modeling variations in the individual subjective ratings. The predictive performance of the model for the mean of the subjective quality ratings compares favorably to the state of the art in the field. Patterns in the spread of the individual ratings are captured in the feature space of the training data.

Paper B focuses on enhancing predictive performance for the mean of the quality variable when the signal feature space is sparsely sampled by the training data. Using a Gaussian Processes framework, the deterministic signal-based feature set is augmented with a stochastic feature that is hypothesized to be jointly distributed with the target quality rating. An uncertainty propagation mechanism ensures that the variance of this feature is reflected in the prediction. The proposed architecture can take advantage of i) data that cannot be pooled due to subjective test protocol incompatibility and ii) models trained on data that are no longer available.

With respect to intelligibility enhancement, a hierarchical perspective of the speech communication process, extended from foundational work in the field, is used in paper C to create a unified framework for method analysis and comparison. A high-level intelligibility measure related to the probability for correct recognition is derived using a hit-or-miss distortion criterion in the transcription domain. The measure is used to optimize two speech modifications at different levels of the message encoding hierarchy leading to significantly enhanced intelligibility in noise. The conceptual novelty of the method comes at the cost of higher complexity and the requirement for additional information including message transcription, sound segmentation, and a model of speech.
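The link between a hit-or-miss distortion criterion and the probability of correct recognition can be written out in one line. The symbols below ($t$ for the transmitted transcription, $\hat{t}$ for the recognized one) are notation assumed here for illustration, not taken from the thesis:

```latex
d(t, \hat{t}) =
\begin{cases}
0, & \hat{t} = t \\
1, & \hat{t} \neq t
\end{cases}
\qquad\Longrightarrow\qquad
\mathbb{E}\big[ d(t, \hat{t}) \big] = 1 - P(\hat{t} = t)
```

Under this criterion, minimizing the expected distortion is the same as maximizing the probability that the recognized transcription matches the transmitted one.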

Mapping the high-level measure to a lower level takes away the need for additional information and preserves asymptotically high-level optimality. Two methods are proposed to reduce degradation in the accuracy of the spectral dynamics due to additive noise. The focus of paper D is dynamics preservation in a range that is lower-bounded by an optimal band-power threshold. The performance of the method is competitive but allows for improvement in power efficiency. This issue is addressed in paper E which proposes and optimizes a distortion measure for spectral dynamics leading to a significant increase in intelligibility. Use of functional optimization techniques allows for families of solutions, among which are dynamic range compressors adaptive to the statistics of the speech and the noise.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2014, xii, 38 p.
National Category
Telecommunications
Research subject
Speech and Music Communication; SRA - ICT
Identifiers
URN: urn:nbn:se:kth:diva-145547
ISBN: 978-91-7595-181-2 (print)
OAI: oai:DiVA.org:kth-145547
DiVA: diva2:718664
Public defence
2014-06-12, L1, Drottning Kristinas väg 30, KTH, Stockholm, 09:00 (English)
Note

QC 20140523

Available from: 2014-05-23 Created: 2014-05-21 Last updated: 2014-05-23 Bibliographically approved
List of papers
1. Probabilistic non-intrusive quality assessment of speech for bounded-scale preference scores
2010 (English). In: 2010 2nd International Workshop on Quality of Multimedia Experience, QoMEX 2010 - Proceedings, 2010, pp. 188-193. Conference paper, Published paper (Refereed)
Abstract [en]

We propose a probabilistic, non-intrusive method for quality assessment of speech that takes into consideration the bounded character of the preference scores. The quality ratings are modeled as i.i.d. Beta random variables, whose mean and precision are parametrized directly in terms of the signal features. Maximum likelihood estimation is used to learn the model parameters in view of a training database. Given a valuation point, the proposed model produces a distribution over the range of allowed quality ratings, which can be used to evaluate the statistics of interest. The model performance, in terms of correlation and root mean square error, compares favorably to the state of the art in the field. Low computational complexity in training and prediction makes the model attractive for a wide range of applications. The usage of band-based features in the feature set facilitates extension of the proposed model to input signals with larger bandwidth.
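A minimal sketch of the Beta-regression idea described in this abstract, not the paper's implementation. It assumes ratings have been rescaled to the open interval (0, 1), a logistic link for the mean and a log link for the precision. The abstract does not state the link functions, and all function and parameter names here are hypothetical.

```python
import math

def beta_params(x, w_mean, w_prec):
    """Map a feature vector to Beta shape parameters (a, b).

    Assumed links: mean = sigmoid(w_mean . x), precision = exp(w_prec . x).
    """
    mu = 1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(w_mean, x))))
    phi = math.exp(sum(v * xi for v, xi in zip(w_prec, x)))
    return mu * phi, (1.0 - mu) * phi

def beta_log_pdf(y, a, b):
    """Log-density of Beta(a, b) at a rating y in (0, 1)."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))

def log_likelihood(features, ratings, w_mean, w_prec):
    """Objective that maximum likelihood estimation would maximize."""
    total = 0.0
    for x, y in zip(features, ratings):
        a, b = beta_params(x, w_mean, w_prec)
        total += beta_log_pdf(y, a, b)
    return total

def predictive_mean_var(x, w_mean, w_prec):
    """Mean and variance of the predicted rating distribution."""
    a, b = beta_params(x, w_mean, w_prec)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var
```

Because the model outputs a full distribution over the bounded rating scale, both the expected rating and the spread of individual ratings can be read from the same fitted parameters.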

Keyword
Beta regression, Maximum likelihood, Non-intrusive quality assessment, Feature sets, Input signal, Model parameters, Model performance, Non-intrusive, Non-intrusive method, Quality assessment, Quality ratings, Root mean square errors, Signal features, Training database, Computational complexity, Random variables, Maximum likelihood estimation
National Category
Computer and Information Science
Identifiers
URN: urn:nbn:se:kth:diva-36316
DOI: 10.1109/QOMEX.2010.5516236
Scopus ID: 2-s2.0-77955761725
ISBN: 9781424469604
Conference
2010 2nd International Workshop on Quality of Multimedia Experience, QoMEX 2010; Trondheim
Note
QC 20110712
Available from: 2011-07-12 Created: 2011-07-11 Last updated: 2014-05-23 Bibliographically approved
2. Feature set augmentation for enhancing the performance of a non-intrusive quality predictor
2012 (English). In: 2012 4th International Workshop on Quality of Multimedia Experience, QoMEX 2012, IEEE, 2012, pp. 121-126. Conference paper, Published paper (Refereed)
Abstract [en]

A non-intrusive quality predictor constitutes a mapping from signal features to a (typically one-dimensional) representation of the perceived quality. Assuming that the regression model performing the mapping is suited to the data, the performance of the predictor largely depends on how well the parameters of this regression model can be inferred from the training data. In situations where the training data is scarce, model performance is degraded due to over-fitting. The effects of over-fitting can be mitigated by feature selection, but the model performance remains low due to the insufficiently representative training data. The objective we pursue is to enhance the performance of a quality predictor by augmenting the feature set with the output of a pre-trained quality predictor. This approach introduces an implicit dependence of the regression model parameters on a larger amount of training data. In view of the increasing usage of speech signals with higher bandwidth, and the dearth of training data for such signals, an augmentation of particular interest is that of a wide-band feature set with a narrow-band quality prediction. Experimental results for additive noise and non-linear distortions encountered in hearing aids, using quality labels from an intrusive quality predictor, illustrate the performance enhancement capabilities of the proposed approach.
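The augmentation step itself can be sketched in a few lines. This is an illustration only: it covers just the construction of the augmented feature set, not the Gaussian process regression or the uncertainty propagation named in the keywords. The function names are hypothetical, and `toy_narrowband_predictor` is a stand-in for a real pre-trained model.

```python
def augment_features(wideband_rows, narrowband_rows, pretrained_predictor):
    """Append a pre-trained predictor's output to each wide-band feature row.

    wideband_rows: feature vectors computed from the wide-band signals.
    narrowband_rows: inputs for the pre-trained narrow-band predictor,
        one per signal, in the same order.
    pretrained_predictor: callable returning a scalar quality estimate.
    """
    if len(wideband_rows) != len(narrowband_rows):
        raise ValueError("one narrow-band input is required per wide-band row")
    return [list(wb) + [pretrained_predictor(nb)]
            for wb, nb in zip(wideband_rows, narrowband_rows)]

def toy_narrowband_predictor(nb_features):
    """Hypothetical stand-in: averages its inputs and clips to a 1-5 scale."""
    score = sum(nb_features) / len(nb_features)
    return min(5.0, max(1.0, score))
```

The regression model is then trained on the augmented rows, so its parameters depend indirectly on whatever data the pre-trained predictor was fitted to, even if those data are no longer available.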

Place, publisher, year, edition, pages
IEEE, 2012
Keyword
input uncertainty, machine learning with Gaussian processes, Non-intrusive quality assessment
National Category
Engineering and Technology
Identifiers
URN: urn:nbn:se:kth:diva-105308
DOI: 10.1109/QoMEX.2012.6263856
Scopus ID: 2-s2.0-84866646261
ISBN: 978-146730725-3
Conference
2012 4th International Workshop on Quality of Multimedia Experience, QoMEX 2012, 5 July 2012 through 7 July 2012, Melbourne, VIC
Funder
ICT - The Next Generation
Note

QC 20121120

Available from: 2012-11-20 Created: 2012-11-20 Last updated: 2014-05-23 Bibliographically approved
3. Maximizing Phoneme Recognition Accuracy for Enhanced Speech Intelligibility in Noise
2013 (English). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 21, no. 5, pp. 1035-1045. Article in journal (Refereed), Published
Abstract [en]

An effective measure of speech intelligibility is the probability of correct recognition of the transmitted message. We propose a speech pre-enhancement method based on matching the recognized text to the text of the original message. The selected criterion is accurately approximated by the probability of the correct transcription given an estimate of the noisy speech features. In the presence of environment noise, and with a decrease in the signal-to-noise ratio, speech intelligibility declines. We implement a speech pre-enhancement system that optimizes the proposed criterion for the parameters of two distinct speech modification strategies under an energy-preservation constraint. The proposed method requires prior knowledge in the form of a transcription of the transmitted message and acoustic speech models from an automatic speech recognition system. Performance results from an open-set subjective intelligibility test indicate a significant improvement over natural speech and a reference system that optimizes a perceptual-distortion-based objective intelligibility measure. The computational complexity of the approach permits use in on-line applications.

Keyword
environment adaptation, intelligibility enhancement, speech pre-enhancement
National Category
Signal Processing
Research subject
SRA - ICT
Identifiers
URN: urn:nbn:se:kth:diva-113210
DOI: 10.1109/TASL.2013.2244089
ISI: 000315287500006
Scopus ID: 2-s2.0-84873926312
Projects
Listening Talker (LISTA)
Funder
EU, European Research Council, 256230
Note

QC 20130325. Updated from accepted to published.

Available from: 2013-01-14 Created: 2013-01-14 Last updated: 2017-12-06 Bibliographically approved
4. Preservation of Speech Spectral Dynamics Enhances Intelligibility
2013 (English). In: Proc. Interspeech 2013, 2013, pp. 3597-3601. Conference paper, Published paper (Other academic)
Abstract [en]

We propose a method for the enhancement of intelligibility in scenarios where speech is rendered in a noisy environment. The method is based on the hypothesis that intelligibility is a monotonic function of the degree of preservation of the speech spectral dynamics. The accuracy of the speech spectral dynamics can then be traded against the power of the rendered speech signal. We can either maximize the dynamics accuracy given the signal power, or minimize the signal power given the dynamics accuracy. In our implementation, the spectral dynamics is quantified as the difference of the mel cepstra between time frames of the speech signal. We compared the speech rendered by our implementation against both natural speech and a reference method, for the scenario where signal power is minimized given a target dynamics accuracy, and observed a significantly improved intelligibility. The low system delay, and the low complexity and memory requirements make the new method particularly suitable for real-time applications.
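The definition of spectral dynamics as frame-to-frame differences of the mel cepstra can be sketched as follows. The mean squared error used to score dynamics accuracy is an assumption made for this illustration; the abstract does not specify the measure, and the function names are hypothetical.

```python
def spectral_dynamics(cepstra):
    """Frame-to-frame differences of a cepstral sequence.

    cepstra: list of frames, each a list of mel-cepstral coefficients.
    Returns one difference vector per pair of consecutive frames.
    """
    return [[c1 - c0 for c0, c1 in zip(prev, cur)]
            for prev, cur in zip(cepstra, cepstra[1:])]

def dynamics_distortion(clean_cepstra, noisy_cepstra):
    """Mean squared deviation between clean and noisy spectral dynamics.

    Assumed measure for illustration; lower means better preserved dynamics.
    """
    d_clean = spectral_dynamics(clean_cepstra)
    d_noisy = spectral_dynamics(noisy_cepstra)
    squared = [(a - b) ** 2
               for frame_c, frame_n in zip(d_clean, d_noisy)
               for a, b in zip(frame_c, frame_n)]
    return sum(squared) / len(squared)
```

A constant offset applied to every frame leaves the dynamics unchanged, whereas noise that flattens the frame-to-frame variation raises the distortion. That flattening is the kind of degradation the method trades against signal power.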

Keyword
speech intelligibility, spectral dynamics
National Category
Signal Processing
Identifiers
URN: urn:nbn:se:kth:diva-145640
Scopus ID: 2-s2.0-84906213507
Conference
Interspeech 2013
Note

QC 20140523

Available from: 2014-05-23 Created: 2014-05-23 Last updated: 2014-05-23 Bibliographically approved
5. Spectral Dynamics Recovery for Enhanced Speech Intelligibility in Noise
2015 (English). In: IEEE/ACM Transactions on Audio, Speech, and Language Processing, ISSN 2329-9290, Vol. 23, no. 2, pp. 327-338. Article in journal (Refereed), Published
Abstract [en]

Speech intelligibility in noisy environments decreases with an increase in the noise power. We hypothesize that the differences of subsequent short-term spectra of speech, which we collectively refer to as the speech spectral dynamics, can be used to characterize speech intelligibility. We propose a distortion measure to characterize the deviation of the dynamics of the noisy modified speech from the dynamics of natural speech. Optimizing this distortion measure, we derive a parametric relationship between the signal band-power before and after modification. The parametric nature of the solution ensures adaptation to the noise level, the speech statistics and a penalty on the power gain. A multi-band speech modification system based on the single-band optimal solution is designed under a total signal power constraint and evaluated in selected noise conditions. The results indicate that the proposed approach compares favorably to a reference method based on optimizing a measure of the speech intelligibility index. Very low computational complexity and high intelligibility gain make this an attractive approach for speech modification in a wide range of application scenarios.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2015
Keyword
Environment adaptation, speech intelligibility enhancement, speech modification
National Category
Signal Processing
Identifiers
URN: urn:nbn:se:kth:diva-145641
DOI: 10.1109/TASLP.2014.2384271
ISI: 000348210300009
Scopus ID: 2-s2.0-84921651956
Note

Updated from "Pre-print" to "Article" QC 20150227

Available from: 2014-05-23 Created: 2014-05-23 Last updated: 2017-01-11 Bibliographically approved

Open Access in DiVA

Thesis (600 kB)