From Acoustics to Articulation: Study of the acoustic-articulatory relationship along with methods to normalize and adapt to variations in production across different speakers
2011 (English). Doctoral thesis, monograph (Other academic)
The focus of this thesis is the relationship between the articulation of speech and the acoustics of produced speech. There are several problems that are encountered in understanding this relationship, given the non-linearity, variance and non-uniqueness in the mapping, as well as the differences that exist in the size and shape of the articulators, and consequently the acoustics, for different speakers. The thesis covers mainly four topics pertaining to the articulation and acoustics of speech.
The first part of the thesis deals with variations among different speakers in the articulation of phonemes. While the speakers differ physically in the shape of their articulators and vocal tracts, the study tries to extract articulation strategies that are common to different speakers. Using multi-way linear analysis methods, the study extracts articulatory parameters which can be used to estimate unknown articulations of phonemes made by one speaker; knowing other articulations made by the same speaker and those unknown articulations made by other speakers of the language. At the same time, a novel method to select the number of articulatory model parameters, as well as the articulations that are representative of a speaker’s articulatory repertoire, is suggested.
The second part is devoted to the study of uncertainty in the acoustic-to-articulatory mapping, specifically non-uniqueness in the mapping. Several studies in the past have shown that human beings are capable of producing a given phoneme using non-unique articulatory configurations, when the articulators are constrained. This was also demonstrated by synthesizing sounds using theoretical articulatory models. The studies in this part of the thesis investigate the existence of non-uniqueness in unconstrained read speech. This is carried out using a database of acoustic signals recorded synchronously along with the positions of electromagnetic coils placed on selected points on the lips, jaws, tongue and velum. This part, thus, largely devotes itself to describing techniques that can be used to study non-uniqueness in the statistical sense, using such a database. The results indicate that the acoustic vectors corresponding to some frames in all the phonemes in the database can be mapped onto non-unique articulatory distributions. The predictability of these non-unique frames is investigated, along with verifying whether applying continuity constraints can resolve this non-uniqueness.
The third part proposes several novel methods of looking at acoustic-articulatory relationships in the context of acoustic-to-articulatory inversion. The proposed methods include explicit modeling of non-uniqueness using cross-modal Gaussian mixture modeling, as well as modeling the mapping as local regressions. Another innovative approach towards the mapping problem has also been described in the form of relating articulatory and acoustic gestures. Definitions and methods to obtain such gestures are presented along with an analysis of the gestures for different phoneme types. The relationship between the acoustic and articulatory gestures is also outlined. A method to conduct acoustic-to-articulatory inverse mapping is also suggested, along with a method to evaluate it. An application of acoustic-to-articulatory inversion to improve speech recognition is also described in this part of the thesis.
The final part of the thesis deals with problems related to modeling infants acquiring the ability to speak; the model utilizing an articulatory synthesizer adapted to infant vocal tract sizes. The main problem addressed is related to modeling how infants acquire acoustic correlates that are normalized between infants and adults. A second problem of how infants decipher the number of degrees of articulatory freedom is also partially addressed. The main contribution is a realistic model which shows how an infant can learn the mapping between the acoustics produced during the babbling phase and the acoustics heard from the adults. The knowledge required to map corresponding adult-infant speech sounds is shown to be learnt without the total number of categories or one-one correspondences being specified explicitly. Instead, the model learns these features indirectly based on an overall approval rating, provided by a simulation of adult perception, on the basis of the imitation of adult utterances by the infant model.
Thus, the thesis tries to cover different aspects of the relationship between articulation and acoustics of speech in the context of variations for different speakers and ages. Although not providing complete solutions, the thesis proposes novel directions for approaching the problem, with pointers to solutions in some contexts.
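As a rough illustration of what "modeling the mapping as local regressions" can mean in acoustic-to-articulatory inversion, the sketch below fits an affine map on the nearest training frames around each acoustic query frame. It is not the thesis's implementation: the function name `local_linear_inversion`, the neighbourhood size `k`, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def local_linear_inversion(acoustic_train, artic_train, acoustic_query, k=20):
    """Estimate an articulatory vector for one acoustic frame by fitting an
    affine map on the k nearest training frames (hypothetical helper)."""
    # Euclidean distance from the query frame to every training frame
    dists = np.linalg.norm(acoustic_train - acoustic_query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Design matrix with a bias column so the local fit is affine
    A = np.hstack([acoustic_train[nearest], np.ones((len(nearest), 1))])
    coeffs, *_ = np.linalg.lstsq(A, artic_train[nearest], rcond=None)
    # Apply the local affine map to the query frame
    return np.append(acoustic_query, 1.0) @ coeffs

# Synthetic stand-in data (not from the thesis): 2-D "acoustic" frames mapped
# to 1-D "articulatory" positions through a smooth non-linear function.
rng = np.random.default_rng(0)
acoustic = rng.uniform(-1.0, 1.0, size=(2000, 2))
artic = np.sin(2.0 * acoustic[:, :1]) + 0.5 * acoustic[:, 1:] ** 2

query = np.array([0.3, -0.4])
estimate = local_linear_inversion(acoustic, artic, query, k=20)
truth = np.sin(2.0 * 0.3) + 0.5 * (-0.4) ** 2
print(estimate, truth)
```

A single local fit of this kind returns one articulatory estimate per frame, so by itself it does not represent the non-uniqueness discussed in the abstract; that would need a model that can return several candidate configurations for the same acoustic frame.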
Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2011. xxx, 250 p.
Trita-CSC-A, ISSN 1653-5723 ; 2011:23
Acoustic-Articulatory relationship, speaker normalization
Signal Processing; Media and Communication Technology
Identifiers: URN: urn:nbn:se:kth:diva-52899; ISBN: 978-91-7501-215-5; OAI: oai:DiVA.org:kth-52899; DiVA: diva2:468190
2012-01-27, F3, Lindstedtsvägen 26, KTH, Stockholm, 10:00 (English)
Bailly, Gérard, Director of Research
Engwall, Olov, Docent
Projects: Computer-Animated language Teachers (CALATea), Audio-Visual Speech Inversion (ASPI)
Funder: EU, European Research Council, 021324; Swedish Research Council, 80449001
QC 20111222. 2011-12-22, 2011-12-20, 2011-12-22. Bibliographically approved