Modelling Paralinguistic Conversational Interaction: Towards social awareness in spoken human-machine dialogue
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. (Tal)
2012 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Parallel with the orthographic streams of words in conversation are multiple layered epiphenomena, short in duration and with a communicative purpose. These paralinguistic events regulate the interaction flow via gaze, gestures and intonation. This thesis focuses on how to compute, model, discover and analyze prosody and its applications for spoken dialogue systems. Specifically, it addresses automatic classification and analysis of conversational cues related to turn-taking, brief feedback, affective expressions and their cross-relationships, as well as their cognitive and neurological basis. Techniques are proposed for instantaneous and suprasegmental parameterization of scalar and vector valued representations of fundamental frequency, but also intensity and voice quality. Examples are given for how to engineer supervised learned automata for off-line processing of conversational corpora as well as for incremental on-line processing with low-latency constraints, suitable as detector modules in a responsive social interface. Specific attention is given to the communicative functions of vocal feedback like "mhm", "okay" and "yeah, that’s right", as postulated by the theories of grounding, emotion and a survey of laymen opinions. The potential functions and their prosodic cues are investigated via automatic decoding, data-mining, exploratory visualization and descriptive measurements.
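The suprasegmental parameterization mentioned in the abstract can be illustrated with a toy example. The snippet below is a minimal sketch, not the thesis's actual implementation: it compresses a voiced-segment F0 contour into a fixed number of discrete cosine coefficients, giving a duration-invariant representation. All function names are illustrative.

```python
import numpy as np
from scipy.fftpack import dct, idct

def suprasegmental_params(f0, n_coeff=4):
    """Compress a voiced-segment F0 contour (Hz) into a fixed-length
    vector of DCT coefficients of log-F0; the vector length does not
    depend on the segment duration."""
    logf0 = np.log(np.asarray(f0, dtype=float))
    return dct(logf0, type=2, norm='ortho')[:n_coeff]

def reconstruct(coeffs, length):
    """Invert the truncated DCT back to a smoothed F0 contour (Hz)."""
    full = np.zeros(length)
    full[:len(coeffs)] = coeffs
    return np.exp(idct(full, type=2, norm='ortho'))
```

Truncating to a handful of coefficients keeps the slow suprasegmental shape of the contour while discarding frame-level jitter.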

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2012. xiv, 86 p.
Series
Trita-CSC-A, ISSN 1653-5723 ; 2012:08
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:kth:diva-102335
ISBN: 978-91-7501-467-8 (print)
OAI: oai:DiVA.org:kth-102335
DiVA: diva2:552376
Public defence
2012-09-28, Sal F3, Lindstedtsvägen 26, KTH, Stockholm, 13:00 (English)
Opponent
Supervisors
Note

QC 20120914

Available from: 2012-09-14. Created: 2012-09-14. Last updated: 2012-09-14. Bibliographically approved.
List of papers
1. Tracking pitch contours using minimum jerk trajectories
2011 (English). In: INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, 2011, pp. 2056-2059. Conference paper, Published paper (Refereed)
Abstract [en]

This paper proposes a fundamental frequency tracker, with the specific purpose of comparing the automatic estimates with pitch contours that are sketched by trained phoneticians. The method uses a frequency domain approach to estimate pitch tracks that form minimum jerk trajectories, thereby mimicking the motor movements of the hand made while sketching. When the fundamental frequencies tracked by the proposed method on the oral and laryngograph signals of the MOCHA-TIMIT database were compared, the correlation was 0.98 and the root mean squared error was 4.0 Hz, slightly better than a state-of-the-art pitch tracking algorithm included in the ESPS. We also demonstrate how the proposed algorithm can be applied when comparing with sketches made by phoneticians for the variations in accent II among the Swedish dialects.
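The reported agreement figures (correlation 0.98, RMSE 4.0 Hz) correspond to standard track-comparison measures. The sketch below is a hedged illustration rather than the paper's code: it computes both measures over the frames where two tracks are both voiced; the function name is an assumption.

```python
import numpy as np

def track_agreement(f0_est, f0_ref):
    """Pearson correlation and RMSE (Hz) between two pitch tracks,
    evaluated only on frames where both tracks are voiced (f0 > 0)."""
    est = np.asarray(f0_est, dtype=float)
    ref = np.asarray(f0_ref, dtype=float)
    voiced = (est > 0) & (ref > 0)
    e, r = est[voiced], ref[voiced]
    corr = np.corrcoef(e, r)[0, 1]
    rmse = float(np.sqrt(np.mean((e - r) ** 2)))
    return corr, rmse
```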

Keyword
pitch tracking, Constant-Q, Swedish accent II
National Category
Computer Science Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52192 (URN)
000316502201003 ()
2-s2.0-84865794085 (Scopus ID)
978-1-61839-270-1 (ISBN)
Conference
INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association. Florence, Italy. 28-31 August 2011
Note

tmh_import_11_12_14. QC 20111222

Available from: 2011-12-14. Created: 2011-12-14. Last updated: 2014-01-16. Bibliographically approved.
2. Exploring the implications for feedback of a neurocognitive theory of overlapped speech
2012 (English). In: Proceedings of Workshop on Feedback Behaviors in Dialog, 2012, pp. 57-60. Conference paper, Poster (with or without abstract) (Refereed)
Keyword
feedback, functions of feedback, goal driven categories, taxonomy
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-102329 (URN)
Conference
The Interdisciplinary Workshop on Feedback Behaviors in Dialog
Projects
SAMSYNT, IURO
Funder
Swedish Research Council, 2009-4291
EU, European Research Council, FP7 – 248314
Note

QC 20120914

Available from: 2012-09-13. Created: 2012-09-13. Last updated: 2012-09-14. Bibliographically approved.
3. Semi-supervised methods for exploring the acoustics of simple productive feedback
2013 (English). In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 55, no. 3, pp. 451-469. Article in journal (Refereed). Published.
Abstract [en]

This paper proposes methods for exploring acoustic correlates to feedback functions. A sub-language of Swedish, simple productive feedback, is introduced to facilitate investigations of the functional contributions of base tokens, phonological operations and prosody. The function of feedback is to convey the listeners' attention, understanding and affective states. In order to handle the large number of possible affective states, the current study starts by performing a listening experiment where humans annotated the functional similarity of feedback tokens with different prosodic realizations. By selecting a set of stimuli that had different prosodic distances from a reference token, it was possible to compute a generalised functional distance measure. The resulting generalised functional distance measure was shown to correlate with prosodic distance, but the correlations varied as a function of base tokens and phonological operations. In a subsequent listening test, a small representative sample of feedback tokens was rated for understanding, agreement, interest, surprise and certainty. These ratings were found to explain a significant proportion of the generalised functional distance. By combining the acoustic analysis with an explorative visualisation of the prosody, we have established a map between human perception of similarity between feedback tokens, their measured distance in acoustic space, and the link to the perception of the function of feedback tokens with varying realisations.

Keyword
social signal processing, affective annotation, feedback modeling, grounding
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-102334 (URN)
10.1016/j.specom.2012.12.007 (DOI)
000316837000005 ()
2-s2.0-84875460872 (Scopus ID)
Projects
SAMSYNT, IURO
Funder
Swedish Research Council, 2009-4291
EU, European Research Council, FP7 – 248314
Note

QC 20130508

Available from: 2012-09-14. Created: 2012-09-14. Last updated: 2017-12-07. Bibliographically approved.
4. Prosodic cues to engagement in non-lexical response tokens in Swedish
2010 (English). In: Proceedings of DiSS-LPSS Joint Workshop 2010, Tokyo, Japan, 2010. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Tokyo, Japan, 2010
National Category
Computer Science Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52148 (URN)
Conference
DiSS-LPSS Joint Workshop 2010, University of Tokyo, Japan, September 25-26, 2010
Note
tmh_import_11_12_14. QC 20120125
Available from: 2011-12-14. Created: 2011-12-14. Last updated: 2012-09-14. Bibliographically approved.
5. Towards letting machines humming in the right way: prosodic analysis of six functions of short feedback tokens in English
2012 (English). In: Proceedings of Fonetik, 2012. Conference paper, Oral presentation only (Other academic)
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-102330 (URN)
Conference
Fonetik
Projects
IURO, SAMSYNT
Funder
EU, European Research Council, FP7 – 248314
Swedish Research Council, 2009-4291
ICT - The Next Generation
Note

QC 20120914

Available from: 2012-09-14. Created: 2012-09-14. Last updated: 2013-04-15. Bibliographically approved.
6. Cues to perceived functions of acted and spontaneous feedback expressions
2012 (English). In: Proceedings of the Interdisciplinary Workshop on Feedback Behaviors in Dialog, 2012, pp. 53-56. Conference paper, Poster (with or without abstract) (Refereed)
Abstract [en]

We present a two-step study where the first part aims to determine the phonemic prior bias (conditioned on “ah”, “m-hm”, “m-m”, “n-hn”, “oh”, “okay”, “u-hu”, “yeah” and “yes”) in subjects' perception of six feedback functions (acknowledgment, continuer, disagreement, surprise, enthusiasm and uncertainty). The results showed a clear phonemic prior bias for some tokens, e.g. “ah” and “oh” are commonly interpreted as surprise but “yeah” and “yes” less so. The second part aims to examine determinants of judged typicality, or graded structure, within the six functions of “okay”. Typicality was correlated to four determinants: prosodic central tendency within the function (CT); phonemic prior bias as an approximation to frequency instantiation (FI); the posterior, i.e. CT x FI; and judged ideality (ID), i.e. similarity to ideals associated with the goals served by its function. The results tentatively suggest that acted expressions are more effectively communicated and that the functions of feedback to a greater extent constitute goal-based categories determined by ideals, and to a lesser extent a taxonomy determined by CT and FI. However, it is possible to automatically predict typicality with a correlation of r = 0.52 via the posterior.

Keyword
feedback, functions of feedback, goal driven categories, taxonomy
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-102333 (URN)
Conference
The Interdisciplinary Workshop on Feedback Behaviors in Dialog
Projects
SAMSYNTIURO
Funder
Swedish Research Council, 2009-4291
EU, European Research Council, FP7 – 248314
ICT - The Next Generation
Note

QC 20120914

Available from: 2012-09-14. Created: 2012-09-14. Last updated: 2013-04-15. Bibliographically approved.
7. Predicting Speaker Changes and Listener Responses With And Without Eye-contact
2011 (English). In: INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, 2011, pp. 1576-1579. Conference paper, Published paper (Refereed)
Abstract [en]

This paper compares turn-taking in terms of timing and prediction in human-human conversations under the condition where participants have eye-contact versus where there is no eye-contact, as found in the HCRC Map Task corpus. By measuring between-speaker intervals, it was found that a larger proportion of speaker shifts occurred in overlap for the no eye-contact condition. For prediction we used prosodic and spectral features parametrized by time-varying length-invariant discrete cosine coefficients. With Gaussian mixture modeling and variations of classifier fusion schemes, we explored the task of predicting whether there is an upcoming speaker change (SC) or not (HOLD) at the end of an utterance (EOU), with a pause lag of 200 ms. The label SC was further split into LRs (listener responses, e.g. back-channels) and other TURN-SHIFTs. The prediction was found to be somewhat easier for the eye-contact condition, for which the average recall rates were 60.57%, 66.35%, and 62.00% for TURN-SHIFTs, LR and SC respectively.
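The prediction step described above rests on class-conditional Gaussian models of the feature vectors. As a hedged, self-contained illustration (a single-Gaussian special case of the GMMs used in the paper, with an invented class name), a minimal diagonal-covariance classifier might look like:

```python
import numpy as np

class DiagGaussianClassifier:
    """Class-conditional diagonal Gaussian classifier: the one-component
    special case of a GMM classifier, trained by closed-form moments."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.stats_ = {}
        for c in self.classes_:
            Xc = X[y == c]
            # Per-class mean, variance (floored for stability) and log prior.
            self.stats_[c] = (Xc.mean(0), Xc.var(0) + 1e-6,
                              np.log(len(Xc) / len(X)))
        return self

    def predict(self, X):
        scores = []
        for c in self.classes_:
            mu, var, logprior = self.stats_[c]
            # Diagonal Gaussian log-likelihood plus log prior.
            ll = -0.5 * np.sum(np.log(2 * np.pi * var)
                               + (X - mu) ** 2 / var, axis=1)
            scores.append(ll + logprior)
        return self.classes_[np.argmax(scores, axis=0)]
```

A full GMM would mix several such components per class; classifier fusion, as in the paper, would then combine scores across feature streams.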

Place, publisher, year, edition, pages
Florence, Italy, 2011
Keyword
Turn-taking, Back-channels
National Category
Computer Science Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52195 (URN)
000316502200396 ()
2-s2.0-84865794088 (Scopus ID)
978-1-61839-270-1 (ISBN)
Conference
INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy
Note

tmh_import_11_12_14 QC 20111216

Available from: 2011-12-14. Created: 2011-12-14. Last updated: 2014-01-16. Bibliographically approved.
8. Continuous Interaction with a Virtual Human
2011 (English). In: Journal on Multimodal User Interfaces, ISSN 1783-7677, E-ISSN 1783-8738, Vol. 4, no. 2, pp. 97-118. Article in journal (Refereed). Published.
Abstract [en]

This paper presents our progress in developing a Virtual Human capable of being an attentive speaker. Such a Virtual Human should be able to attend to its interaction partner while it is speaking, and modify its communicative behavior on the fly based on what it observes in the behavior of its partner. We report new developments concerning a number of aspects, such as scheduling and interrupting multimodal behavior, automatic classification of listener responses, generation of response eliciting behavior, and strategies for generating appropriate reactions to listener responses. On the basis of this progress, a task-based setup for a responsive Virtual Human was implemented to carry out two user studies, the results of which are presented and discussed in this paper.

Keyword
Attentive speaking, Continuous interaction, Listener responses, Virtual humans
National Category
Computer Science Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52194 (URN)
10.1007/s12193-011-0060-x (DOI)
000309997100004 ()
2-s2.0-80955180056 (Scopus ID)
Note

tmh_import_11_12_14. QC 20111215

Available from: 2011-12-14. Created: 2011-12-14. Last updated: 2017-12-08. Bibliographically approved.
9. A Dual Channel Coupled Decoder for Fillers and Feedback
2011 (English). In: INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, 2011, pp. 3097-3100. Conference paper, Published paper (Refereed)
Abstract [en]

This study presents a dual channel decoder capable of modeling cross-speaker dependencies for segmentation and classification of fillers and feedback in conversational speech found in the DEAL corpus. For the same number of Gaussians per state, we have shown improvement in terms of average F-score for the successive addition of 1) an increased frame rate from 10 ms to 50 ms, 2) Joint Maximum Cross-Correlation (JMXC) features in a single channel decoder, 3) a joint transition matrix which captures dependencies symmetrically across the two channels, and 4) coupled acoustic model retraining symmetrically across the two channels. The final step gives a relative improvement of over 100% for fillers and feedback compared to our previously published results. The F-scores are in a range that makes it possible to use the decoder as both a voice activity detector and an illocutionary act decoder for semi-automatic annotation.

Keyword
Conversation, Coupled hidden Markov models, Cross-speaker modeling, Feedback, Filler
National Category
Computer Science Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52193 (URN)
000316502201265 ()
2-s2.0-8486579156 (Scopus ID)
978-1-61839-270-1 (ISBN)
Conference
INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association. Florence, Italy. 28-31 August 2011
Note

tmh_import_11_12_14. QC 20111222

Available from: 2011-12-14. Created: 2011-12-14. Last updated: 2014-01-15. Bibliographically approved.
10. The Prosody of Swedish Conversational Grunts
2010 (English). In: 11th Annual Conference of the International Speech Communication Association: Spoken Language Processing for All, INTERSPEECH 2010, 2010, pp. 2562-2565. Conference paper, Published paper (Refereed)
Abstract [en]

This paper explores conversational grunts in a face-to-face setting. The study investigates the prosody and turn-taking effect of fillers and feedback tokens that have been annotated for attitudes. The grunts were selected from the DEAL corpus and automatically annotated for their turn-taking effect. A novel suprasegmental prosodic signal representation and contextual timing features are used for classification and visualization. Classification results using linear discriminant analysis show that turn-initial feedback tokens lose some of their attitude-signaling prosodic cues compared to non-overlapping continuer feedback tokens. Turn-taking effects can be predicted well above chance level, except for simultaneous starts. However, feedback tokens before places where both speakers take the turn were more similar to feedback continuers than to turn-initial feedback tokens.
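The linear discriminant analysis used for classification above can be sketched in its simplest two-class Fisher form. This is a generic illustration under the usual LDA assumptions, not the study's actual pipeline; function names are invented.

```python
import numpy as np

def fisher_lda_fit(X0, X1):
    """Two-class Fisher LDA: a projection direction maximizing
    between-class separation relative to within-class scatter."""
    mu0, mu1 = X0.mean(0), X1.mean(0)
    # Pooled within-class scatter matrix (covariance un-normalized).
    Sw = np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)
    # Small ridge keeps the solve well-conditioned.
    w = np.linalg.solve(Sw + 1e-6 * np.eye(len(mu0)), mu1 - mu0)
    thresh = w @ (mu0 + mu1) / 2  # midpoint between projected means
    return w, thresh

def fisher_lda_predict(X, w, thresh):
    """Label 1 if the projection falls on the class-1 side."""
    return (X @ w > thresh).astype(int)
```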

Keyword
prosody, fillers, feedback, suprasegmental, conversational grunts
National Category
Computer Science Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52141 (URN)
000313086500255 ()
2-s2.0-79959844001 (Scopus ID)
978-1-61782-123-3 (ISBN)
Conference
INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association. Makuhari, Chiba. 26 September 2010 - 30 September 2010
Note

tmh_import_11_12_14. QC 20111222

Available from: 2011-12-14. Created: 2011-12-14. Last updated: 2014-01-09. Bibliographically approved.
11. Intra-, Inter-, and Cross-cultural Classification of Vocal Affect
2011 (English). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Florence, Italy, 2011, pp. 1592-1595. Conference paper, Published paper (Refereed)
Abstract [en]

We present intra-, inter- and cross-cultural classifications of vocal expressions. Stimuli were selected from the VENEC corpus and consisted of portrayals of 11 emotions, each expressed with 3 levels of intensity. Classification (nu-SVM) was based on acoustic measures related to pitch, intensity, formants, voice source and duration. Results showed that mean recall across emotions was around 2.4-3 times higher than chance level for both intra- and inter-cultural conditions. For cross-cultural conditions, the relative performance dropped 26%, 32%, and 34% for high, medium, and low emotion intensity, respectively. This suggests that intra-cultural models were more sensitive to mismatched conditions for low emotion intensity. Preliminary results further indicated that recall rate varied as a function of emotion, with lust and sadness showing the smallest performance drops in the cross-cultural condition.
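The paper reports mean recall across emotions relative to chance level. Below is a minimal sketch of that evaluation metric, unweighted average recall with chance at one over the number of classes; the helper name is illustrative, not from the paper.

```python
import numpy as np

def average_recall(y_true, y_pred):
    """Unweighted average recall (UAR) across classes, and the
    corresponding chance level (one over the number of classes)."""
    classes = np.unique(y_true)
    # Per-class recall: fraction of that class's items predicted correctly.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls)), 1.0 / len(classes)
```

With 11 emotion classes, chance is about 9.1%, so "2.4-3 times higher than chance" corresponds to UARs of roughly 22-27%.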

Place, publisher, year, edition, pages
Florence, Italy, 2011
Keyword
emotion, affect, cross-cultural
National Category
Computer Science Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52191 (URN)
000316502200400 ()
2-s2.0-84865794836 (Scopus ID)
978-1-61839-270-1 (ISBN)
Conference
12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011; Florence; Italy; 27 August 2011 through 31 August 2011
Note

tmh_import_11_12_14 QC 20111219

Available from: 2011-12-14. Created: 2011-12-14. Last updated: 2014-01-16. Bibliographically approved.
12. Emotion Recognition
2009 (English). In: Computers in the Human Interaction Loop / [ed] Waibel, A.; Stiefelhagen, R., Berlin/Heidelberg: Springer, 2009, pp. 96-105. Chapter in book (Refereed)
Place, publisher, year, edition, pages
Berlin/Heidelberg: Springer, 2009
National Category
Computer Science Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52064 (URN)
10.1007/978-1-84882-054-8_10 (DOI)
978-1-84882-053-1 (ISBN)
Note
tmh_import_11_12_14. QC 20111222
Available from: 2011-12-14. Created: 2011-12-14. Last updated: 2012-09-14. Bibliographically approved.

Open Access in DiVA

Thesis (1611 kB)
File information
File name: FULLTEXT02.pdf
File size: 1611 kB
Checksum: SHA-512
2247b2ac8f05358ea8e9f1d207a184d20cb7a32d433c190af0c3aafb1c552b1721db96f84b415b77726e9ae8d2881e9de2b6881c06485a5e5d6351826d71065a
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Neiberg, Daniel
By organisation
Speech Communication and Technology
Language Technology (Computational Linguistics)
