Digitala Vetenskapliga Arkivet

Modelling Engagement in Multi-Party Conversations: Data-Driven Approaches to Understanding Human-Human Communication Patterns for Use in Human-Robot Interactions
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-8273-0132
2016 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

The aim of this thesis is to study human-human interaction in order to provide virtual agents and robots with the capability to engage in multi-party conversations in a human-like manner. The focus lies on the modelling of conversational dynamics and the appropriate realization of multi-modal feedback behaviour. For such an undertaking, it is important to understand how human-human communication unfolds in varying contexts and constellations over time. To this end, multi-modal human-human corpora are designed, and annotation schemes to capture conversational dynamics are developed. Multi-modal analyses are carried out and models are built. Emphasis is put not on modelling speaker behaviour in general but on modelling listener behaviour in particular.

In this thesis, a bridge is built between multi-modal modelling of conversational dynamics on the one hand and multi-modal generation of listener behaviour in virtual agents and robots on the other. In order to build this bridge, unit-selection multi-modal synthesis of feedback is carried out, as well as statistical speech synthesis of feedback. The effect of variation in the prosody of feedback tokens on the perception of third-party observers is evaluated. Finally, the effect of a controlled variation of eye-gaze is evaluated, as is the perception of user feedback in human-robot interaction.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2016, p. 87
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2017:05
National Category
Engineering and Technology
Research subject
Human-computer Interaction
Identifiers
URN: urn:nbn:se:kth:diva-198175
ISBN: 978-91-7729-237-1 (print)
OAI: oai:DiVA.org:kth-198175
DiVA, id: diva2:1055854
Public defence
2017-01-20, F3, Lindstedtsvägen 26, Kungl Tekniska högskolan, Stockholm, 13:00 (English)
Opponent
Supervisors
Note

QC 20161214

Available from: 2016-12-14 Created: 2016-12-13 Last updated: 2022-06-27 Bibliographically approved
List of papers
1. D64: A corpus of richly recorded conversational interaction
2013 (English) In: Journal on Multimodal User Interfaces, ISSN 1783-7677, E-ISSN 1783-8738, Vol. 7, no. 1-2, p. 19-28. Article in journal (Refereed) Published
Abstract [en]

In recent years there has been a substantial debate about the need for increasingly spontaneous, conversational corpora of spoken interaction that are not controlled or task directed. In parallel, the need has arisen to record multi-modal corpora that are not restricted to the audio domain alone. With a corpus that fulfils both needs, it would be possible to investigate the natural coupling not only in turn-taking and voice, but also in the movement of participants. In the following paper we describe the design and recording of such a corpus, and we provide some illustrative examples of how such a corpus might be exploited in the study of dynamic interaction. The D64 corpus is a multimodal corpus recorded over two successive days. Each day resulted in approximately 4 h of recordings. In total five participants took part in the recordings, of whom two were female and three were male. Seven video cameras were used, of which at least one was trained on each participant. The Optitrack motion capture kit was used to enrich the recordings with motion data. The D64 corpus comprises annotations on conversational involvement, speech activity and pauses, as well as information on the average degree of change in the movement of participants.

Keywords
Multimodality corpus, Conversational involvement, Spontaneous speech
National Category
Computer Sciences Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-109373 (URN), 10.1007/s12193-012-0108-6 (DOI), 000316062300003, 2-s2.0-84874773796 (Scopus ID)
Funder
Swedish Research Council, 2009-1766
Note

QC 20130415

Available from: 2013-01-02 Created: 2013-01-02 Last updated: 2025-02-01 Bibliographically approved
2. A Gaze-based Method for Relating Group Involvement to Individual Engagement in Multimodal Multiparty Dialogue
2013 (English) In: ICMI 2013 - Proceedings of the 2013 ACM International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2013, p. 99-106. Conference paper, Published paper (Refereed)
Abstract [en]

This paper is concerned with modelling individual engagement and group involvement, as well as their relationship, in an eight-party, multimodal corpus. We propose a number of features (presence, entropy, symmetry and maxgaze) that summarise different aspects of eye-gaze patterns and allow us to describe individual as well as group behaviour over time. We use these features to define similarities between the subjects, and we compare this information with the engagement rankings the subjects expressed about themselves and the other participants at the end of each interaction. We analyse how these features relate to four classes of group involvement and build a classifier that is able to distinguish between those classes with 71% accuracy.
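
The gaze-based features named in the abstract (presence, entropy, symmetry, maxgaze) can be illustrated with a small sketch. The abstract does not give their exact definitions, so the Python below is only an assumed, simplified reading of those feature names over a window of per-frame gaze-target labels; symmetry is omitted because it would require paired data from two participants.

```python
# Minimal sketch (not the paper's implementation): assumed, simplified versions
# of three of the gaze features named in the abstract, computed over a window
# of per-frame gaze-target labels for one subject.
from collections import Counter
from math import log2

def gaze_features(gaze_targets, subject, participants):
    """gaze_targets: one label per frame, e.g. another participant's id or 'away'."""
    n = len(gaze_targets)
    counts = Counter(gaze_targets)

    # presence: fraction of frames in which the subject looks at any other participant
    presence = sum(counts[p] for p in participants if p != subject) / n

    # entropy: how evenly the subject's gaze is spread over the observed targets
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * log2(p) for p in probs if p > 0)

    # maxgaze: share of the window spent on the single most-attended target
    maxgaze = max(counts.values()) / n

    return {"presence": presence, "entropy": entropy, "maxgaze": maxgaze}

# Example: one subject's gaze labels over a short window of 8 frames
window = ["P2", "P2", "P3", "away", "P2", "P4", "P3", "P2"]
print(gaze_features(window, subject="P1", participants=["P1", "P2", "P3", "P4"]))
```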

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2013
Keywords
interaction, gaze, engagement, involvement, classification
National Category
Computer Sciences Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-137417 (URN), 10.1145/2522848.2522865 (DOI), 000380272900018, 2-s2.0-84892620374 (Scopus ID), 978-1-4503-2129-7 (ISBN)
Conference
2013 15th ACM International Conference on Multimodal Interaction, ICMI 2013; Sydney, NSW; Australia; 9 December 2013 through 13 December 2013
Note

QC 20130128

Available from: 2013-12-13 Created: 2013-12-13 Last updated: 2025-02-01 Bibliographically approved
3. Deciphering the Silent Participant: On the Use of Audio-Visual Cues for the Classification of Listener Categories in Group Discussions
2015 (English) In: Proceedings of ICMI 2015, ACM Digital Library, 2015. Conference paper, Published paper (Refereed)
Abstract [en]

Estimating a silent participant's degree of engagement and role within a group discussion can be challenging, as there are no speech-related cues available at the given time. Having this information available, however, can provide important insights into the dynamics of the group as a whole. In this paper, we study the classification of listeners into several categories (attentive listener, side participant and bystander). We devised a thin-sliced perception test in which subjects were asked to assess listener roles and engagement levels in 15-second video clips taken from a corpus of group interviews. The results show that humans are usually able to assess silent participant roles. Using these annotations together with a set of multimodal low-level features, such as past speaking activity, backchannels (both visual and verbal) and gaze patterns, we could identify the features which are able to distinguish between different listener categories. Moreover, the results show that many of the audio-visual effects observed on listeners in dyadic interactions also hold for multi-party interactions. A preliminary classifier achieves an accuracy of 64%.
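
As a rough illustration of the classification step described above, the sketch below trains a toy classifier on the kinds of low-level features the abstract lists (past speaking activity, verbal and visual backchannels, gaze patterns). The feature encodings, toy values and the choice of an SVM are assumptions made for illustration; the paper's actual feature extraction and classifier are not reproduced here.

```python
# Toy sketch (an assumption, not the paper's pipeline): classify silent
# participants into listener categories from illustrative low-level features.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

LABELS = ["attentive listener", "side participant", "bystander"]

# Each row describes one 15-second slice of a silent participant:
# [past speaking ratio, verbal backchannels, visual backchannels,
#  proportion of time gazing at the current speaker]  (toy values)
X = np.array([
    [0.05, 3, 2, 0.80],
    [0.02, 2, 3, 0.75],
    [0.20, 1, 1, 0.55],
    [0.15, 0, 1, 0.40],
    [0.00, 0, 0, 0.10],
    [0.01, 0, 0, 0.20],
])
y = np.array([0, 0, 1, 1, 2, 2])  # indices into LABELS

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)

# Classify a new, unseen slice
new_slice = np.array([[0.03, 2, 2, 0.70]])
print(LABELS[clf.predict(new_slice)[0]])
```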

Place, publisher, year, edition, pages
ACM Digital Library, 2015
National Category
Computer Sciences Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-180426 (URN), 10.1145/2818346.2820759 (DOI), 000380609500018, 2-s2.0-84959309012 (Scopus ID), 978-1-4503-3912-4 (ISBN)
Conference
ICMI 2015
Note

QC 20160121

Available from: 2016-01-13 Created: 2016-01-13 Last updated: 2025-02-01 Bibliographically approved
4. Towards Building an Attentive Artificial Listener: On the Perception of Attentiveness in Feedback Utterances
2016 (English) In: Proceedings of Interspeech 2016, San Francisco, USA, 2016. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
San Francisco, USA, 2016
National Category
Computer Sciences Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-198174 (URN), 10.21437/Interspeech.2016-1274 (DOI), 000409394401288, 2-s2.0-84994246003 (Scopus ID)
Conference
Interspeech 2016
Note

QC 20161214

Available from: 2016-12-13 Created: 2016-12-13 Last updated: 2025-02-01 Bibliographically approved
5. Towards Building an Attentive Artificial Listener: On the Perception of Attentiveness in Audio-Visual Feedback Tokens
2016 (English) In: Proceedings of the 18th ACM International Conference on Multimodal Interaction (ICMI 2016), Tokyo, Japan, 2016. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Tokyo, Japan, 2016
National Category
Computer Sciences Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-198171 (URN), 10.1145/2993148.2993188 (DOI), 000390299900007, 2-s2.0-85016607242 (Scopus ID)
Conference
the 18th ACM International Conference on Multimodal Interaction (ICMI 2016)
Note

QC 20161214

Available from: 2016-12-13 Created: 2016-12-13 Last updated: 2025-02-01 Bibliographically approved
6. On Data Driven Parametric Backchannel Synthesis for Expressing Attentiveness in Conversational Agents
2016 (English) In: Proceedings of Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction (MA3HMI), satellite workshop of ICMI 2016, 2016. Conference paper, Published paper (Refereed)
Abstract [en]

In this study, we use a multi-party recording as a template for building a parametric speech synthesiser which is able to express different levels of attentiveness in backchannel tokens. This allowed us to investigate i) whether it is possible to express the same perceived level of attentiveness in synthesised as in natural backchannels; and ii) whether it is possible to increase and decrease the perceived level of attentiveness of backchannels beyond the range observed in the original corpus.
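
To picture what controlling the perceived attentiveness of a synthesised backchannel might involve, the sketch below scales prosodic parameters (pitch level, pitch range, duration) of a token as a function of a target attentiveness value. The mapping, the parameter choices and the numbers are purely illustrative assumptions; this is not the parametric synthesiser built in the paper.

```python
# Purely illustrative sketch: scale the prosody of a backchannel token as a
# function of a target attentiveness value in [0, 1]. The parameter ranges and
# the mapping are assumptions, not the synthesiser described in the paper.
import numpy as np

def scale_prosody(f0_contour_hz, duration_s, attentiveness):
    """Return a modified F0 contour and duration; attentiveness=0.5 leaves the
    token roughly unchanged."""
    a = (attentiveness - 0.5) * 2.0                  # map [0, 1] -> [-1, 1]
    mean_f0 = float(np.mean(f0_contour_hz))
    # more attentive -> slightly higher, more dynamic pitch and a longer token
    new_f0 = mean_f0 * (1 + 0.10 * a) + (f0_contour_hz - mean_f0) * (1 + 0.5 * a)
    new_duration = duration_s * (1 + 0.2 * a)
    return new_f0, new_duration

# Toy F0 contour (Hz) of a short "mhm"-like token, one value per 10 ms frame
f0 = np.array([110.0, 115.0, 120.0, 118.0, 112.0])
print(scale_prosody(f0, duration_s=0.35, attentiveness=0.9))
```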

Keywords
Attentive agents, Backchannels, Synthesis
National Category
Computer Sciences Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-198173 (URN), 10.1145/3011263.3011272 (DOI), 2-s2.0-85003674254 (Scopus ID)
Conference
Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction (MA3HMI), satellite workshop of ICMI 2016
Note

QC 20161214

Available from: 2016-12-13 Created: 2016-12-13 Last updated: 2025-02-01 Bibliographically approved
7. Gaze direction as a Back-Channel inviting Cue in Dialogue
2012 (English) In: IVA 2012 workshop on Realtime Conversational Virtual Agents, Santa Cruz, CA, USA, 2012. Conference paper, Published paper (Refereed)
Abstract [en]

In this study, we experimentally explore the relationship between gaze direction and backchannels in face-to-face interaction. The overall motivation is to use gaze direction in a virtual agent as a means to elicit user feedback. The relationship between gaze and backchannels was tested in an experiment in which participants were asked to provide feedback when listening to a story-telling virtual agent. When speaking, the agent shifted her gaze towards the listener at predefined positions in the dialogue. The results show that listeners are more prone to backchannel when the virtual agent's gaze is directed towards them than when it is directed away. However, there is a high response variability for different dialogue contexts, which suggests that the timing of backchannels cannot be explained by gaze direction alone.

Place, publisher, year, edition, pages
Santa Cruz, CA, USA, 2012
National Category
Computer Sciences Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-109388 (URN)
Conference
IVA 2012 workshop on Realtime Conversational Virtual Agents, September 15th, 2012, Santa Cruz, California
Note

QC 20130103

Available from: 2013-01-02 Created: 2013-01-02 Last updated: 2025-02-01 Bibliographically approved
8. Turn-taking, feedback and joint attention in situated human-robot interaction
2014 (English) In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 65, p. 50-66. Article in journal (Refereed) Published
Abstract [en]

In this paper, we present a study where a robot instructs a human on how to draw a route on a map. The human and robot are seated face-to-face with the map placed on the table between them. The user's and the robot's gaze can thus serve several simultaneous functions: as cues to joint attention, turn-taking, level of understanding and task progression. We have compared this face-to-face setting with a setting where the robot employs a random gaze behaviour, as well as a voice-only setting where the robot is hidden behind a paper board. In addition to this, we have also manipulated turn-taking cues such as completeness and filled pauses in the robot's speech. By analysing the participants' subjective rating, task completion, verbal responses, gaze behaviour, and drawing activity, we show that the users indeed benefit from the robot's gaze when talking about landmarks, and that the robot's verbal and gaze behaviour has a strong effect on the users' turn-taking behaviour. We also present an analysis of the users' gaze and lexical and prosodic realisation of feedback after the robot instructions, and show that these cues reveal whether the user has yet executed the previous instruction, as well as the user's level of uncertainty.

Keywords
Turn-taking, Feedback, Joint attention, Prosody, Gaze, Uncertainty
National Category
Other Computer and Information Science
Identifiers
urn:nbn:se:kth:diva-154366 (URN), 10.1016/j.specom.2014.05.005 (DOI), 000341901700005, 2-s2.0-84903625192 (Scopus ID)
Funder
Swedish Research Council, 2011-6237, 2011-6152; EU, FP7, Seventh Framework Programme, 288667
Note

QC 20141021

Available from: 2014-10-21 Created: 2014-10-20 Last updated: 2024-03-15 Bibliographically approved
9. The KTH Games Corpora: How to Catch a Werewolf
2013 (English) In: IVA 2013 Workshop Multimodal Corpora: Beyond Audio and Video: MMC 2013, 2013. Conference paper, Published paper (Refereed)
National Category
Computer Sciences Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-137399 (URN)
Conference
IVA 2013 Workshop Multimodal Corpora: Beyond Audio and Video - MMC 2013; Edinburgh, UK, 1 September 2013
Note

QC 20140623

Available from: 2013-12-13 Created: 2013-12-13 Last updated: 2025-02-01 Bibliographically approved
10. Who will get the grant?: A multimodal corpus for the analysis of conversational behaviours in group interviews
2014 (English) In: UM3I 2014 - Proceedings of the 2014 ACM Workshop on Understanding and Modeling Multiparty, Multimodal Interactions, Co-located with ICMI 2014, Association for Computing Machinery (ACM), 2014, p. 27-32. Conference paper, Published paper (Refereed)
Abstract [en]

In the last couple of years more and more multimodal corpora have been created. Recently, many of these corpora have also included RGB-D sensor data. However, there is, to our knowledge, no publicly available corpus which combines accurate gaze-tracking and high-quality audio recording for group discussions of varying dynamics. With a corpus that fulfils these needs, it would be possible to investigate higher-level constructs such as group involvement, individual engagement or rapport, which all require multimodal feature extraction. In the following paper we describe the design and recording of such a corpus, and we provide some illustrative examples of how such a corpus might be exploited in the study of group dynamics.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2014
Keywords
Corpus collection, Eye-gaze, Group dynamics, Involvement
National Category
Computer Sciences Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-158171 (URN), 10.1145/2666242.2666251 (DOI), 2-s2.0-84919344128 (Scopus ID), 978-145030652-2 (ISBN)
Conference
ICMI 2014 Workshop on Understanding and Modeling Multiparty, Multimodal Interactions, UM3I 2014, Istanbul, Turkey, 16 November 2014
Note

QC 20150203

Available from: 2014-12-30 Created: 2014-12-30 Last updated: 2025-02-01 Bibliographically approved

Open Access in DiVA

fulltext (1292 kB), 1122 downloads
File information
File name: FULLTEXT01.pdf, File size: 1292 kB, Checksum: SHA-512
634f83bc6313d7eeb2201a5a33ec60f12672c00c9dc5715cdf3999d0659c5bb9f146301f6a2c2aae3c24e91c9e6f333302c55af7765af9f9fb4c6b83aa64c5a0
Type: fulltext, Mimetype: application/pdf

Search in DiVA

By author/editor
Oertel, Catharine
By organisation
Speech, Music and Hearing, TMH
Engineering and Technology

Total: 1123 downloads
The number of downloads is the sum of all downloads of full texts. It may include, for example, previous versions that are no longer available.

Total: 2068 hits