Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition
Stefanov, Kalin. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Speech Communication and Technology; University of Southern California. ORCID iD: 0000-0002-0861-8660
Beskow, Jonas. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Speech Communication and Technology. ORCID iD: 0000-0003-1399-6604
Salvi, Giampiero. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Speech Communication and Technology. ORCID iD: 0000-0002-3323-5311
2019 (English). In: IEEE Transactions on Cognitive and Developmental Systems, ISSN 2379-8920. Article in journal (Refereed). Published.
Abstract [en]

This paper presents a self-supervised method for visual detection of the active speaker in multi-person spoken interaction. Active speaker detection is a fundamental prerequisite for any artificial cognitive system that attempts to acquire language in social settings. The proposed method is intended to complement acoustic detection of the active speaker, improving system robustness in noisy conditions. It can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Furthermore, the method does not rely on external annotations, and is therefore consistent with cognitive development; instead, it uses information from the auditory modality to support learning in the visual domain. The paper reports an extensive evaluation of the method on a large multi-person face-to-face interaction dataset. The results show good performance in a speaker-dependent setting, but significantly lower performance in a speaker-independent setting. We believe the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.
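The core self-supervision idea in the abstract, using the auditory modality to supervise learning in the visual domain, can be sketched in miniature: an audio voice-activity detector supplies noisy frame-level pseudo-labels, and a purely visual classifier is trained on them, so no manual annotation is needed. The sketch below is an illustrative assumption, not the paper's actual architecture: it uses synthetic "face feature" vectors and plain logistic regression in place of the real visual front-end and model.

```python
import numpy as np

# Illustrative sketch only: synthetic data and a linear model stand in for
# the paper's visual features and classifier. The audio VAD pseudo-labels
# replace manual annotation, which is the self-supervision idea.

rng = np.random.default_rng(0)

# Synthetic per-frame visual face features (e.g. mouth-region descriptors).
n_frames, n_feats = 400, 8
X = rng.normal(size=(n_frames, n_feats))

# Ground-truth speaking state (never shown to the learner) and a noisy
# audio voice-activity detector that agrees with it 90% of the time.
true_speaking = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
vad_labels = np.where(rng.random(n_frames) < 0.9,
                      true_speaking, 1.0 - true_speaking)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Train a logistic-regression "visual" classifier on the audio-derived
# pseudo-labels with full-batch gradient descent.
w = np.zeros(n_feats)
for _ in range(500):
    p = sigmoid(X @ w)
    w -= 0.1 * X.T @ (p - vad_labels) / n_frames

# At test time the classifier uses vision alone; compare against the
# (held-out) true speaking state.
pred = (sigmoid(X @ w) > 0.5).astype(float)
accuracy = (pred == true_speaking).mean()
print(round(accuracy, 2))
```

Despite the 10% label noise, the visual classifier recovers the speaking state well, which mirrors why audio-derived supervision can bootstrap a visual detector without external annotation.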

Place, publisher, year, edition, pages
2019.
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-260126
DOI: 10.1109/TCDS.2019.2927941
Scopus ID: 2-s2.0-85069908129
OAI: oai:DiVA.org:kth-260126
DiVA, id: diva2:1354629
Note

QC 20191011

Available from: 2019-09-25. Created: 2019-09-25. Last updated: 2019-10-11. Bibliographically approved.

Open Access in DiVA

fulltext (2453 kB), 26 downloads
File information
File name: FULLTEXT01.pdf
File size: 2453 kB
Checksum (SHA-512): 96aa60cb3580ceb4ffd093e6c67dd1995f7e242e30345b06b0440ce15f23104334153d6247ca0b37ff65d81d63c977369c6ce35e614dd76385f03af8f4e2ed84
Type: fulltext
Mimetype: application/pdf

Other links

Publisher's full text, Scopus

Search in DiVA

By author/editor
Stefanov, Kalin; Beskow, Jonas; Salvi, Giampiero
By organisation
Speech Communication and Technology
Computer Sciences

Total: 26 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.
