Digitala Vetenskapliga Arkivet

Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition
Kalin Stefanov, Institute for Creative Technologies, University of Southern California, Los Angeles, CA 90089, United States. ORCID iD: 0000-0002-0861-8660
Jonas Beskow, KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-1399-6604
Giampiero Salvi, KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH; NTNU Norwegian University of Science and Technology, Trondheim, Norway. ORCID iD: 0000-0002-3323-5311
2020 (English). In: IEEE Transactions on Cognitive and Developmental Systems, ISSN 2379-8920, E-ISSN 2379-8939, Vol. 12, no. 2, p. 250-259, article id 8758947. Article in journal (Refereed). Published.
Abstract [en]

This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system's robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Furthermore, the method does not rely on external annotations, thus complying with cognitive development. Instead, it uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker-dependent setting, whereas in a speaker-independent setting the proposed method yields significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.
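The abstract only outlines the self-supervision scheme: labels derived from the auditory modality are used to train a purely visual (face-based) speaking/not-speaking classifier, so no external annotation is needed. The sketch below is a minimal, hypothetical illustration of that general idea, not the authors' implementation; the energy-based voice activity detection, the small CNN, and all names, shapes, and parameters are assumptions made for illustration.

```python
# Hypothetical sketch of audio-supervised visual training for active speaker
# detection. NOT the paper's actual model: the energy-based VAD, the tiny CNN,
# and all shapes/hyperparameters are illustrative assumptions only.
import torch
import torch.nn as nn

def audio_vad_labels(audio_frames: torch.Tensor, threshold: float = 0.01) -> torch.Tensor:
    """Derive per-frame speech/non-speech targets from audio energy.
    audio_frames: (T, samples_per_video_frame) waveform chunks aligned with video frames."""
    energy = audio_frames.pow(2).mean(dim=1)           # (T,)
    return (energy > threshold).float()                # 1.0 = speaking, 0.0 = silent

class FaceSpeakingClassifier(nn.Module):
    """Small CNN mapping a face crop to a 'speaking' logit (placeholder architecture)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1),
        )

    def forward(self, face_crops: torch.Tensor) -> torch.Tensor:
        return self.net(face_crops).squeeze(1)         # logits, shape (T,)

model = FaceSpeakingClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

# Dummy stand-ins for one person's synchronized face crops and audio frames.
face_crops = torch.randn(64, 3, 64, 64)                # (T, C, H, W)
audio_frames = torch.randn(64, 800)                    # (T, samples per video frame)

# The only "labels" come from the audio channel: audio supervises vision.
targets = audio_vad_labels(audio_frames)
logits = model(face_crops)
loss = loss_fn(logits, targets)
loss.backward()
optimizer.step()
```

Under this reading, the audio channel is needed only during training; at run time the classifier operates on face crops alone, which is consistent with the abstract's motivation of complementing acoustic detection when the audio is noisy.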

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2020. Vol. 12, no. 2, p. 250-259, article id 8758947
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-260126
DOI: 10.1109/TCDS.2019.2927941
ISI: 000542972700013
Scopus ID: 2-s2.0-85069908129
OAI: oai:DiVA.org:kth-260126
DiVA id: diva2:1354629
Note

QC 20200625

Available from: 2019-09-25. Created: 2019-09-25. Last updated: 2024-06-24. Bibliographically approved.

Open Access in DiVA

fulltext (2453 kB), 288 downloads
File information
File name: FULLTEXT01.pdf
File size: 2453 kB
Checksum (SHA-512):
96aa60cb3580ceb4ffd093e6c67dd1995f7e242e30345b06b0440ce15f23104334153d6247ca0b37ff65d81d63c977369c6ce35e614dd76385f03af8f4e2ed84
Type: fulltext
Mimetype: application/pdf

Other links

Publisher's full text
Scopus

