Audio-Visual Classification and Detection of Human Manipulation Actions
2014 (English). In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2014), IEEE conference proceedings, 2014, pp. 3045-3052. Conference paper (Refereed)
Humans are able to merge information from multiple perceptual modalities and formulate a coherent representation of the world. Our thesis is that robots need to do the same in order to operate robustly and autonomously in an unstructured environment. It has also been shown in several fields that multiple sources of information can complement each other, overcoming the limitations of a single perceptual modality. Hence, in this paper we introduce a data set of actions that includes both visual data (RGB-D video and 6-DOF object pose estimation) and acoustic data. We also propose a method for recognizing and segmenting actions from continuous audio-visual data. The proposed method is employed for an extensive evaluation of the descriptive power of the two modalities, and we discuss how they can be used jointly to infer a coherent interpretation of the recorded action.
Place, publisher, year, edition, pages
IEEE conference proceedings, 2014, pp. 3045-3052
Series: IEEE International Conference on Intelligent Robots and Systems, ISSN 2153-0858
Keywords: Acoustic data, Audio-visual, Audio-visual data, Coherent representations, Human manipulation, Multiple source, Unstructured environments, Visual data
Research subject: Computer Vision and Robotics (Autonomous Systems)
Identifiers
URN: urn:nbn:se:kth:diva-158004
DOI: 10.1109/IROS.2014.6942983
ISI: 000349834603023
Scopus ID: 2-s2.0-84911478073
ISBN: 978-1-4799-6934-0
OAI: oai:DiVA.org:kth-158004
DiVA: diva2:773353
Conference: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2014), Palmer House Hilton Hotel, Chicago, United States, 14-18 September 2014
QC 20150122. Available from: 2014-12-18. Created: 2014-12-18. Last updated: 2015-05-04. Bibliographically approved.