Prosodic research on speech-gesture integration has shown that gestures temporally align with prominent units in speech, e.g., [1][2][3], as formulated in McNeill's phonological synchronization rule [4]. Some studies have also provided evidence that speech and gesture may converge not only in the temporal but also in the "spatial" domain, showing correlations between the presence and strength of gestures (their magnitude or complexity) and the strength of acoustic parameters in the production of prosodic prominence (as reflected, for instance, in the accentual fundamental frequency [fo] range), e.g., [5][6][7][8]. This spatial convergence has been formulated as the Cumulative-Cue Hypothesis [7][9] and has been argued to result from an underlying compulsion to express prominence in both speech and gesture, all else being equal. This compulsion can be understood as part of a revised Effort Code [10]: to signal prominence, we tend to produce vocal and gestural signals indicating an increased level of effort [9].

However, evidence in favor of the Cumulative-Cue Hypothesis is still sparse and heterogeneous, stemming mostly from studies involving instructed or elicited movements rather than naturally occurring co-speech gestures. Moreover, most studies have focused strictly on arm or hand gestures, and hardly any have considered gestural clustering (e.g., combined hand and head gestures) as a possible dimension of gestural strength. The present study extends this line of research by examining the realization of phrase-level pitch accents in Swedish (so-called 'big accents', see Fig. 1) as a function of accompanying manual gesture strokes and eyebrow movements. Our materials consist of spontaneous Swedish dyadic conversations taken from the Spontal Corpus [11]. So far, data from eight speakers (four female, four male; 20 minutes in total, or 4,294 words) have been included in our preliminary analysis (Tab. 1, Fig. 2), and more data are currently being processed.

Big accents (BA) were labelled manually with access to the audio channel and an fo display, but without the video channel. Manual gestures (MG) and eyebrow movements (EB) were labelled manually with access to the video only. All events (BA, MG, EB) were labelled, at least partially, by two annotators, yielding acceptable inter-rater reliabilities (κ_BA = .78; κ_EB = .70; κ_MG = .82). For BAs, fo landmarks were annotated manually (Fig. 1), following the criteria specified in [7]. Based on these landmarks, two dependent variables were calculated: the range (in semitones) of the accentual fall (if present) of the potentially two-peaked BA (see Fig. 1), and the range of the subsequent big-accent rise. Linear mixed models and likelihood ratio tests were used to assess how well the ranges of the fall and the rise are predicted by the presence of gesture (clusters), operationalized as a predictor MMP (multimodal prominence) with four levels: BA (accent only, no gesture), BA+MG, BA+EB, and BA+MG+EB.

In this preliminary data set, EBs and MGs seldom clustered (Tab. 1); that is, pitch accents most often occurred either with a manual gesture, with an eyebrow gesture, or without any gesture. The preliminary results reveal a significant trend towards larger fo rises when an eyebrow movement accompanies the accented word, as indicated by a significant contribution of the predictor MMP (χ² = 19.93, df = 3, p < .001) and by significant post-hoc comparisons for BA vs. BA+EB (t = 4.96, df = 758, p < .001) and BA+MG vs. BA+EB (t = 4.75, df = 758, p < .001). This provides new, partial evidence for the Cumulative-Cue Hypothesis, although the results for the BA+MG+EB cluster appear to point in the opposite direction. However, this preliminary data set contains very few data points for BA+MG+EB. At the conference, we will also discuss these results in relation to other hypotheses characterized by trading relationships.
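The semitone ranges of the fall and rise can be derived from the annotated fo landmarks with the standard 12·log2 conversion between two frequency values. The sketch below is illustrative only; the function name and the Hz values are assumptions, not values from the study:

```python
import math

def semitone_range(f_low_hz: float, f_high_hz: float) -> float:
    """Interval between two fo landmarks in semitones (12 semitones = 1 octave)."""
    return 12.0 * math.log2(f_high_hz / f_low_hz)

# A hypothetical big-accent rise from 180 Hz to 240 Hz:
rise_st = semitone_range(180.0, 240.0)  # about 4.98 semitones
```

An octave doubling (e.g., 100 Hz to 200 Hz) yields exactly 12 semitones, and a fall returns a negative value, so fall ranges would be reported as the absolute value of this measure.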
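The likelihood ratio test reported above compares nested mixed models and refers the statistic χ² = 19.93 to a chi-square distribution with 3 degrees of freedom (one per added MMP contrast). As a sanity check on the reported p-value, the chi-square tail probability for df = 3 has a closed form; the helper below is a sketch for verification, not code from the study:

```python
import math

def chi2_sf_df3(x: float) -> float:
    """Survival function P(X > x) of a chi-square variable with 3 df (closed form)."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

p = chi2_sf_df3(19.93)  # well below .001, consistent with the reported MMP effect
```

In a full analysis one would obtain this p-value directly from the model comparison (e.g., an ANOVA of the fitted mixed models), but the closed form confirms that a statistic of 19.93 at df = 3 lies far beyond the .001 critical value of about 16.27.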
The 2nd International Multimodal Communication Symposium (MMSYM), Frankfurt, Germany, September 25-27, 2024, pp. 119-120.