1 - 22 of 22
  • 1. Abou-Zleikha, Mohamed
    Székely, Eva
    University College Dublin, Ireland.
    Cahill, Peter
    Carson-Berndsen, Julie
    Multi-level exemplar-based duration generation for expressive speech synthesis (2012). In: Proceedings of Speech Prosody, 2012, Vol. 2012. Conference paper (Refereed)
    Abstract [en]

    The generation of duration of speech units from linguistic information, as one component of a prosody model, is considered to be a requirement for natural sounding speech synthesis. This paper investigates the use of a multi-level exemplar-based model for duration generation for the purposes of expressive speech synthesis. The multi-level exemplar-based model has been proposed in the literature as a cognitive model for the production of duration. The implementation of this model for duration generation for speech synthesis is not straightforward: it requires a set of modifications to the model, and it requires that linguistically related units and the context of the target units be taken into consideration. The work presented in this paper implements this model and presents a solution to these issues through the use of prosodic-syntactic correlated data, full context information of the input example, and corpus exemplars.

  • 2. Ahmed, Zeeshan
    Steiner, Ingmar
    Székely, Éva
    CNGL, UCD.
    Carson-Berndsen, Julie
    A system for facial expression-based affective speech translation (2013). In: Proceedings of the companion publication of the 2013 international conference on Intelligent user interfaces companion, 2013, p. 57-58. Conference paper (Refereed)
    Abstract [en]

    In the emerging field of speech-to-speech translation, emphasis is currently placed on the linguistic content, while the significance of paralinguistic information conveyed by facial expression or tone of voice is typically neglected. We present a prototype system for multimodal speech-to-speech translation that is able to automatically recognize and translate spoken utterances from one language into another, with the output rendered by a speech synthesis system. The novelty of our system lies in the technique of generating the synthetic speech output in one of several expressive styles that is automatically determined using a camera to analyze the user's facial expression during speech.

  • 3. Betz, Simon
    Zarrieß, Sina
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Wagner, Petra
    The greennn tree - lengthening position influences uncertainty perception (2019). Conference paper (Refereed)
  • 4. Cabral, Joao P
    Kane, Mark
    Ahmed, Zeeshan
    Abou-Zleikha, Mohamed
    Székely, Éva
    University College Dublin, Ireland.
    Zahra, Amalia
    Ogbureke, Kalu U
    Cahill, Peter
    Carson-Berndsen, Julie
    Schlögl, Stephan
    Rapidly Testing the Interaction Model of a Pronunciation Training System via Wizard-of-Oz (2012). In: Proceedings of the International Conference on Language Resources and Evaluation, 2012, p. 4136-4142. Conference paper (Refereed)
    Abstract [en]

    This paper describes a prototype of a computer-assisted pronunciation training system called MySpeech. The interface of the MySpeech system is web-based and it currently enables users to practice pronunciation by listening to speech spoken by native speakers and tuning their speech production to correct any mispronunciations detected by the system. This practice exercise is facilitated in different topics and difficulty levels. An experiment was conducted in this work that combines the MySpeech service with the WebWOZ Wizard-of-Oz platform (http://www.webwoz.com), in order to improve the human-computer interaction (HCI) of the service and the feedback that it provides to the user. The employed Wizard-of-Oz method enables a human (who acts as a wizard) to give feedback to the practising user, while the user is not aware that there is another person involved in the communication. This experiment made it possible to quickly test an HCI model before its implementation in the MySpeech system. It also allowed input data to be collected from the wizard that can be used to improve the proposed model. Another outcome of the experiment was the preliminary evaluation of the pronunciation learning service in terms of user satisfaction, which would be difficult to conduct before integrating the HCI part.

  • 5. Cabral, Joao P
    Kane, Mark
    Ahmed, Zeeshan
    Székely, Éva
    University College Dublin, Ireland.
    Zahra, Amalia
    Ogbureke, Kalu U
    Cahill, Peter
    Carson-Berndsen, Julie
    Schlögl, Stephan
    Using the Wizard-of-Oz Framework in a Pronunciation Training System for Providing User Feedback and Instructions (2012). Conference paper (Refereed)
  • 6. Cahill, Peter
    Ogbureke, Udochukwu
    Cabral, Joao
    Székely, Éva
    University College Dublin, Ireland.
    Abou-Zleikha, Mohamed
    Ahmed, Zeeshan
    Carson-Berndsen, Julie
    UCD Blizzard Challenge 2011 entry (2011). In: Proceedings of the Blizzard Challenge Workshop, 2011. Conference paper (Refereed)
    Abstract [en]

    This paper gives an overview of the UCD Blizzard Challenge 2011 entry. The entry is a unit selection synthesiser that uses hidden Markov models for prosodic modelling. The evaluation consisted of synthesising 2213 sentences from a high quality 15 hour dataset provided by Lessac Technologies. Results are analysed within the context of other systems and the future work for the system is discussed. 

  • 7.
    Clark, Leigh
    Univ Coll Dublin, Dublin, Ireland.
    Cowan, Benjamin R.
    Univ Coll Dublin, Dublin, Ireland.
    Edwards, Justin
    Univ Coll Dublin, Dublin, Ireland.
    Munteanu, Cosmin
    Univ Toronto, Mississauga, ON, Canada; Univ Toronto, Toronto, ON, Canada.
    Murad, Christine
    Univ Toronto, Mississauga, ON, Canada; Univ Toronto, Toronto, ON, Canada.
    Aylett, Matthew
    CereProc Ltd, Edinburgh, Midlothian, Scotland.
    Moore, Roger K.
    Univ Sheffield, Sheffield, S Yorkshire, England.
    Edlund, Jens
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Healey, Patrick
    Queen Mary Univ London, London, England.
    Harte, Naomi
    Trinity Coll Dublin, Dublin, Ireland.
    Torre, Ilaria
    Trinity Coll Dublin, Dublin, Ireland.
    Doyle, Philip
    Voysis Ltd, Dublin, Ireland.
    Mapping Theoretical and Methodological Perspectives for Understanding Speech Interface Interactions (2019). In: CHI EA '19 EXTENDED ABSTRACTS: EXTENDED ABSTRACTS OF THE 2019 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, ASSOC COMPUTING MACHINERY, 2019. Conference paper (Refereed)
    Abstract [en]

    The use of speech as an interaction modality has grown considerably through the integration of Intelligent Personal Assistants (IPAs, e.g. Siri, Google Assistant) into smartphones and voice-based devices (e.g. Amazon Echo). However, there remain significant gaps in using theoretical frameworks to understand user behaviours and choices and how they may be applied to specific speech interface interactions. This part-day multidisciplinary workshop aims to critically map out and evaluate theoretical frameworks and methodological approaches across a number of disciplines and establish directions for new paradigms in understanding speech interface user behaviour. In doing so, we will bring together participants from HCI and other speech-related domains to establish a cohesive, diverse and collaborative community of researchers from academia and industry with an interest in exploring theoretical and methodological issues in the field.

  • 8. Székely, Éva
    Ahmed, Zeeshan
    Cabral, Joao P
    Carson-Berndsen, Julie
    WinkTalk: a demonstration of a multimodal speech synthesis platform linking facial expressions to expressive synthetic voices (2012). In: Proceedings of the Third Workshop on Speech and Language Processing for Assistive Technologies, Association for Computational Linguistics, 2012, p. 5-8. Conference paper (Refereed)
    Abstract [en]

    This paper describes a demonstration of the WinkTalk system, which is a speech synthesis platform using expressive synthetic voices. With the help of a webcamera and facial expression analysis, the system allows the user to control the expressive features of the synthetic speech for a particular utterance with their facial expressions. Based on a personalised mapping between three expressive synthetic voices and the user's facial expressions, the system selects a voice that matches their face at the moment of sending a message. The WinkTalk system is an early research prototype that aims to demonstrate that facial expressions can be used as a more intuitive control over expressive speech synthesis than manual selection of voice types, thereby contributing to an improved communication experience for users of speech generating devices.

  • 9. Székely, Éva
    Ahmed, Zeeshan
    Cabral, Joao P
    Carson-Berndsen, Julie
    WinkTalk: a multimodal speech synthesis interface linking facial expressions to expressive synthetic voices (2012). In: Proceedings of the Third Workshop on Speech and Language Processing for Assistive Technologies, 2012. Conference paper (Refereed)
    Abstract [en]

    This paper describes a demonstration of the WinkTalk system, which is a speech synthesis platform using expressive synthetic voices. With the help of a webcamera and facial expression analysis, the system allows the user to control the expressive features of the synthetic speech for a particular utterance with their facial expressions. Based on a personalised mapping between three expressive synthetic voices and the user's facial expressions, the system selects a voice that matches their face at the moment of sending a message. The WinkTalk system is an early research prototype that aims to demonstrate that facial expressions can be used as a more intuitive control over expressive speech synthesis than manual selection of voice types, thereby contributing to an improved communication experience for users of speech generating devices.

  • 10. Székely, Éva
    Ahmed, Zeeshan
    Hennig, Shannon
    Cabral, Joao P
    Carson-Berndsen, Julie
    Predicting synthetic voice style from facial expressions. An application for augmented conversations (2014). In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 57, p. 63-75. Article in journal (Refereed)
    Abstract [en]

    The ability to efficiently facilitate social interaction and emotional expression is an important, yet unmet requirement for speech generating devices aimed at individuals with speech impairment. Using gestures such as facial expressions to control aspects of expressive synthetic speech could contribute to an improved communication experience for both the user of the device and the conversation partner. For this purpose, a mapping model between facial expressions and speech is needed that is high-level (utterance-based), versatile and personalisable. In the mapping developed in this work, visual and auditory modalities are connected based on the intended emotional salience of a message: the intensity of the user's facial expressions is mapped to the emotional intensity of the synthetic speech. The mapping model has been implemented in a system called WinkTalk that uses estimated facial expression categories and their intensity values to automatically select between three expressive synthetic voices reflecting three degrees of emotional intensity. An evaluation is conducted through an interactive experiment using simulated augmented conversations. The results have shown that automatic control of synthetic speech through facial expressions is fast, non-intrusive, sufficiently accurate and helps the user feel more involved in the conversation. It can be concluded that the system has the potential to facilitate a more efficient communication process between user and listener.
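    The abstract above describes selecting between three expressive voices from an estimated facial expression category and its intensity. Below is a minimal illustrative sketch in Python of that kind of intensity-based selection; the expression labels, thresholds and voice names are assumptions for illustration, not the published WinkTalk mapping.

        def select_voice_style(expression: str, intensity: float) -> str:
            """Pick one of three synthetic voices of increasing emotional
            intensity from an estimated facial expression and its intensity
            (0.0-1.0). Labels and thresholds are illustrative assumptions."""
            if expression == "neutral" or intensity < 0.33:
                return "voice_low_intensity"
            if intensity < 0.66:
                return "voice_mid_intensity"
            return "voice_high_intensity"

        # Example: a strongly expressed smile selects the most expressive voice.
        print(select_voice_style("happy", 0.8))  # -> voice_high_intensity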

  • 11. Székely, Éva
    Ahmed, Zeeshan
    Steiner, Ingmar
    Carson-Berndsen, Julie
    Facial expression as an input annotation modality for affective speech-to-speech translation (2012). Conference paper (Refereed)
    Abstract [en]

    One of the challenges of speech-to-speech translation is to accurately preserve the paralinguistic information in the speaker's message. In this work we explore the use of automatic facial expression analysis as an input annotation modality to transfer paralinguistic information at a symbolic level from input to output in speech-to-speech translation. To evaluate the feasibility of this approach, a prototype system, FEAST (Facial Expression-based Affective Speech Translation), has been developed. FEAST classifies the emotional state of the user and uses it to render the translated output in an appropriate voice style, using expressive speech synthesis.

  • 12. Székely, Éva
    Cabral, Joao P
    Abou-Zleikha, Mohamed
    Cahill, Peter
    Carson-Berndsen, Julie
    Evaluating expressive speech synthesis from audiobooks in conversational phrases (2012). Conference paper (Refereed)
    Abstract [en]

    Audiobooks are a rich resource of large quantities of natural sounding, highly expressive speech. In our previous research we have shown that it is possible to detect different expressive voice styles represented in a particular audiobook, using unsupervised clustering to group the speech corpus of the audiobook into smaller subsets representing the detected voice styles. These subsets of corpora of different voice styles reflect the various ways a speaker uses their voice to express involvement and affect, or imitate characters. This study is an evaluation of the detection of voice styles in an audiobook in the application of expressive speech synthesis. A further aim of this study is to investigate the usability of audiobooks as a language resource for expressive speech synthesis of utterances of conversational speech. Two evaluations have been carried out to assess the effect of the genre transfer: transmitting expressive speech from read aloud literature to conversational phrases with the application of speech synthesis. The first evaluation revealed that listeners have different voice style preferences for a particular conversational phrase. The second evaluation showed that it is possible for users of speech synthesis systems to learn the characteristics of a certain voice style well enough to make reliable predictions about what a certain utterance will sound like when synthesised using that voice style. 

  • 13. Székely, Éva
    Cabral, Joao P
    Cahill, Peter
    Carson-Berndsen, Julie
    Clustering Expressive Speech Styles in Audiobooks Using Glottal Source Parameters (2011). In: 12th Annual Conference of the International Speech Communication Association 2011 (INTERSPEECH 2011), ISCA, 2011, p. 2409-2412. Conference paper (Refereed)
    Abstract [en]

    A great challenge for text-to-speech synthesis is to produce expressive speech. The main problem is that it is difficult to synthesise high-quality speech using expressive corpora. With the increasing interest in audiobook corpora for speech synthesis, there is a demand to synthesise speech which is rich in prosody, emotions and voice styles. In this work, Self-Organising Feature Maps (SOFM) are used for clustering the speech data using voice quality parameters of the glottal source, in order to map out the variety of voice styles in the corpus. Subjective evaluation showed that this clustering method successfully separated the speech data into groups of utterances associated with different voice characteristics. This work can be applied in unit-selection synthesis by selecting appropriate data sets to synthesise utterances with specific voice styles. It can also be used in parametric speech synthesis to model different voice styles separately.
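    As a rough illustration of the clustering step described above, the sketch below trains a small self-organising map over per-utterance voice quality feature vectors and assigns each utterance to its best-matching unit. It assumes the third-party MiniSom package and random stand-in data; the paper's actual glottal source parameters, map size and training settings are not reproduced here.

        import numpy as np
        from minisom import MiniSom  # assumption: a generic SOM library, not the authors' implementation

        # Stand-in data: one row per utterance of (hypothetical) glottal/voice-quality parameters.
        rng = np.random.default_rng(0)
        utterance_features = rng.normal(size=(500, 4))

        # Train a 6x6 map; each map unit acts as a candidate voice-style cluster.
        som = MiniSom(6, 6, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=0)
        som.random_weights_init(utterance_features)
        som.train_random(utterance_features, num_iteration=5000)

        # Assign every utterance to its best-matching unit (cluster).
        clusters = [som.winner(v) for v in utterance_features]

    Utterances falling on the same (or neighbouring) map units can then be grouped into candidate voice-style subsets of the corpus.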

  • 14. Székely, Éva
    Csapo, Tamas Gabor
    Toth, Balint
    Mihajlik, Peter
    Carson-Berndsen, Julie
    Synthesizing expressive speech from amateur audiobook recordings (2012). In: Spoken Language Technology Workshop (SLT), 2012, p. 297-302. Conference paper (Refereed)
    Abstract [en]

    Freely available audiobooks are a rich resource of expressive speech recordings that can be used for the purposes of speech synthesis. Natural sounding, expressive synthetic voices have previously been built from audiobooks that contained large amounts of highly expressive speech recorded from a professionally trained speaker. The majority of freely available audiobooks, however, are read by amateur speakers, are shorter and contain less expressive (less emphatic, less emotional, etc.) speech both in terms of quality and quantity. Synthesizing expressive speech from a typical online audiobook therefore poses many challenges. In this work we address these challenges by applying a method consisting of minimally supervised techniques to align the text with the recorded speech, select groups of expressive speech segments and build expressive voices for hidden Markov-model based synthesis using speaker adaptation. Subjective listening tests have shown that the expressive synthetic speech generated with this method is often able to produce utterances suited to an emotional message. We used a restricted amount of speech data in our experiment, in order to show that the method is generally applicable to most typical audiobooks widely available online.

  • 15.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    How to train your fillers: uh and um in spontaneous speech synthesis (2019). Conference paper (Refereed)
  • 16.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Off the cuff: Exploring extemporaneous speech delivery with TTS (2019). Conference paper (Refereed)
  • 17.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Spontaneous conversational speech synthesis from found data (2019). Conference paper (Refereed)
  • 18.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    CASTING TO CORPUS: SEGMENTING AND SELECTING SPONTANEOUS DIALOGUE FOR TTS WITH A CNN-LSTM SPEAKER-DEPENDENT BREATH DETECTOR (2019). In: 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 2019, p. 6925-6929. Conference paper (Refereed)
    Abstract [en]

    This paper considers utilising breaths to create improved spontaneous-speech corpora for conversational text-to-speech from found audio recordings such as dialogue podcasts. Breaths are of interest since they relate to prosody and speech planning and are independent of language and transcription. Specifically, we propose a semisupervised approach where a fraction of coarsely annotated data is used to train a convolutional and recurrent speaker-specific breath detector operating on spectrograms and zero-crossing rate. The classifier output is used to find target-speaker breath groups (audio segments delineated by breaths) and subsequently select those that constitute clean utterances appropriate for a synthesis corpus. An application to 11 hours of raw podcast audio extracts 1969 utterances (106 minutes), 87% of which are clean and correctly segmented. This outperforms a baseline that performs integrated VAD and speaker attribution without accounting for breaths.
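    To make the detector architecture described above concrete, here is a minimal sketch of a convolutional-recurrent per-frame breath classifier in Keras, taking sequences of spectrogram frames with an appended zero-crossing-rate value and emitting a per-frame breath probability. The input dimensions, layer sizes and training settings are illustrative assumptions rather than the configuration published in the paper.

        from tensorflow.keras import layers, models

        # Assumed input: 200-frame windows, each frame = 128 spectrogram bins + 1 zero-crossing-rate value.
        n_frames, n_features = 200, 129

        inputs = layers.Input(shape=(n_frames, n_features))
        # 1-D convolutions over time learn local spectral patterns around breath events.
        x = layers.Conv1D(32, kernel_size=5, padding="same", activation="relu")(inputs)
        x = layers.Conv1D(32, kernel_size=5, padding="same", activation="relu")(x)
        # A recurrent layer captures longer-range temporal context.
        x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
        # Per-frame probability of breath vs. non-breath.
        outputs = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)

        model = models.Model(inputs, outputs)
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    Runs of frames classified as breath can then serve as segment boundaries, so that the audio between two detected breaths forms a candidate breath group for the synthesis corpus, in line with the approach the abstract describes.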

  • 19. Székely, Éva
    Kane, John
    Scherer, Stefan
    Gobl, Christer
    Carson-Berndsen, Julie
    Detecting a targeted voice style in an audiobook using voice quality features (2012). In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, p. 4593-4596. Conference paper (Refereed)
    Abstract [en]

    Audiobooks are known to contain a variety of expressive speaking styles that occur as a result of the narrator mimicking a character in a story, or expressing affect. An accurate modeling of this variety is essential for the purposes of speech synthesis from an audiobook. Voice quality differences are important features characterizing these different speaking styles, which are realized on a gradient and are often difficult to predict from the text. The present study uses a parameter characterizing breathy to tense voice qualities using features of the wavelet transform, and a measure for identifying creaky segments in an utterance. Based on these features, a combination of supervised and unsupervised classification is used to detect the regions in an audiobook where the speaker changes his regular voice quality to a particular voice style. The target voice style candidates are selected based on the agreement of the supervised classifier ensemble output, and evaluated in a listening test.

  • 20.
    Székely, Éva
    KTH.
    Keane, Mark T
    Carson-Berndsen, Julie
    The Effect of Soft, Modal and Loud Voice Levels on Entrainment in Noisy Conditions (2015). In: Sixteenth Annual Conference of the International Speech Communication Association, 2015. Conference paper (Refereed)
    Abstract [en]

    Conversation partners have a tendency to adapt their vocal intensity to each other and to other social and environmental factors. A socially adequate vocal intensity level by a speech synthesiser that goes beyond mere volume adjustment is highly desirable for a rewarding and successful human-machine or machine-mediated human-human interaction. This paper examines the interaction of the Lombard effect and speaker entrainment in a controlled experiment conducted with a confederate interlocutor. The interlocutor was asked to maintain either a soft, a modal or a loud voice level during the dialogues. During half of the trials, subjects were exposed to cocktail party noise through headphones. The analytical results suggest that both the background noise and the interlocutor's voice level affect the dynamics of speaker entrainment. Speakers appear to still entrain to the voice level of their interlocutor in noisy conditions, though to a lesser extent, as strategies of ensuring intelligibility affect voice levels as well. These findings could be leveraged in spoken dialogue systems and speech generating devices to help choose a vocal effort level for the synthetic voice that is both intelligible and socially suited to a specific interaction.

  • 21. Székely, Éva
    Steiner, Ingmar
    Ahmed, Zeeshan
    Carson-Berndsen, Julie
    Facial expression-based affective speech translation (2014). In: Journal on Multimodal User Interfaces, ISSN 1783-7677, E-ISSN 1783-8738, Vol. 8, no 1, p. 87-96. Article in journal (Refereed)
    Abstract [en]

    One of the challenges of speech-to-speech translation is to accurately preserve the paralinguistic information in the speaker's message. Information about the affect and emotional intent of a speaker is often carried in more than one modality. For this reason, the possibility of multimodal interaction with the system and the conversation partner may greatly increase the likelihood of a successful and gratifying communication process. In this work we explore the use of automatic facial expression analysis as an input annotation modality to transfer paralinguistic information at a symbolic level from input to output in speech-to-speech translation. To evaluate the feasibility of this approach, a prototype system, FEAST (facial expression-based affective speech translation), has been developed. FEAST classifies the emotional state of the user and uses it to render the translated output in an appropriate voice style, using expressive speech synthesis.

  • 22.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Wagner, Petra
    KTH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    THE WRYLIE-BOARD: MAPPING ACOUSTIC SPACE OF EXPRESSIVE FEEDBACK TO ATTITUDE MARKERS (2018). In: Proc. IEEE Spoken Language Technology Conference, 2018. Conference paper (Refereed)