Speech adaptation of special voice classes
Most automatic speech recognition systems are based on statistical models that
require training. While these types of systems have reached recognition rates
that are sufficient for many purposes, they perform poorly for speaker types
that are not present in the training material. Children are often absent from
training material for speech recognizers, and creating good training material
for children can be difficult and expensive.
To address this issue, this thesis focuses on using adult training material to
train a recognizer for children by adapting the training material during
training. Instead of performing speaker-dependent adaptation during
recognition, where computational power may be scarce and responsiveness may be
essential, adaptation is performed during training towards a class of speakers.
Using a combination of vocal tract length normalization (VTLN) and cepstral
mean normalization during training, promising results have been obtained. In a
connected-digits task, errors were reduced by as much as 70%, and in a
large-vocabulary task by almost 50%. When VTLN was used to warp the same
training material several times, and the warped copies were combined to train
one recognizer, a similar reduction in errors was shown, but with increased
robustness, indicating a less speaker-dependent system. It is also shown that a
piecewise linear warping method is better suited than a bilinear warping method
for warping adult speech to child speech.
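As a rough illustration of the two warping families and of cepstral mean
normalization named above, the following Python sketch uses common textbook
forms. The function names, the breakpoint at 0.7*pi, and the convention that a
warp factor above 1 stretches frequencies upward (adult towards child) are
assumptions made for illustration, not details taken from the thesis.

```python
import math


def piecewise_linear_warp(omega, alpha, omega0=0.7 * math.pi):
    """Warp a normalized frequency omega in [0, pi] by the factor alpha.

    Below the breakpoint omega0 (an assumed value) the axis is scaled by
    alpha; above it, a second linear segment brings the curve back to pi
    so that the full band [0, pi] is still covered.
    """
    if not 0.0 <= omega <= math.pi:
        raise ValueError("omega must lie in [0, pi]")
    if alpha <= 0.0 or alpha * omega0 >= math.pi:
        raise ValueError("alpha * omega0 must stay below pi")
    if omega <= omega0:
        return alpha * omega
    slope = (math.pi - alpha * omega0) / (math.pi - omega0)
    return alpha * omega0 + slope * (omega - omega0)


def bilinear_warp(omega, alpha):
    """Bilinear (all-pass) frequency warping with parameter |alpha| < 1."""
    if not -1.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (-1, 1)")
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all frames of one utterance."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]
```

Both warping functions map 0 to 0 and pi to pi. The piecewise linear form is
an exact scaling below the breakpoint, whereas the bilinear form bends the
whole axis smoothly with a single parameter.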
Place, publisher, year, edition, pages
Institutt for elektronikk og telekommunikasjon, 2011. 67 p.
ntnudaim:6125, MTEL elektronikk, Akustikk
Identifiers
URN: urn:nbn:no:ntnu:diva-13646
Local ID: ntnudaim:6125
OAI: oai:DiVA.org:ntnu-13646
DiVA: diva2:441345
Svendsen, Torbjørn, Professor