Methods for Improving Voice Activity Detection in Communication Services
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
A video conferencing application has a limited display area, so it shows only the active sites. These sites are identified using a voice activity detector (VAD), which is used to maintain a list of the most vocally active sites. In a typical video conferencing room, people type on their computers or laptops, and this causes a problem when the VAD classifies keyboard typing as speech activity even though nobody in the room is talking. As a result, the vocally inactive site is not removed from the list of active sites and blocks another vocally active site from being added, which degrades the user experience of the video conference. Current VADs often classify keyboard typing as active speech.
In this thesis, we explore two main approaches to solving the problem. The first approach is based on identifying keystroke signals in mixed audio data (speech and keyboard signals). Here we explore various audio signal classification methods based on temporal and spectral features of speech and keystroke signals, as well as prediction-model-based classification. We evaluate and compare these methods by varying their parameters, maximizing the percentage of true keystroke frames that are correctly classified as keystrokes while minimizing the number of non-keystroke frames that are falsely classified as keystrokes. The evaluated keystroke identification method thresholds the model error and achieves 85% accuracy using one previous and one future frame. Frames falsely classified as keystrokes are mainly caused by plosive sounds in the audio signal, whose characteristics are similar to those of keystroke signals.
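The idea of thresholding a model error with one previous and one future frame can be illustrated with a minimal sketch. The abstract does not specify the prediction model, so the code below assumes a deliberately simple one: the log-energy of a frame is predicted as the mean log-energy of its two neighbours, and a frame is flagged when the prediction error exceeds a threshold. The function names, the log-energy feature, and the threshold value are all hypothetical and are not taken from the thesis.

```python
import math


def frame_log_energy(frame):
    """Log10 of the mean squared amplitude, floored to avoid log(0)."""
    energy = sum(x * x for x in frame) / len(frame)
    return math.log10(energy + 1e-12)


def detect_keystroke_frames(frames, threshold=1.0):
    """Flag frames whose log-energy exceeds a neighbour-based prediction.

    Assumption (not from the thesis): the "model" predicts frame i as the
    mean log-energy of frame i-1 and frame i+1. A short transient such as
    a keystroke is poorly predicted by its neighbours, so its error is large.
    """
    log_e = [frame_log_energy(f) for f in frames]
    flags = [False] * len(frames)
    # The first and last frames lack one neighbour and are left unflagged.
    for i in range(1, len(frames) - 1):
        predicted = 0.5 * (log_e[i - 1] + log_e[i + 1])
        error = log_e[i] - predicted
        flags[i] = error > threshold
    return flags


# Usage: four quiet frames with one loud transient frame in the middle.
quiet = [0.01] * 160
loud = [0.5] * 160
print(detect_keystroke_frames([quiet, quiet, loud, quiet, quiet]))
```

Because the decision for frame i needs frame i+1, a detector of this kind adds one frame of latency. A sustained loud region produces no flags under this sketch, since each frame is then well predicted by its neighbours.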
The second approach is to complement the VAD so that it does not trigger on keystroke signals. For this purpose we explore different methods for improving the pitch detection functionality in the VAD. We evaluate a new pitch detector that computes pitch using the autocorrelation of normalized signal frames. We then design a new speech detector, consisting of the new pitch detector together with hangover addition, that separates the mixed audio data into speech and non-speech regions in real time. The new speech detector does not trigger on keystroke frames, i.e. it places them in the non-speech region, and hence solves the problem.
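A rough sketch of a pitch-based speech detector with hangover is given here for illustration. It is not the detector from the thesis: the normalization (mean removal and scaling to unit energy), the 60 to 400 Hz pitch search range, the voicing threshold, and the hangover length are assumptions chosen for the example, and all function and parameter names are hypothetical.

```python
import math


def normalized_autocorr_peak(frame, min_lag, max_lag):
    """Return (lag, value) of the largest autocorrelation of a normalized frame.

    The frame is mean-removed and scaled to unit energy, so the lag-0
    autocorrelation would be 1 and other values are relative to it.
    """
    n = len(frame)
    mean = sum(frame) / n
    centered = [x - mean for x in frame]
    norm = math.sqrt(sum(x * x for x in centered))
    if norm < 1e-9:  # silent frame: no periodicity to measure
        return 0, 0.0
    unit = [x / norm for x in centered]
    best_lag, best_val = 0, 0.0
    for lag in range(min_lag, min(max_lag, n - 1) + 1):
        val = sum(unit[i] * unit[i + lag] for i in range(n - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag, best_val


def detect_speech(frames, sample_rate=8000, voicing_threshold=0.5,
                  hangover_frames=3):
    """Mark a frame as speech if it is periodic, or within the hangover."""
    min_lag = sample_rate // 400  # assumed upper pitch limit of 400 Hz
    max_lag = sample_rate // 60   # assumed lower pitch limit of 60 Hz
    flags, hang = [], 0
    for frame in frames:
        _, peak = normalized_autocorr_peak(frame, min_lag, max_lag)
        if peak >= voicing_threshold:
            hang = hangover_frames  # re-arm the hangover on voiced frames
            flags.append(True)
        elif hang > 0:
            hang -= 1               # keep the speech region open briefly
            flags.append(True)
        else:
            flags.append(False)
    return flags


# Usage: a 200 Hz tone stands in for voiced speech, a single impulse for a click.
voiced = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(240)]
silence = [0.0] * 240
click = [1.0 if t == 10 else 0.0 for t in range(240)]
print(detect_speech([voiced, silence, silence, silence, silence, click],
                    hangover_frames=2))
```

The hangover keeps the detector in the speech state for a few frames after the last voiced frame, so unvoiced word endings are not cut off. The impulse has no periodic structure in the search range, so it does not open a speech region in this sketch; real keystroke recordings are more complex than a single impulse, and the thresholds would need tuning on actual data.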
Place, publisher, year, edition, pages
IT, 13 001
Engineering and Technology
Identifiers
URN: urn:nbn:se:uu:diva-192096
OAI: oai:DiVA.org:uu-192096
DiVA: diva2:588802
Master Programme in Computer Science
Lundberg Nordenvaad, Magnus
Kaati, Lisa