Methods for Improving Voice Activity Detection in Communication Services
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology.
2013 (English)
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Abstract [en]

Because display area is limited, a video conferencing application has to display only the active sites. These are identified using a voice activity detector (VAD), and the application maintains a list of the most vocally active sites. In a typical video conferencing room, people type on their computers or laptops, and this causes a problem when the VAD classifies the keyboard typing signals as speech activity even though nobody in the room is talking. As a result, the vocally inactive site is not removed from the list of active sites and blocks another vocally active site from being added to the list, creating a very bad user experience in the video conference. Current VADs often classify keyboard typing as active speech.

In this thesis work, we explore two main approaches to solve the problem. The first approach is based on identifying keystroke signals in the mixed audio data (speech and keyboard signals). In this approach we explore various audio signal classification methods based on temporal and spectral features of speech and keystroke signals, as well as classification based on a prediction model. We evaluate and compare these methods by varying parameters so as to maximize the percentage of true keystroke frames that are correctly classified as keystrokes while minimizing the percentage of non-keystroke frames that are falsely classified as keystrokes. The evaluated keystroke identification approach is based on thresholding the model error and achieves 85% accuracy using one previous and one future frame. The frames falsely classified as keystroke frames in this approach are mainly caused by plosive sounds in the audio signal, whose characteristics are similar to those of keystroke signals.
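As a rough illustration of thresholding a prediction error with one previous and one future frame, the Python sketch below flags frames whose energy is badly predicted by their two neighbours. The abstract does not specify the prediction model, so everything here is an assumption made for illustration: the predictor (mean of the neighbouring frame energies), the error measure (ratio of actual to predicted energy), and the hypothetical names and defaults `frame_energies`, `detect_transient_frames`, `frame_len`, and `threshold`. It is not the method evaluated in the thesis.

```python
import math


def frame_energies(signal, frame_len):
    """Split the signal into non-overlapping frames and return each frame's energy."""
    n_frames = len(signal) // frame_len
    return [
        sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len])
        for i in range(n_frames)
    ]


def detect_transient_frames(signal, frame_len=160, threshold=4.0, eps=1e-9):
    """Flag frames whose energy is poorly predicted by their two neighbours.

    Assumed (hypothetical) predictor: the energy of frame i is predicted as the
    mean of the energies of frames i-1 and i+1. The "model error" is the ratio
    of actual to predicted energy; short impulsive sounds give a large ratio.
    The first and last frame lack a neighbour and are never flagged.
    """
    energies = frame_energies(signal, frame_len)
    flags = [False] * len(energies)
    for i in range(1, len(energies) - 1):
        predicted = 0.5 * (energies[i - 1] + energies[i + 1])
        error = energies[i] / (predicted + eps)
        flags[i] = error > threshold
    return flags
```

Calling `detect_transient_frames(samples)` on a list of float samples returns one boolean per frame. A predictor this simple would also be expected to fire on plosives, which is consistent with the false positives the abstract reports for the thresholding approach.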

The second approach is based on finding a mechanism to complement the VAD so that it does not trigger on keystroke signals. For this purpose we explore different methods for improving the pitch detection functionality in the VAD. We evaluate a new pitch detector which computes pitch using the autocorrelation of the normalized signal frames. We then design a new speech detector, consisting of the new pitch detector with an added hangover, that separates the mixed audio data into speech and non-speech regions in real time. The new speech detector does not trigger on keystroke frames, i.e. it places the keystroke frames in the non-speech region, and hence solves the problem.
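The idea of a pitch-based speech detector with hangover can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the detector designed in the thesis: the normalization (mean removal and scaling to unit energy), the 60 to 400 Hz pitch search range, the peak threshold of 0.5, the two-frame hangover, and the names `normalized_autocorr_peak` and `detect_speech` are all hypothetical choices.

```python
import math


def normalized_autocorr_peak(frame, min_lag, max_lag):
    """Return (best_lag, peak) of the autocorrelation of a normalized frame.

    The frame is mean-removed and scaled to unit energy, so the lag-0
    autocorrelation would be 1.0. The frame must be longer than max_lag.
    """
    mean = sum(frame) / len(frame)
    centered = [x - mean for x in frame]
    energy = sum(x * x for x in centered)
    if energy == 0.0:
        return 0, 0.0  # silent frame: no pitch
    scale = 1.0 / math.sqrt(energy)
    norm = [x * scale for x in centered]
    best_lag, best_val = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        val = sum(norm[n] * norm[n + lag] for n in range(len(norm) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag, best_val


def detect_speech(frames, sample_rate=8000, peak_threshold=0.5, hangover_frames=2):
    """Label each frame as speech (True) or non-speech (False).

    A frame is speech if it shows a strong periodic peak in the assumed pitch
    range; the hangover keeps the decision at True for a few frames afterwards
    so that weak speech tails are not cut off.
    """
    min_lag = sample_rate // 400  # assumed upper pitch limit of 400 Hz
    max_lag = sample_rate // 60   # assumed lower pitch limit of 60 Hz
    decisions = []
    hang = 0
    for frame in frames:
        _, peak = normalized_autocorr_peak(frame, min_lag, max_lag)
        if peak >= peak_threshold:
            hang = hangover_frames
            decisions.append(True)
        elif hang > 0:
            hang -= 1
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions
```

A periodic (voiced) frame produces a strong autocorrelation peak at its pitch lag, whereas a short click has almost no energy at lags in the pitch range, so a detector of this kind leaves isolated keystroke-like frames in the non-speech region. Because the loop needs only the current frame and a counter, it can run frame by frame in real time.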

Place, publisher, year, edition, pages
IT, 13 001
National Category
Engineering and Technology
URN: urn:nbn:se:uu:diva-192096
OAI: diva2:588802
Educational program
Master Programme in Computer Science
Available from: 2013-01-16 Created: 2013-01-16 Last updated: 2013-01-16 Bibliographically approved
