Category: Speech Recognition

Categories: Publications, Speech Processing, Speech Recognition

Acoustic Feature Compensation using Vector Taylor Series

Environmental noise has a negative impact on the accuracy of speech and sound recognition systems: background noise corrupts the acoustic features of the signal. Because automatic recognition models are usually trained on a database of “clean” signals (no background noise), decoding is biased when the input signal is corrupted by additive noise and channel distortion. Several techniques have been proposed in the literature to overcome this problem. One solution is to denoise the speech signal before it is processed by the ASR; in practice, this is done by applying noise reduction techniques based on the Wiener filter or the Ephraim-Malah filter. Another solution is to train the ASR system on data recorded under a variety of environmental conditions; however, this requires collecting and storing a large amount of noisy training data. A third idea is to estimate the “noisy” acoustic model (Hidden Markov Model, HMM) from the “clean” acoustic model. Two common techniques following this approach are Parallel Model Combination (PMC) [1] and Vector Taylor Series (VTS) compensation, which will be presented here.
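In the log-spectral domain, additive noise corrupts a clean feature x through the mismatch function y = x + log(1 + exp(n - x)); first-order VTS linearizes this function around the clean and noise Gaussian means to produce a compensated “noisy” Gaussian. The sketch below illustrates this per-dimension compensation with NumPy, assuming diagonal covariances; all numeric values are made-up toy numbers, not parameters from any real acoustic model:

```python
import numpy as np

def vts_compensate(mu_x, var_x, mu_n, var_n):
    """First-order VTS compensation of a diagonal Gaussian.

    Mismatch function in the log-spectral domain:
        y = x + log(1 + exp(n - x))
    linearized around (mu_x, mu_n).
    """
    delta = mu_n - mu_x
    # Compensated mean: g(mu_x, mu_n)
    mu_y = mu_x + np.log1p(np.exp(delta))
    # Elementwise Jacobians of y w.r.t. n and x (they sum to 1)
    dg_dn = np.exp(delta) / (1.0 + np.exp(delta))
    dg_dx = 1.0 - dg_dn
    # Compensated variance under the diagonal-covariance assumption
    var_y = dg_dx**2 * var_x + dg_dn**2 * var_n
    return mu_y, var_y

# Toy 3-dimensional log-mel example (made-up values)
mu_x = np.array([2.0, 1.5, 0.5])   # clean Gaussian mean
var_x = np.array([0.4, 0.3, 0.2])  # clean variance
mu_n = np.array([1.0, 1.0, 1.0])   # noise mean
var_n = np.array([0.1, 0.1, 0.1])  # noise variance

mu_y, var_y = vts_compensate(mu_x, var_x, mu_n, var_n)
```

Since log(1 + exp(n - x)) is always positive, the compensated mean is shifted upward relative to the clean mean, which matches the intuition that additive noise raises log-spectral energies.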

Categories: Information Retrieval, Publications, Speech Processing, Speech Recognition

PNCC features for ASR robustness enhancement

The acoustic features traditionally used in speech and audio processing are MFCC and PLP. However, a key requirement when designing an acoustic signal fingerprint is robustness. Consequently, several techniques aim to enhance MFCC and PLP, for example through cepstral mean and variance normalization, or RASTA filtering in the particular case of PLP. Here, we present a more recent type of acoustic feature that directly embeds a noise reduction algorithm: Power-Normalized Cepstral Coefficients (PNCC), introduced by Chanwoo Kim [1]. This feature is more robust against background noise than the traditional PLP and MFCC features.
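One distinctive PNCC design choice is replacing MFCC's log compression with a power-law nonlinearity (exponent 1/15 in Kim's formulation), which behaves far more gracefully at low signal power. The comparison below is purely illustrative: the filterbank power values are made up, and the full PNCC pipeline (gammatone filtering, medium-time power bias subtraction, temporal masking, power normalization) is omitted:

```python
import numpy as np

def log_compress(power, floor=1e-10):
    """MFCC-style log compression of filterbank power."""
    return np.log(np.maximum(power, floor))

def power_law_compress(power, exponent=1.0 / 15.0):
    """PNCC-style power-law nonlinearity (Kim's exponent 1/15)."""
    return np.power(power, exponent)

# Made-up filterbank power values spanning several orders of magnitude
power = np.array([1e-8, 1e-4, 1e-2, 1.0, 100.0])

log_out = log_compress(power)
pl_out = power_law_compress(power)

# The power-law output stays bounded and positive at very low power,
# whereas the log output diverges toward minus infinity as power -> 0.
```

This bounded behavior at low power is one reason PNCC degrades less than MFCC when background noise dominates weak spectral regions.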

Categories: Publications, Speech Processing, Speech Recognition, Video Analysis

Speaker normalization in ASR: Vocal Tract Length Normalization (VTLN)

In video content modeling, the ASR system must be speaker-independent, since the speakers appearing in the videos are unknown. However, speaker-independent ASR systems are less accurate than speaker-dependent ones because of speech variability from one speaker to another. This variability stems from speaker-dependent parameters such as pitch (fundamental frequency) and formant frequencies. In speech production, while the vocal tract shape carries phonetic information and is therefore of great importance for speech recognition, the vocal tract length can be regarded as mere noise. This length varies from about 13 cm (adult women) to 18 cm (adult men), and the formant center frequencies, which depend on it, can vary considerably. Consequently, the acoustic features of the same utterance pronounced by different speakers can differ significantly. To mitigate this variability, two main solutions are used: speaker adaptation and speaker normalization. Here we present a technique based on the second solution: Frequency Warping based Vocal Tract Length Normalization (VTLN).
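Frequency-warping VTLN rescales the frequency axis before filterbank analysis. A common scheme is a piecewise-linear warp: frequencies are scaled by a factor α up to a boundary frequency, then mapped linearly so that the Nyquist frequency stays fixed. The sketch below is an illustrative implementation of this scheme; the 4 kHz Nyquist frequency, the warp factor, and the 0.875 boundary fraction are assumed values in the spirit of HTK-style warping, not taken from the article:

```python
def piecewise_linear_warp(f, alpha, f_nyq=4000.0, boundary_frac=0.875):
    """Piecewise-linear VTLN frequency warping.

    Below the boundary frequency f0 the axis is scaled by alpha;
    above f0 a straight line maps (f0, alpha*f0) to (f_nyq, f_nyq),
    so the Nyquist frequency is preserved.
    """
    # Boundary chosen so the warped branch never exceeds Nyquist
    f0 = boundary_frac * f_nyq * min(1.0, 1.0 / alpha)
    if f <= f0:
        return alpha * f
    # Linear segment from (f0, alpha*f0) to (f_nyq, f_nyq)
    slope = (f_nyq - alpha * f0) / (f_nyq - f0)
    return alpha * f0 + slope * (f - f0)

# Warp a few filterbank center frequencies for a shorter vocal tract
alpha = 1.1
warped = [piecewise_linear_warp(f, alpha) for f in (300.0, 1000.0, 3800.0, 4000.0)]
```

In practice the warp factor per speaker is usually chosen by a grid search (e.g. α between 0.8 and 1.2) maximizing the acoustic likelihood of that speaker's data.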

Categories: Information Retrieval, Machine Learning, Publications, Speech Processing, Speech Recognition

HMM-based ASR

ASR is a system whose purpose is to convert speech into text. Several types of ASR have been designed by speech processing researchers; however, those based on HMMs are the most accurate. Here, we will focus on the principles of the HMM.
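At decoding time, an HMM-based recognizer searches for the most likely hidden state sequence given the observations, which the Viterbi algorithm computes by dynamic programming. A minimal sketch on a toy two-state, discrete-observation HMM (all probabilities below are made-up values, not from a real acoustic model):

```python
def viterbi(obs, pi, A, B):
    """Most likely state path for a discrete-observation HMM.

    obs: observation symbol indices
    pi:  initial state probabilities
    A:   A[i][j] = transition probability from state i to state j
    B:   B[i][k] = probability of emitting symbol k in state i
    """
    n = len(pi)
    # Initialization with the first observation
    v = [pi[i] * B[i][obs[0]] for i in range(n)]
    back = []
    # Recursion: keep, for each state, its best predecessor
    for o in obs[1:]:
        bp = [max(range(n), key=lambda i: v[i] * A[i][j]) for j in range(n)]
        v = [v[bp[j]] * A[bp[j]][j] * B[j][o] for j in range(n)]
        back.append(bp)
    # Backtracking from the best final state
    state = max(range(n), key=lambda j: v[j])
    path = [state]
    for bp in reversed(back):
        state = bp[state]
        path.append(state)
    return list(reversed(path))

# Toy 2-state HMM with binary observations (made-up parameters)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
print(viterbi([0, 1, 1], pi, A, B))  # -> [0, 1, 1]
```

Real recognizers work in the log domain to avoid numerical underflow and use Gaussian mixture or neural network emission probabilities instead of a discrete table, but the dynamic programming recursion is the same.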

Categories: Information Retrieval, Machine Learning, Publications, Speech Recognition

Automatic Speech Recognition (ASR)

ASR aims to transcribe an unknown speech signal into text. This textual output can then be processed by a text mining system to spot essential keywords in the information carried by the input speech signal.

Categories: Information Retrieval, Publications, Signal Processing, Speech Processing, Speech Recognition

Keyword Spotting

Keyword spotting (KWS), or Spoken Term Detection (STD), is a subcategory of Automatic Speech Recognition (ASR). Unlike ASR, whose objective is to transcribe speech in its entirety, KWS only has to detect a predefined set of words.
