Tag Archives: ASR

Publications, Speech Processing, Speech Recognition, Video Analysis

Speaker normalization in ASR: Vocal Tract Length Normalization (VTLN)

In video content modeling, the ASR system used must be speaker-independent, since the speakers in the different videos are unknown. However, speaker-independent ASR systems are less accurate than speaker-dependent ones because of the variability of speech from one speaker to another. This variability stems from speaker-dependent parameters such as pitch (fundamental frequency) and formant frequencies. In speech production, the vocal tract shape carries the phonetic information and is therefore of great importance in speech recognition, whereas the vocal tract length can be regarded as mere noise. Vocal tract length varies from about 13 cm (women) to 18 cm (men), and the formant center frequencies, which depend on it, can vary considerably as a result. Consequently, the acoustic features extracted from the same speech pronounced by different speakers can differ significantly. To mitigate this speech variability, two main solutions are used: speaker adaptation and speaker normalization. Here we present a technique based on the second solution: frequency-warping-based Vocal Tract Length Normalization (VTLN).
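As a rough illustration of how frequency-warping VTLN can work in practice, the sketch below (our own minimal example, not taken from the post) applies a piecewise-linear warping to a frequency axis and runs a simple grid search over warping factors; the filterbank edge frequency, the 0.875 breakpoint ratio, and the score() function are placeholder assumptions.

```python
import numpy as np

def warp_frequency(freqs, alpha, f_max, breakpoint_ratio=0.875):
    """Piecewise-linear VTLN warping of a frequency axis.

    Below the breakpoint the axis is scaled by alpha; above it, a second
    linear segment maps the remaining range onto [alpha * f0, f_max] so
    the warped axis still ends at f_max. The 0.875 breakpoint ratio is a
    common choice, not a requirement.
    """
    freqs = np.asarray(freqs, dtype=float)
    f0 = breakpoint_ratio * f_max
    upper_slope = (f_max - alpha * f0) / (f_max - f0)
    return np.where(freqs <= f0,
                    alpha * freqs,
                    alpha * f0 + upper_slope * (freqs - f0))

# Example: warp the centre frequencies of a toy filterbank and pick the
# warping factor that best fits the speaker.  In a real system score()
# would be the acoustic-model likelihood of the warped features.
if __name__ == "__main__":
    f_max = 8000.0                            # Nyquist for 16 kHz audio
    centres = np.linspace(100.0, f_max, 24)   # toy filterbank centres

    def score(warped_centres):
        # Placeholder objective; a real VTLN search maximises the
        # likelihood of the utterance under the acoustic model.
        return -np.sum((warped_centres - centres) ** 2)

    alphas = np.arange(0.88, 1.13, 0.02)      # typical search grid
    best_alpha = max(alphas,
                     key=lambda a: score(warp_frequency(centres, a, f_max)))
    print("selected warping factor:", round(float(best_alpha), 2))
```

In a full system, the selected warping factor would then be used to warp the filterbank during feature extraction for that speaker's utterances.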

Read More
Information Retrieval, Machine Learning, Publications, Speech Recognition

Automatic Speech Recognition (ASR)

ASR aims to transcribe an unknown speech signal into text. The resulting text can then be processed by a text-mining system in order to spot essential keywords in the spoken content.
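As a toy illustration of that downstream step, the snippet below (our own sketch, not from the post) takes an already-produced ASR transcript and spots candidate keywords with a simple frequency count; the transcript string and the stop-word list are placeholder assumptions, and a real text-mining system would use richer weighting such as TF-IDF.

```python
from collections import Counter
import re

# Placeholder transcript standing in for real ASR output.
transcript = ("the speaker describes vocal tract length normalization "
              "and how normalization improves speech recognition accuracy")

# Minimal stop-word list; a real system would use a fuller list plus
# stemming or a dedicated keyword-extraction model.
stop_words = {"the", "and", "how", "a", "of", "in", "to"}

tokens = re.findall(r"[a-z']+", transcript.lower())
keywords = Counter(t for t in tokens if t not in stop_words)

print(keywords.most_common(3))
# e.g. [('normalization', 2), ('speaker', 1), ('describes', 1)]
```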

Read More