Category: Speech Processing

Audio Analysis, Publications, Signal Processing, Speech Processing

Signal Processing Applied To Video Mining: Video Boundaries Detection

Scene change detection is a technique that aims to automatically identify scene changes in a video. Assuming that a scene is defined by its audio and video signals, we present here scene change detection techniques based on both. For the audio signal, the techniques rely on detecting abrupt variations in frequency- and time-domain features. For the video signal, the usual algorithms rely on variations of the Sum of Absolute Differences (SAD) between consecutive frames.
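
As a rough illustration of the video-based approach, here is a minimal sketch that flags a boundary wherever the SAD between consecutive frames jumps above a threshold. It assumes grayscale frames stored as NumPy arrays, and the fixed threshold is purely illustrative (in practice it would be tuned or made adaptive).

```python
import numpy as np

def sad(frame_a, frame_b):
    """Sum of Absolute Differences between two grayscale frames."""
    return np.abs(frame_a.astype(np.int32) - frame_b.astype(np.int32)).sum()

def detect_boundaries(frames, threshold):
    """Return the indices where the SAD between consecutive frames
    exceeds the (illustrative) threshold."""
    return [i for i in range(1, len(frames))
            if sad(frames[i - 1], frames[i]) > threshold]
```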

Read More
Audio Analysis, Information Retrieval, Machine Learning, Publications, Signal Processing, Speech Processing, Video Analysis

Automatic Emotion Recognition system using Deep Belief Network

Mood is a subjective term describing the emotional state of a human being. It can be expressed in textual form (e.g. Twitter…); this aspect is already addressed in our paper on sentiment analysis. Mood can also be recognized by analyzing facial expressions and/or the nature of the voice. The speech-based Automatic Emotion Recognition (AER) systems discussed here have several types of application, such as emotion detection in call centers, where detecting the caller's emotion can help in taking appropriate decisions. In online video advertising, forecasting the emotion from the speech in a video can be used to fine-tune user targeting. Obviously, emotion detected from speech can be combined with facial expressions and textual information to improve accuracy. Here we will focus on Automatic Emotion Recognition based solely on the analysis of human speech. The system presented is based on a recent machine learning technique, the Deep Belief Network (DBN), an improvement on classical neural networks. We will describe the DBN and the database of emotional speech used to build such an AER system.
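
A very rough sketch of the DBN idea (not the exact setup of the paper): a stack of RBMs is greedily pre-trained layer by layer and topped with a supervised classifier. The layer sizes and learning rates below are arbitrary assumptions, and BernoulliRBM expects inputs scaled to [0, 1].

```python
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Greedy layer-wise pre-training of two RBMs, then a supervised
# classifier on top (a sketch of the DBN principle, not the paper's setup).
dbn = Pipeline([
    ('rbm1', BernoulliRBM(n_components=256, learning_rate=0.05)),
    ('rbm2', BernoulliRBM(n_components=128, learning_rate=0.05)),
    ('clf', LogisticRegression(max_iter=1000)),
])
# dbn.fit(acoustic_features, emotion_labels)  # hypothetical training data
```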

Read More
Information Retrieval, Publications, Speech Processing, Video Analysis

UBM-GMM based Text-Independent Speaker Recognition

Speaker recognition systems are often used in the field of security; a common example is voice-based client authentication for secured applications. Another application is the segmentation of an audio stream into homogeneous parts, where each segment corresponds to one speaker's speech. This process can also be very useful for improving the accuracy of speech recognition systems. Speaker recognition can also be used in the field of audio indexing: recognizing the identity of the speakers in a multi-speaker audio stream provides usable knowledge about its content. Two types of speaker recognition systems exist: text-dependent and text-independent. In the former, the verification texts and those recorded during the enrollment phase are the same. Since in online video indexing the sentences are a priori unknown, we will focus here on text-independent systems. This paper is organized as follows: first we present the GMM classifier, and then the principle of Log-Likelihood Ratio (LLR) detection used to make a decision based on the score of a test utterance.
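
A minimal sketch of the detection step, assuming MFCC-like feature matrices of shape (frames, dimensions). The random arrays are stand-ins for real data, the speaker model is trained directly rather than MAP-adapted from the UBM, and the decision threshold is illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

ubm_features = np.random.randn(5000, 13)     # background data (stand-in)
spk_features = np.random.randn(500, 13)      # enrollment data (stand-in)
test_features = np.random.randn(300, 13)     # test utterance (stand-in)

# Train a UBM and a speaker model (the usual MAP adaptation of the
# speaker model from the UBM is omitted in this sketch).
ubm = GaussianMixture(n_components=64, covariance_type='diag').fit(ubm_features)
spk = GaussianMixture(n_components=64, covariance_type='diag').fit(spk_features)

# LLR detection: accept the claimed identity if the average
# log-likelihood ratio exceeds a tuned threshold.
llr = spk.score(test_features) - ubm.score(test_features)
accepted = llr > 0.0  # illustrative threshold
```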

Read More
Audio Analysis, Information Retrieval, Machine Learning, Publications, Speech Processing

Gaussian Mixture Model Supervectors

Gaussian Mixture Model (GMM) supervectors (GSVs) are generally used in speaker recognition tasks. However, they can also be used for the classification of audio events, especially when the training dataset is very limited. This is the case for the recognition of some types of sound, such as "gunshots", where the variation from one sample to another is small (so the number of available stimuli of these types can be limited). Thus, in a supervised classification, rather than directly using the feature vectors as the classifier input, they are first transformed into GSVs. This transformation aims at compensating for the limited variability of the stimuli in the training database. In the following, we will give an introduction to Gaussian Mixture Models and then present the GSV concept.
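
A minimal sketch of how a GSV can be computed, assuming a diagonal-covariance UBM trained beforehand: the UBM means are MAP-adapted to the utterance and stacked into one fixed-size vector (means-only adaptation; the relevance factor value is an assumption).

```python
import numpy as np

def gmm_supervector(ubm, features, relevance=16.0):
    """Map a variable-length feature sequence (T, D) to a fixed-size
    supervector by MAP-adapting the UBM means and stacking them."""
    post = ubm.predict_proba(features)           # (T, C) posteriors
    n_c = post.sum(axis=0)                       # soft counts per component
    f_c = post.T @ features                      # first-order statistics (C, D)
    alpha = n_c / (n_c + relevance)              # adaptation coefficients
    ex = f_c / np.maximum(n_c, 1e-10)[:, None]   # posterior mean per component
    adapted = alpha[:, None] * ex + (1 - alpha)[:, None] * ubm.means_
    return adapted.ravel()                       # supervector of size C * D

# ubm = GaussianMixture(n_components=64, covariance_type='diag').fit(background)
# gsv = gmm_supervector(ubm, utterance_features)
```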

Read More
Publications, Speech Processing, Speech Recognition

Acoustic Feature Compensation using Vector Taylor Series

Environmental noise has a negative impact on the accuracy of speech and sound recognition systems: background noise corrupts the acoustic features of the sound. Because automatic recognition models are usually trained on a database of "clean" signals (no background noise), decoding is biased when the signal is corrupted by additive noise and channel distortion. Several techniques have been proposed in the literature to overcome this problem. One solution is to denoise the speech signal before it is processed by the Automatic Speech Recognition (ASR) system, in practice by applying noise reduction techniques based on the Wiener filter or the Ephraim-Malah filter. Another solution is to train the ASR system under a variety of environmental conditions; however, this requires a large memory capacity to store all the noisy signals. A third idea is to estimate the "noisy" acoustic model (a Hidden Markov Model, HMM) from the "clean" one. Two common techniques using this approach are Parallel Model Combination (PMC) [1] and Vector Taylor Series (VTS) compensation, which will be presented here.
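
As a sketch of the VTS idea: in the log-Mel domain the noisy feature can be modeled as y = x + log(1 + exp(n - x)) (the channel term is omitted here for simplicity), and a first-order Taylor expansion around the clean and noise means adapts the Gaussian parameters of the acoustic model.

```python
import numpy as np

def vts_adapt(mu_x, var_x, mu_n, var_n):
    """First-order VTS adaptation of a diagonal Gaussian from the
    'clean' domain to the noisy one (log-Mel domain, channel omitted)."""
    g = np.log1p(np.exp(mu_n - mu_x))        # mismatch function at the expansion point
    G = 1.0 / (1.0 + np.exp(mu_n - mu_x))    # Jacobian dy/dx (diagonal)
    mu_y = mu_x + g                          # adapted mean
    var_y = G**2 * var_x + (1.0 - G)**2 * var_n  # adapted variance
    return mu_y, var_y
```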

Read More
Information Retrieval, Publications, Speech Processing, Speech Recognition

PNCC features for ASR robustness enhancement

The acoustic features traditionally used in speech and audio processing are MFCC and PLP. However, an important point in designing an acoustic signal fingerprint is to use a robust feature. Consequently, several techniques aim to enhance MFCC and PLP, for example with mean and variance normalization, or with RASTA filtering combined with variance normalization in the particular case of PLP. Here, we present a new type of acoustic feature that directly implements a noise reduction algorithm: the Power-Normalized Cepstral Coefficients (PNCC) introduced by Chanwoo Kim [1]. This feature is more robust against background noise than the traditional PLP and MFCC features.
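
PNCC itself is a full pipeline (gammatone filtering, medium-duration power bias subtraction, etc., as described in [1]); as a small illustration of two ingredients touched on above, here are the power-law nonlinearity that PNCC uses in place of MFCC's log compression, and a per-utterance mean and variance normalization.

```python
import numpy as np

def power_law(mel_power, exponent=1.0 / 15.0):
    """PNCC replaces the log compression of MFCC with a power-law
    nonlinearity (exponent 1/15 in [1]), better behaved at low energies."""
    return mel_power ** exponent

def cmvn(features, eps=1e-10):
    """Per-utterance mean and variance normalization of a (frames, dims)
    feature matrix, one of the classical robustness enhancements."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + eps)
```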

Read More
Publications, Speech Processing, Speech Recognition, Video Analysis

Speaker normalization in ASR: Vocal Tract Length Normalization (VTLN)

In video content modeling, the ASR system must be speaker-independent, since the speakers in the different videos are unknown. However, speaker-independent ASRs are less accurate than speaker-dependent ones because of the variability of speech from one speaker to another. This variability stems from speaker-dependent parameters such as the pitch (fundamental frequency) and the formant frequencies. In speech production, while the vocal tract shape carries the phonetic information and is therefore of great importance in speech recognition, the vocal tract length can be considered merely as noise. The vocal tract length varies from about 13 cm (women) to 18 cm (men), so the formant center frequencies, which depend on it, can vary considerably. Consequently, the acoustic features of the same speech pronounced by different speakers can vary significantly. To mitigate this problem, two main solutions are used: speaker adaptation and speaker normalization. Here we present a technique based on the second solution: frequency-warping-based Vocal Tract Length Normalization (VTLN).
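
A minimal sketch of the piecewise-linear frequency warping commonly used for VTLN: frequencies below a cut-off are scaled by a warp factor alpha, and above it the warp is linear so that the maximum frequency maps onto itself. The cut-off ratio is an assumption; alpha is typically searched over a small grid (roughly 0.88 to 1.12) to maximize the likelihood of the utterance.

```python
import numpy as np

def piecewise_linear_warp(freqs, alpha, f_max, cut_ratio=0.85):
    """Warp a frequency axis for VTLN; continuous at the cut-off and
    fixed at f_max so the warped axis spans the same range."""
    f_cut = cut_ratio * f_max
    return np.where(
        freqs <= f_cut,
        alpha * freqs,
        alpha * f_cut + (f_max - alpha * f_cut) * (freqs - f_cut) / (f_max - f_cut),
    )
```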

Read More
Information Retrieval, Machine Learning, Publications, Speech Processing, Speech Recognition

HMM-based ASR

Automatic Speech Recognition (ASR) is a system whose purpose is to convert speech into text. Several types of ASR systems have been designed by speech processing researchers; however, those based on Hidden Markov Models (HMMs) are the most accurate. Here, we will focus on the principle of the HMM.
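
As a small illustration of the HMM machinery, the forward algorithm below computes the log-likelihood of an observation sequence; emission, transition and initial probabilities are assumed to be given in the log domain for numerical stability.

```python
import numpy as np

def forward_log_likelihood(log_b, log_A, log_pi):
    """p(observations | HMM) via the forward algorithm.
    log_b: (T, N) log emission probabilities per frame and state,
    log_A: (N, N) log transition matrix, log_pi: (N,) log initial probs."""
    alpha = log_pi + log_b[0]
    for t in range(1, len(log_b)):
        m = alpha.max()  # log-sum-exp trick over previous states
        alpha = np.log(np.exp(alpha - m) @ np.exp(log_A)) + m + log_b[t]
    return np.logaddexp.reduce(alpha)
```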

Read More
Publications, Speech Processing

Blind Audio Source Separation

Blind Audio Source Separation (BASS) is a crucial problem in the field of speech and audio processing. Its goal is to separate the different sources present in a mixture. In audio mining, analyzing the content of an audio signal generally consists in designing a fingerprint or pattern of a given sound to be recognized. When the signal is a mixture, however, it becomes difficult to extract suitable features that characterize a particular sound.
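
As one classical illustration (other methods exist), Independent Component Analysis can separate sources under an instantaneous linear-mixing assumption; the sketch below recovers two synthetic sources with scikit-learn's FastICA (up to permutation and scaling).

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two synthetic sources mixed by an unknown linear, instantaneous matrix.
t = np.linspace(0, 1, 8000)
s = np.c_[np.sin(2 * np.pi * 440 * t),            # a tone
          np.sign(np.sin(2 * np.pi * 3 * t))]     # a square wave
A = np.array([[1.0, 0.5], [0.4, 1.0]])            # mixing matrix
x = s @ A.T                                       # observed 'microphone' signals

ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x)                      # estimated sources (columns)
```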

Read More
Information Retrieval, Machine Learning, Publications, Speech Processing

Spoken Language Recognition

The objective of Spoken Language Identification (LID) is to automatically recognize the language spoken in an unknown speech signal. Such a system has several applications, such as speech-to-speech machine translation and telephone-based services. In the case of Automatic Speech Recognition (ASR), LID allows the selection of the appropriate parameters of the ASR system. There are two main types of LID systems: those based on spectral features and those based on tokens.
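
A minimal sketch of the spectral-feature approach, assuming one training matrix of MFCC-like features per language: a GMM is trained per language, and the decision is the model with the highest average log-likelihood on the test utterance.

```python
from sklearn.mixture import GaussianMixture

def train_lid(features_by_language, n_components=64):
    """One diagonal-covariance GMM per language over spectral features."""
    return {lang: GaussianMixture(n_components=n_components,
                                  covariance_type='diag').fit(feats)
            for lang, feats in features_by_language.items()}

def identify(models, utterance_features):
    """Return the language whose model best explains the utterance."""
    return max(models, key=lambda lang: models[lang].score(utterance_features))
```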

Read More