
UBM-GMM based Text-Independent Speaker Recognition

Speaker recognition systems are often used in the field of security; a common example is client voice authentication for secured applications. Another application is the segmentation of an audio stream into homogeneous parts, where each segment corresponds to a single speaker's speech. This process can also be very useful for improving the accuracy of speech recognition systems. Speaker recognition can also be used for audio indexing: recognizing the identity of the speakers in a multi-speaker audio stream provides useful knowledge about its content. Two types of speaker recognition systems exist: text-dependent and text-independent. In the former, the verification texts and those recorded during the enrollment phase are the same. Since in online video indexing the spoken sentences are a priori unknown, we will focus here on text-independent systems. This paper is organized as follows: we first present the GMM classifier, and then the log-likelihood ratio (LLR) detection principle used to accept or reject a tested utterance based on its score.
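To illustrate the LLR detection principle, here is a minimal sketch using scikit-learn's GaussianMixture with made-up data; it is our illustration, not the system described in the paper. A full system would train the UBM on many background speakers and derive the target model by MAP adaptation rather than training it from scratch on the small enrollment set.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder data: rows are acoustic feature frames (e.g. MFCCs).
rng = np.random.default_rng(0)
ubm_features = rng.normal(0.0, 1.0, (2000, 13))     # pooled background speakers
target_features = rng.normal(0.5, 1.0, (400, 13))   # enrolled speaker
test_features = rng.normal(0.5, 1.0, (200, 13))     # utterance to verify

# Universal Background Model (UBM) trained on the background data.
ubm = GaussianMixture(n_components=16, covariance_type="diag",
                      random_state=0).fit(ubm_features)

# Target speaker model (a real system would MAP-adapt the UBM instead).
target = GaussianMixture(n_components=16, covariance_type="diag",
                         random_state=0).fit(target_features)

# LLR detection: score() returns the average log-likelihood per frame,
# so the LLR is the difference of the two scores on the test utterance.
llr = target.score(test_features) - ubm.score(test_features)
threshold = 0.0  # tuned on a development set in practice
print("accept" if llr > threshold else "reject", f"(LLR = {llr:.3f})")
```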


Gaussian Mixture Model Supervectors

Gaussian Mixture Model (GMM) supervectors (GSV) are generally used in speaker recognition tasks. However, they can also be used for the classification of audio events, especially when the training dataset is very limited. This is the case for the recognition of some types of sound, such as “gunshots”, where the variation from one sample to another is small, so the number of available stimuli of these types can be limited. Thus, in a supervised classification, rather than directly using the feature vectors as the classifier input, they are first transformed into GSVs. This transformation aims at compensating for the limited variability of the stimuli in the training database. In the following, we give a short introduction to Gaussian Mixture Models and then present the GSV concept.
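As a preview of the GSV concept, here is a hedged sketch of the classical construction: MAP-adapt the UBM means to one audio clip and stack the adapted means into a single fixed-length vector. The function name and the relevance factor value are our assumptions; 16 is merely a typical choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_supervector(ubm, features, relevance=16.0):
    """Sketch of a GMM supervector: MAP-adapt the UBM means to one
    audio clip, then stack the adapted means into a single vector."""
    # Posterior probability of each mixture component for each frame.
    gamma = ubm.predict_proba(features)            # (n_frames, n_comp)
    n_k = gamma.sum(axis=0)                        # soft occupation counts
    # First-order statistics: posterior-weighted mean of the frames.
    e_k = gamma.T @ features / np.maximum(n_k, 1e-10)[:, None]
    # Classical relevance-MAP interpolation between data and UBM means.
    alpha = (n_k / (n_k + relevance))[:, None]
    adapted_means = alpha * e_k + (1.0 - alpha) * ubm.means_
    return adapted_means.ravel()                   # (n_comp * n_dims,)

# Hypothetical usage: the UBM is trained on pooled background audio,
# and each training clip is mapped to one fixed-length supervector.
rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(rng.normal(size=(1000, 13)))
clip = rng.normal(size=(150, 13))
sv = gmm_supervector(ubm, clip)
print(sv.shape)  # (8 * 13,) -> input vector for e.g. an SVM classifier
```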


Acoustic Feature Compensation using Vector Taylor Series

Environmental noise has a negative impact on the accuracy of speech and sound recognition systems: background noise corrupts the acoustic features of the sound. Because automatic recognition models are usually trained on a database of “clean” signals (no background noise), the decoding is biased when the signal is corrupted by additive noise and channel distortion. Several techniques have been proposed in the literature to overcome this problem. One solution is to denoise the speech signal before it is processed by the ASR system, in practice by applying noise reduction techniques based on the Wiener filter or the Ephraim-Malah filter. Another solution is to train the ASR system under a variety of environmental conditions; however, this requires a large memory capacity to store all the noisy signals. A third idea is to estimate the “noisy” acoustic model (Hidden Markov Model, HMM) from the “clean” acoustic model. Two common techniques using this approach are Parallel Model Combination (PMC) [1] and Vector Taylor Series (VTS) compensation, which is presented here.
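To make the idea concrete, here is a small numerical sketch of first-order VTS in the log-mel domain. The mismatch function below is the standard one for additive noise n and channel h; the variable names and example values are ours, not taken from the article.

```python
import numpy as np

# Log-spectral-domain mismatch function: a noisy feature y relates to
# the clean feature x, additive noise n and channel h (all log-mel) by
#     y = x + h + log(1 + exp(n - x - h))
def mismatch(x, n, h):
    return x + h + np.log1p(np.exp(n - x - h))

# VTS compensation: a first-order Taylor expansion of the mismatch
# function around the clean-model mean mu_x and the noise estimates
# (mu_n, mu_h) gives the "noisy" model mean and the Jacobian G used
# to transform the covariances.
def vts_adapt_mean(mu_x, mu_n, mu_h):
    mu_y = mismatch(mu_x, mu_n, mu_h)
    # G = dy/dx evaluated at the expansion point.
    G = 1.0 / (1.0 + np.exp(mu_n - mu_x - mu_h))
    return mu_y, G

# Hypothetical usage with made-up log-mel means (one Gaussian, 5 bands).
mu_x = np.array([1.0, 2.0, 1.5, 0.5, 1.2])   # clean model mean
mu_n = np.full(5, 0.8)                        # noise mean estimate
mu_h = np.zeros(5)                            # channel assumed flat
mu_y, G = vts_adapt_mean(mu_x, mu_n, mu_h)
# Diagonal covariance update: sigma_y^2 = G^2 sigma_x^2 + (1-G)^2 sigma_n^2
print(mu_y, G)
```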


PNCC features for ASR robustness enhancement

The acoustic features traditionally used in speech and audio processing are MFCC and PLP. However, one important point in designing an acoustic signal fingerprint is to use robust features. Consequently, several techniques aim to enhance MFCC and PLP, for example mean and variance normalization, or RASTA filtering combined with variance normalization in the particular case of PLP. Here, we present a type of acoustic feature that directly implements a noise reduction algorithm: Power-Normalized Cepstral Coefficients (PNCC), introduced by Chanwoo Kim [1]. This feature is more robust against background noise than the traditional PLP and MFCC features.
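The full PNCC pipeline (gammatone filterbank, medium-time power bias subtraction, temporal masking, power-law nonlinearity) is described in [1]. The sketch below only illustrates two of its ingredients under simplifying assumptions of our own; it is not Kim's algorithm.

```python
import numpy as np

def pncc_like(power_spec, alpha=0.999, power_exp=1.0 / 15.0):
    """Very simplified sketch of two PNCC ingredients: (1) subtraction
    of a slowly varying noise-floor estimate, (2) power-law compression
    x**(1/15) instead of the log used by MFCC. power_spec: a (frames,
    bands) array of filterbank power values."""
    floor = power_spec[0].copy()
    out = np.empty_like(power_spec)
    for t, frame in enumerate(power_spec):
        # Asymmetric noise-floor tracking: the floor follows dips
        # immediately but rises only slowly, a crude stand-in for
        # PNCC's medium-time asymmetric filtering.
        floor = np.where(frame < floor, frame,
                         alpha * floor + (1 - alpha) * frame)
        out[t] = np.maximum(frame - floor, 1e-10)
    # Power-law nonlinearity; a DCT over the bands would then yield
    # cepstral coefficients, as in the MFCC pipeline.
    return out ** power_exp

# Hypothetical usage on a random "spectrogram".
spec = np.abs(np.random.default_rng(0).normal(size=(100, 40))) ** 2
features = pncc_like(spec)
print(features.shape)
```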


Speaker normalization in ASR: Vocal Tract Length Normalization (VTLN)

In video content modeling, the ASR system used must be speaker-independent, since the speakers in the different videos are unknown. However, the accuracy of speaker-independent ASR is lower than that of speaker-dependent systems, due to speech variability from one speaker to another. This variability comes from speaker-dependent parameters such as the pitch (fundamental frequency) and the formant frequencies. In speech production, while the vocal tract shape carries the phonetic information and is therefore of great importance for speech recognition, the vocal tract length can be considered only as noise. This length varies from about 13 cm (women) to 18 cm (men), and the formant center frequencies, which depend on it, can vary considerably. Consequently, the acoustic features of the same speech pronounced by different speakers can vary significantly. To mitigate this problem, two main solutions are used: speaker adaptation and speaker normalization. Here we present a technique based on the second solution: frequency-warping-based Vocal Tract Length Normalization (VTLN).
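As a taste of frequency warping, here is a sketch of one common piecewise-linear VTLN variant (similar to HTK-style systems); the cutoff fraction and the function name are our assumptions, not necessarily the exact scheme of the article.

```python
import numpy as np

def vtln_warp(freq, alpha, f_max, f_cut=0.85):
    """Piecewise-linear VTLN warping: scale frequencies by alpha below
    a cutoff, then continue linearly so that f_max maps onto f_max."""
    f0 = f_cut * f_max            # boundary of the purely linear region
    freq = np.asarray(freq, dtype=float)
    low = alpha * freq
    # Linear segment joining (f0, alpha*f0) to (f_max, f_max).
    slope = (f_max - alpha * f0) / (f_max - f0)
    high = alpha * f0 + slope * (freq - f0)
    return np.where(freq <= f0, low, high)

# Hypothetical usage: warp mel filterbank center frequencies for one
# speaker; alpha is typically searched over roughly [0.88, 1.12] by
# picking the value maximizing the likelihood of the speaker's speech.
centers = np.linspace(100, 8000, 24)          # Hz, for 16 kHz audio
for alpha in (0.9, 1.0, 1.1):
    print(alpha, vtln_warp(centers, alpha, f_max=8000)[:3])
```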


Image Segmentation

Image segmentation aims at splitting an image into partitions, which should usually correspond to meaningful parts of the scene. This technique is used for object identification in digital images (face recognition, relevant information retrieval). There are many ways to perform image segmentation, such as image thresholding, region-based segmentation, and the Hough transform.
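As an example of the thresholding family, here is a minimal sketch of Otsu's method, which picks the threshold that maximizes the between-class variance of the grayscale histogram; it is our choice of illustration, not necessarily the method discussed in the article.

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: choose the threshold maximizing the between-class
    variance of the histogram. gray: 2-D array of uint8 intensities."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                    # class-0 probability
    mu = np.cumsum(prob * np.arange(256))      # cumulative mean
    mu_total = mu[-1]
    # Between-class variance for every candidate threshold.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.nanargmax(sigma_b))

# Hypothetical usage on a synthetic two-region image.
rng = np.random.default_rng(0)
img = np.concatenate([rng.normal(60, 10, (64, 128)),
                      rng.normal(180, 10, (64, 128))]).clip(0, 255).astype(np.uint8)
t = otsu_threshold(img)
mask = img > t                                 # binary segmentation
print("threshold:", t)
```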


Edge Detection

Edge detection is an essential step in any Computer Vision (CV) system, and it is also one of the principal steps of human vision: the Human Visual System (HVS) has dedicated cells for contour detection. This step reduces the amount of information to be retained, keeping only what is essential. An edge can be seen as an abrupt change in intensity at some location of the image. In CV, edge detection is used for image segmentation or for identifying an object in an image.
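A classical way to detect these abrupt intensity changes is gradient-based filtering; the sketch below uses Sobel filters (our choice of example) to build a binary edge map.

```python
import numpy as np
from scipy import ndimage

def sobel_edges(gray, threshold=0.2):
    """Gradient-based edge detection: Sobel filters estimate the
    horizontal and vertical intensity derivatives, and pixels where
    the gradient magnitude is large are marked as edges.
    gray: 2-D float array in [0, 1]."""
    gx = ndimage.sobel(gray, axis=1)   # derivative along columns
    gy = ndimage.sobel(gray, axis=0)   # derivative along rows
    magnitude = np.hypot(gx, gy)
    magnitude /= magnitude.max() + 1e-12
    return magnitude > threshold       # boolean edge map

# Hypothetical usage on a synthetic image with one vertical edge.
img = np.zeros((64, 64))
img[:, 32:] = 1.0
edges = sobel_edges(img)
print(edges.sum(), "edge pixels found")
```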


HMM-based ASR

An ASR system is one whose purpose is to convert speech into text. Several types of ASR systems have been designed by speech processing researchers; however, those based on HMMs (Hidden Markov Models) are the most accurate. Here, we will focus on the principle of the HMM.
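At the heart of HMM decoding is the Viterbi algorithm, which finds the most likely hidden-state sequence given per-frame observation likelihoods. Below is a minimal sketch with toy numbers of our own; a real ASR system decodes over phone and word graphs with Gaussian-mixture or neural observation models.

```python
import numpy as np

def viterbi(log_A, log_pi, log_B):
    """Viterbi decoding. log_A: (S, S) transition log-probs,
    log_pi: (S,) initial log-probs, log_B: (T, S) per-frame
    observation log-likelihoods. Returns the best state path."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A        # (from-state, to-state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    # Backtrack the best path from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Hypothetical toy usage: a 3-state left-to-right HMM, 5 frames.
rng = np.random.default_rng(0)
log_A = np.log(np.array([[0.8, 0.2, 0.0],
                         [0.0, 0.8, 0.2],
                         [0.0, 0.0, 1.0]]) + 1e-12)
log_pi = np.log(np.array([1.0, 0.0, 0.0]) + 1e-12)
log_B = rng.normal(size=(5, 3))
print(viterbi(log_A, log_pi, log_B))
```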


Big Data: Basics, MapReduce

The evolution of the Internet has generated exponential growth in the volume of data to be stored and processed. This data is usually unstructured and very diverse (log files, online discussions, user traffic logs, banking, weather or satellite information, etc.). Its sheer volume and diversity often make it impossible to store and process with conventional technology based on relational systems (RDBMS) or objects. A new approach has emerged in the last few years that discards relational concepts (decomposition into normal form, relational algebra) and the use of structured data manipulation languages such as SQL (Structured Query Language), giving birth to the growing family of “big data” technologies, also known as “NoSQL.”

Most big data systems rely on distributed storage solutions for structured data (such as Google's BigTable), on parallel and distributed processing methods, and on the MapReduce concept that this document addresses. These technologies, now mature, have given rise to various open source or commercial applications such as Hadoop [1] (Apache), Cassandra [2] (Facebook) or MongoDB [3].
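To preview the MapReduce concept, here is a minimal single-process word-count sketch: the map phase emits (key, value) pairs, the shuffle phase groups values by key, and the reduce phase aggregates each group. In Hadoop these phases run distributed across a cluster; here they run in one loop.

```python
from collections import defaultdict

def map_phase(document):
    # Emit one (word, 1) pair per word occurrence.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Aggregate all values emitted for the same key.
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group all emitted values by key.
groups = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        groups[key].append(value)

results = [reduce_phase(word, counts) for word, counts in groups.items()]
print(sorted(results))  # [('brown', 1), ('dog', 1), ('fox', 2), ...]
```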


Blind Audio Source Separation

Blind Audio Source Separation (BASS) is a crucial problem in the field of speech and audio processing. Its goal is to separate the different sources present in a mixture. In audio mining, analyzing the content of an audio signal generally consists of designing a print or pattern of a given sound to be recognized. When the signal is a mixture, however, it becomes difficult to extract suitable features that characterize a particular sound.
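One classical family of BASS techniques for instantaneous mixtures is Independent Component Analysis (ICA). The sketch below applies scikit-learn's FastICA to a synthetic two-microphone mixture; this is our illustration only, since real audio mixtures are convolutive and usually require frequency-domain or more advanced methods.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two synthetic sources mixed instantaneously by an unknown matrix A.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000)
s1 = np.sign(np.sin(2 * np.pi * 5 * t))       # square-wave "source"
s2 = np.sin(2 * np.pi * 440 * t)              # sine "source"
S = np.c_[s1, s2]                             # (n_samples, 2)

A = np.array([[1.0, 0.6],                     # unknown mixing matrix
              [0.4, 1.0]])
X = S @ A.T                                   # observed microphone mixtures

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                  # recovered sources
# Sources come back in arbitrary order and scale, a known ICA ambiguity.
print(S_est.shape)
```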
