Category : Video Analysis

Audio Analysis Machine Learning Publications Signal Processing Video Analysis

Artificial Intelligence and Video Mining: Audio Event Detection Using SVM

In this paper we present a method for analyzing the content of an audio signal using an artificial intelligence technique: Support Vector Machines (SVM). The objective is to detect the different events occurring in an unknown audio signal for information retrieval purposes. In particular, we present the detection of violent events in a video.

There are two types of data mining, depending on whether the aim is to describe or to predict. In the specific case of audio data mining, on the one hand there is a descriptive method, which consists of grouping a set of audio signals into clusters of the most perceptually similar signals. This is unsupervised classification. On the other hand, there is a predictive method, which consists of building a model from a learning database; any new audio signal can then be automatically classified on the basis of the built model. This is supervised classification. The present paper deals with supervised classification.

There are various supervised classification algorithms, such as decision trees, neural networks, etc. However, we chose Support Vector Machines (SVM), which, according to the literature, give good results in real-world applications.
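As a minimal sketch of the supervised approach described above, the snippet below trains scikit-learn's `SVC` on synthetic feature vectors standing in for audio descriptors. The features, class labels and parameter choices are illustrative assumptions, not the paper's actual setup.

```python
# Sketch: supervised audio-event classification with an SVM.
# Each signal is assumed already summarized by a fixed-length
# feature vector (e.g. energy and spectral statistics).
# All feature values below are synthetic, for illustration only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two synthetic classes: "violent" events (high energy) vs. "neutral".
X_violent = rng.normal(loc=2.0, scale=0.5, size=(40, 4))
X_neutral = rng.normal(loc=0.0, scale=0.5, size=(40, 4))
X = np.vstack([X_violent, X_neutral])
y = np.array([1] * 40 + [0] * 40)  # 1 = violent, 0 = neutral

# RBF-kernel SVM, a common choice for real-valued acoustic features.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# Classify a new, unseen feature vector.
new_event = rng.normal(loc=2.0, scale=0.5, size=(1, 4))
print(clf.predict(new_event))
```

In a real system the learning database would contain labeled audio segments, and the feature extraction step would replace the synthetic vectors.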

Firstly, we will describe the database, or corpus. In the second section, we will present the features used to describe the stimuli of the corpus. The third part of the paper will be devoted to a brief theory of the SVM algorithm. Finally, we will present the results of our study before drawing conclusions from this work.

Read More
Audio Analysis Machine Learning Publications Signal Processing Video Analysis

Random Forest Classifier and Bag of Audio Words concept applied to audio scene recognition

Bag of Audio Words (BoAW) is a concept inspired by the text mining research area. The idea is to represent any audio signal as a document of words, where each word corresponds to an acoustic feature. The concept was successfully applied in image processing, where a bag of visual words is generated using an unsupervised classifier such as k-means. Here we will describe how to design a Bag of Words in the speech/audio signal case. Since the final goal is to build an audio/speech pattern recognition system, we will use the Random Forest (RF) classifier as the supervised classifier; it is well adapted to large data sets with a very high number of features and has good robustness properties that guard against overfitting.
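The pipeline above can be sketched as follows: frame-level features are quantized with k-means to form a "word" histogram per signal, and a Random Forest is trained on the histograms. The frame features, codebook size and class structure here are assumptions for demonstration.

```python
# Sketch of a Bag-of-Audio-Words pipeline (assumed setup, not the
# paper's exact code): k-means builds the audio-word codebook,
# each signal becomes a word-count histogram, and a Random Forest
# classifies the histograms. Features are synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

def make_signal(center, n_frames=50, dim=3):
    """Stand-in for frame-level acoustic features (e.g. MFCCs)."""
    return rng.normal(loc=center, scale=0.3, size=(n_frames, dim))

signals = [make_signal(0.0) for _ in range(10)] + \
          [make_signal(1.5) for _ in range(10)]
labels = [0] * 10 + [1] * 10  # two audio scene classes

# 1) Learn the audio-word codebook on all frames.
codebook = KMeans(n_clusters=8, n_init=10, random_state=0)
codebook.fit(np.vstack(signals))

# 2) Represent each signal as a normalized histogram of word counts.
def boaw_histogram(frames):
    words = codebook.predict(frames)
    hist = np.bincount(words, minlength=8).astype(float)
    return hist / hist.sum()

X = np.array([boaw_histogram(s) for s in signals])

# 3) Train the Random Forest on the BoAW histograms.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(rf.predict([boaw_histogram(make_signal(1.5))]))
```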

Read More
Audio Analysis Information Retrieval Machine Learning Publications Signal Processing Speech Processing Video Analysis

Automatic Emotion Recognition system using Deep Belief Network

Mood is a subjective term describing the emotional state of a human being. It can be expressed in textual form (e.g. Twitter …). Let us remember that this topic is already addressed in our paper on sentiment analysis. On the other hand, mood can be recognized by analyzing facial expressions and/or the nature of the voice. The speech-based Automatic Emotion Recognition (AER) systems discussed here have several types of application, such as emotion detection in call centers, where detecting the caller's emotion can be helpful in taking appropriate decisions. In the case of online video advertising, forecasting the emotion from speech signals in video can be useful to fine-tune user targeting. Obviously, emotion detected from speech can be combined with facial expressions and textual information to improve accuracy. Here we will focus on Automatic Emotion Recognition based solely on an analysis of human speech. The system that will be presented is based on a recent machine learning technique: the Deep Belief Network (DBN), an improvement on classical neural networks. We will describe the DBN and the database of emotional speech used to build such an AER system.
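A full DBN stacks several Restricted Boltzmann Machines; as a much-simplified sketch of the idea, the snippet below uses scikit-learn's `BernoulliRBM` as a single unsupervised layer followed by a logistic classifier. The emotion classes and feature values are synthetic assumptions, not the actual emotional-speech database.

```python
# Simplified DBN-style pipeline: one RBM layer learns a hidden
# representation without labels, then a logistic classifier is
# trained on top. A real DBN would stack several RBM layers.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(2)
# Synthetic stand-ins for per-utterance speech features, scaled to
# [0, 1] because BernoulliRBM expects values in that range.
X_angry = rng.uniform(0.6, 1.0, size=(30, 16))
X_calm = rng.uniform(0.0, 0.4, size=(30, 16))
X = np.vstack([X_angry, X_calm])
y = np.array([1] * 30 + [0] * 30)  # 1 = angry, 0 = calm (assumed labels)

model = Pipeline([
    ("rbm", BernoulliRBM(n_components=8, learning_rate=0.05,
                         n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

print(model.predict(X[:1]))
```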

Read More
Computer Vision Image Processing Publications Video Analysis

Video Quality Assessment (VQA)

The growing ubiquity of audiovisual content on the Internet has increased the importance of online advertising. With technological improvements, the quality of digital video rendering keeps improving as well, which in turn makes users’ requirements stricter and stricter. Video quality is therefore an important element to consider in online video advertising: it goes without saying that a high-quality video is more likely to interest users than a low-quality one. It is thus crucial to be able to quantify a video’s quality. Nevertheless, the multiplicity of video formats and the various types of communication networks (wireless, fiber, xDSL…) make video quality assessment complex. Since the “end receiver” of video is human, the most accurate VQA is subjective (performed by humans). However, subjective assessment is time-consuming, and it depends on the person who evaluates it (mood, culture…). Thus, researchers have designed objective assessment methods that model subjective ones. The advantage of objective VQA methods is that they can operate in real time. We will focus here on objective assessment processes for the quality of video signals.
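As one concrete example of an objective, full-reference quality measure, the sketch below computes the classic PSNR (Peak Signal-to-Noise Ratio) between a reference frame and a distorted copy; the frame data is synthetic and PSNR is only one of many objective VQA metrics.

```python
# PSNR: a simple full-reference objective quality metric.
# Higher PSNR means the distorted frame is closer to the reference.
import numpy as np

def psnr(reference, distorted, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB."""
    mse = np.mean((reference.astype(float) - distorted.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(3)
frame = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)  # reference
noise = rng.integers(-10, 11, size=frame.shape)              # distortion
noisy = np.clip(frame.astype(int) + noise, 0, 255)

print(round(psnr(frame, noisy), 1))
```

More sophisticated objective metrics (e.g. SSIM) additionally model properties of the Human Visual System rather than raw pixel error.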

Read More
Information Retrieval Publications Speech Processing Video Analysis

UBM-GMM based Text-Independent Speaker Recognition

Speaker recognition systems are often used in the field of security; a common example is client voice authentication for secured applications. Another application of speaker recognition is the segmentation of speech into homogeneous parts, where each segment corresponds to one speaker’s speech. This process can also be very useful for improving the accuracy of speech recognition systems. Speaker recognition can also be used in the field of audio indexing: recognizing the identity of the speakers in a multi-speaker audio stream can provide usable knowledge about its content. Two types of speaker recognition system exist: text-dependent and text-independent systems. The former are speaker recognition systems where the verification texts and those saved during the enrollment phase are the same. Since in online video indexing the sentences are a priori unknown, we will focus here on text-independent systems. This paper is organized as follows: first we present the GMM classifier, and then the principle of LLR (Log-Likelihood Ratio) detection used to make an accept/reject decision from the score of a tested utterance.
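The GMM-plus-LLR idea can be sketched with scikit-learn's `GaussianMixture`: a background model (UBM) trained on pooled data, a target speaker model, and a decision score equal to the difference of average log-likelihoods. In practice the speaker model is derived from the UBM by MAP adaptation; here it is trained directly, and all data is synthetic, to keep the sketch short.

```python
# UBM-GMM speaker verification sketch with a log-likelihood-ratio score.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
# Pooled background data (many speakers) trains the UBM;
# one target speaker's enrollment data trains the speaker model.
ubm_data = rng.normal(0.0, 2.0, size=(500, 2))
speaker_data = rng.normal(1.5, 0.5, size=(100, 2))

ubm = GaussianMixture(n_components=4, random_state=0).fit(ubm_data)
spk = GaussianMixture(n_components=4, random_state=0).fit(speaker_data)

def llr(utterance):
    """Average log-likelihood ratio: speaker model vs. UBM.
    score() returns the mean per-sample log-likelihood."""
    return spk.score(utterance) - ubm.score(utterance)

# Accept if the LLR exceeds a threshold (0 here, for illustration).
target_test = rng.normal(1.5, 0.5, size=(50, 2))     # same speaker
impostor_test = rng.normal(-2.0, 0.5, size=(50, 2))  # different speaker
print(llr(target_test) > 0, llr(impostor_test) > 0)
```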

Read More
Publications Speech Processing Speech Recognition Video Analysis

Speaker normalization in ASR: Vocal Tract Length Normalization (VTLN)

In video content modeling, the ASR system used must be speaker-independent, since the speakers in the different videos are unknown. However, speaker-independent ASRs are less accurate than speaker-dependent ones, due to speech variability from one speaker to another. Speech variability is caused by speaker-dependent parameters such as pitch (fundamental frequency) and formant frequencies. In speech production, while the vocal tract shape affects the phonetic information and is therefore of great importance in speech recognition, the vocal tract length can be considered only as noise. The vocal tract length varies from about 13 cm (women) to 18 cm (men). The formant center frequencies, which depend on the vocal tract length, can therefore vary considerably, and the acoustic features of the same speech pronounced by different speakers can differ significantly. To mitigate the problem of speech variability, two main solutions are used: speaker adaptation and speaker normalization. Here we present a technique based on the second solution: Frequency Warping based Vocal Tract Length Normalization (VTLN).
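A common textbook form of VTLN is a piecewise-linear frequency warp: frequencies below a cut-off are scaled by a speaker-specific factor alpha, and a second linear segment keeps the warp continuous while mapping the maximum frequency onto itself. The function below is an illustrative sketch of that form; the cut-off fraction and alpha range are assumed typical values, not a specific system's settings.

```python
# Piecewise-linear VTLN frequency warping (illustrative sketch).
import numpy as np

def piecewise_linear_warp(freqs, alpha, f_max=8000.0, f_cut=0.875):
    """Warp frequencies by factor alpha below f_cut * f_max; above
    that, a linear segment keeps the warp continuous and maps
    f_max onto itself. Alpha typically lies in about [0.88, 1.12]."""
    f0 = f_cut * f_max
    return np.where(
        freqs <= f0,
        alpha * freqs,
        alpha * f0 + (f_max - alpha * f0) * (freqs - f0) / (f_max - f0),
    )

freqs = np.array([0.0, 1000.0, 4000.0, 8000.0])
print(piecewise_linear_warp(freqs, alpha=1.1))
```

During recognition, the warp is applied to the filterbank frequencies before feature extraction, and alpha is chosen per speaker (e.g. by maximizing the likelihood of the warped features under the acoustic model).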

Read More
Computer Vision Image Processing Information Retrieval Publications Video Analysis

Image Segmentation

Image segmentation aims at splitting an image into partitions. These partitions should usually represent some real part of the global image. The technique is used for object identification (face recognition, relevant information retrieval) in digital images. There are many ways to perform image segmentation, such as image thresholding, region-based segmentation and the Hough Transform.
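The simplest of the methods listed, thresholding, can be sketched in a few lines: pixels brighter than a threshold form one partition (foreground) and the rest form the background. The tiny image and the fixed threshold below are illustrative; in practice the threshold is often chosen automatically (e.g. by Otsu's method).

```python
# Threshold-based image segmentation on a tiny synthetic image.
import numpy as np

image = np.array([
    [ 10,  12, 200, 210],
    [ 11,  13, 205, 215],
    [  9, 190, 208, 220],
    [  8,  10,  12,  11],
])

threshold = 100           # fixed here; Otsu's method could choose it
mask = image > threshold  # True = foreground partition

print(mask.astype(int))
print("foreground pixels:", int(mask.sum()))
```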

Read More
Information Retrieval Publications Video Analysis

Edge Detection

Edge detection is an essential step in any Computer Vision (CV) system. It is also one of the principal steps of human vision: the Human Visual System (HVS) has cells dedicated to contour detection. This step reduces the amount of information to be retained, keeping only what is essential. An edge can be seen as an abrupt change in intensity at some location in the image. In CV, edge detection is used for image segmentation or for identifying an object in an image.
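To make the "abrupt change in intensity" concrete, the sketch below applies the classic Sobel kernels to a synthetic image containing a vertical step edge; the gradient magnitude responds strongly only at the step. The hand-rolled convolution keeps the example self-contained.

```python
# Sobel edge detection on a synthetic step-edge image.
import numpy as np

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T

def filter2d(img, kernel):
    """Valid-mode 2-D filtering (correlation, as is usual in CV)."""
    h, w = kernel.shape
    out = np.zeros((img.shape[0] - h + 1, img.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+h, j:j+w] * kernel)
    return out

# Image with an abrupt intensity change between columns 2 and 3.
image = np.zeros((5, 6))
image[:, 3:] = 255.0

gx = filter2d(image, sobel_x)           # horizontal gradient
gy = filter2d(image, sobel_y)           # vertical gradient
magnitude = np.sqrt(gx**2 + gy**2)      # edge strength

print(magnitude.max())  # strong response at the step edge
```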

Read More