Artificial Intelligence and Video Mining: Audio Event Detection Using SVM

In this paper we present a method for analyzing the content of an audio signal using an artificial intelligence technique: Support Vector Machines (SVM). The objective is to detect the different events occurring in an unknown audio signal for information retrieval purposes. In particular, we present the detection of violent events in a video.

There are two types of data mining, depending on whether the aim is to describe or to predict. In the specific case of audio data mining, the descriptive method consists of grouping a set of audio signals into clusters of perceptually similar signals. This is unsupervised classification. The predictive method, on the other hand, consists of building a model from a learning database, so that any new audio signal can be classified automatically on the basis of that model. This is supervised classification, which is the subject of the present paper.

There are various supervised classification algorithms, such as decision trees, neural networks, etc. We chose the Support Vector Machine (SVM), which, according to the literature, gives good results in real-world applications.

First, we describe the database, or corpus. In the second section, we present the features used to describe the stimuli of the corpus. The third part of the paper is devoted to a brief theory of the SVM algorithm. Finally, we present the results of our study before drawing conclusions from this work.


Database

The objective is to detect screaming and gunshot sounds in an unknown video signal. We therefore built a database composed of three categories (or classes) of audio signals: Screaming, Gunshot (and Explosion), and Other. Each class contains approximately the same number of stimuli in order to avoid bias. The last category (Other) contains speech, music and other environmental samples. All the stimuli were then resampled to 16 kHz before extracting time- and frequency-based features.

Acoustic Features

A total of 40 audio features, or Low-Level Descriptors (LLDs), were extracted (cf. Table 1): 12 MFCC coefficients, the 12 first derivatives (Delta) of the MFCC, the 12 second derivatives (Acceleration) of the MFCC, Loudness, Intensity, Zero-Crossing Rate, and Voicing index. Then 5 statistics (cf. Table 1) were applied to each of these 40 features, yielding 40 × 5 = 200 features to describe each stimulus of the database.



Low-Level Descriptors (40)      Statistics (5)
12 MFCC coefficients            Maximum value
12 Delta MFCC                   Minimum value
12 Delta Delta MFCC             Variation = Maximum value – Minimum value
Loudness                        Arithmetic mean
Intensity                       Standard deviation
Zero-Crossing Rate
Voicing index

Table 1: List of acoustical features and statistics
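The step from 40 descriptors to 200 features can be sketched as follows. This is a minimal illustration, not the extraction code used in the study: the function names are hypothetical, each descriptor is assumed to be available as a list with one value per analysis frame, and the population standard deviation is an assumption since the text does not say which variant was used.

```python
import statistics


def summarize_contour(values):
    """Return the 5 statistics of Table 1 for one descriptor contour."""
    maximum = max(values)
    minimum = min(values)
    return [
        maximum,
        minimum,
        maximum - minimum,          # variation
        statistics.fmean(values),   # arithmetic mean
        statistics.pstdev(values),  # population standard deviation (assumption)
    ]


def describe_stimulus(lld_contours):
    """Concatenate the statistics of every contour into one feature vector."""
    features = []
    for contour in lld_contours:
        features.extend(summarize_contour(contour))
    return features
```

With 40 contours as input, `describe_stimulus` returns a vector of 40 × 5 = 200 values, whatever the duration of the stimulus.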


Support Vector Machines

The purpose of SVM is either classification or regression. In this study we focus on the classification case. Let us consider the following set of $\ell$ pairs:

$$\{(x_i, y_i)\}_{i=1}^{\ell}, \quad x_i \in \mathbb{R}^N, \quad y_i \in \{-1, +1\}$$

where the $x_i$ are N-dimensional elements of the training database and the $y_i$ are their respective labels. More explicitly, in our case $x_i$ is the vector containing the 200 features of a signal $i$ in one of the classes and $y_i$ is its label, namely Gunshot, Screaming or Other. Among these pairs, those that lie closest to the separating hyperplane are called support vectors. The objective is to find a hyperplane that separates the training data into 2 groups: $y_i = +1$ and $y_i = -1$.

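The equation of a hyperplane is:

$$w \cdot x + b = 0 \tag{1}$$

where $w$ is the normal vector of the hyperplane and $b$ its offset. The symbol $\cdot$ denotes the dot product, and this will be the same for all the following equations. Supposing that the training data are separable, a discriminant function can be defined by:

$$f(x) = \operatorname{sign}(w \cdot x + b) \tag{2}$$

Thus, for any training vector $x_i$:

$$\begin{cases} w \cdot x_i + b \geq +1 & \text{if } y_i = +1 \\ w \cdot x_i + b \leq -1 & \text{if } y_i = -1 \end{cases} \tag{3}$$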


The hyperplanes whose equations are $w \cdot x + b = +1$ and $w \cdot x + b = -1$ are called margins. One can show that the distance between these two margins is $2 / \lVert w \rVert$.

On the one hand, to minimize classification errors, the distance between the margins must be maximized. This is equivalent to minimizing $\frac{1}{2} \lVert w \rVert^2$ (it is actually easier to minimize a quadratic function).
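The distance between the two margins can be derived in a few lines. The derivation below is the standard textbook argument, added here for completeness. It writes the margins as $w \cdot x + b = \pm 1$, and the points $x_+$ and $x_-$ and the distance $d$ are notation introduced only for this sketch.

```latex
% x_+ lies on the margin w . x + b = +1, x_- on the margin w . x + b = -1.
\begin{aligned}
w \cdot x_+ + b &= +1, \qquad w \cdot x_- + b = -1 \\
w \cdot (x_+ - x_-) &= 2 \\
d &= \frac{w}{\lVert w \rVert} \cdot (x_+ - x_-) = \frac{2}{\lVert w \rVert}
\end{aligned}
```

The distance $d$ is the projection of $x_+ - x_-$ onto the unit normal $w / \lVert w \rVert$, so maximizing $d$ amounts to minimizing $\lVert w \rVert$, or equivalently $\frac{1}{2} \lVert w \rVert^2$.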

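On the other hand, from equation (3), any training vector $x_i$ is correctly classified if and only if:

$$y_i (w \cdot x_i + b) \geq 1 \tag{4}$$

This leads to the following quadratic optimization problem [1]:

$$\min_{w, b} \ \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i (w \cdot \phi(x_i) + b) \geq 1 \ \text{ for all } i \tag{5}$$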
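The constraint and the objective of this separable problem can be checked numerically on a toy example. The sketch below is illustrative only: it assumes the linear case (the mapping taken as the identity), and the function names, the two 2-D points and the candidate values of w and b are hypothetical, not taken from the study.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def satisfies_margin(w, b, samples):
    """True when every (x, y) pair satisfies y * (w . x + b) >= 1."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in samples)


def objective(w):
    """Quantity minimized in the separable case: 0.5 * ||w||^2."""
    return 0.5 * dot(w, w)


# Hypothetical toy data: one point per class.
samples = [((2.0, 2.0), 1), ((0.0, 0.0), -1)]
feasible = satisfies_margin((0.5, 0.5), -1.0, samples)
too_small = satisfies_margin((0.25, 0.25), -0.5, samples)
```

Here `feasible` is true and `too_small` is false: shrinking w lowers the objective, but only as long as every training vector stays on the correct side of its margin.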


$\phi$ is a function inducing the kernel function defined as $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$. This function maps the training data into a space of higher dimensionality in order to increase the chance of finding a hyperplane that better separates the training data. The usual kernel functions are the linear, polynomial and sigmoid kernels and the RBF (Radial Basis Function). It is this last one that we used in our study, and its equation is:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right), \quad \gamma > 0 \tag{6}$$
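As a concrete illustration, the RBF kernel can be computed in a few lines. This is a minimal sketch in plain Python; the function names are hypothetical, and the width parameter is written `gamma`, following the usual convention for this kernel.

```python
import math


def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def gram_matrix(vectors, gamma):
    """Kernel value for every pair of training vectors."""
    return [[rbf_kernel(u, v, gamma) for v in vectors] for u in vectors]
```

The kernel equals 1 for identical vectors and decays towards 0 as they move apart. A larger `gamma` makes the decay faster, so each training vector influences a smaller neighborhood.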


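In order to overcome the problem of non-separable data, we introduce slack variables $\xi_i$ and a penalty parameter $C$ to relax the constraints. Consequently, equation (5) becomes:

$$\min_{w, b, \xi} \ \frac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i (w \cdot \phi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \ \text{ for all } i \tag{7}$$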
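The effect of the relaxation can be illustrated on a toy example. The sketch below assumes the linear case (the mapping taken as the identity) and uses the fact that, for a given hyperplane, the smallest admissible slack of a training vector is max(0, 1 - y (w . x + b)). The function names and the sample values are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def slack(w, b, x, y):
    """Smallest xi >= 0 such that y * (w . x + b) >= 1 - xi."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))


def soft_margin_objective(w, b, samples, c):
    """0.5 * ||w||^2 + C * (sum of slacks), with the mapping as identity."""
    total_slack = sum(slack(w, b, x, y) for x, y in samples)
    return 0.5 * dot(w, w) + c * total_slack


# Hypothetical toy data: the third point sits on the separating hyperplane.
samples = [((2.0, 2.0), 1), ((0.0, 0.0), -1), ((1.0, 1.0), 1)]
value = soft_margin_objective((0.5, 0.5), -1.0, samples, c=10.0)
```

Only the third point needs a non-zero slack (equal to 1), so `value` is 0.25 + 10 × 1 = 10.25. The penalty parameter sets how heavily such margin violations weigh against a wide margin.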


During the training phase, the 200 features described above are first extracted for each stimulus. The features are then scaled, and the scaling parameters are saved for use during the prediction phase. Each class (Screaming and Gunshot) is then modeled from the scaled features. Since there are more than 2 classes, we chose the One-Against-All strategy [2]. This technique builds one binary classification problem per label. The final decision is obtained by comparing the individual scores.
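The scaling and the One-Against-All decision can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which scaling was used, so a min-max mapping to [-1, 1] is assumed here, and the function names, the score values and the presence of a score for Other are hypothetical.

```python
def fit_scaler(rows):
    """Learn the per-feature minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_scaler(row, scaler):
    """Map each feature to [-1, 1] using the saved training parameters."""
    minima, maxima = scaler
    scaled = []
    for value, lo, hi in zip(row, minima, maxima):
        # A feature that is constant in training carries no information.
        scaled.append(0.0 if hi == lo else 2.0 * (value - lo) / (hi - lo) - 1.0)
    return scaled


def one_against_all(scores):
    """Return the label whose binary classifier gave the highest score."""
    return max(scores, key=scores.get)


# Hypothetical usage: the scaler is fitted once on the training features.
scaler = fit_scaler([[0.0, 10.0], [4.0, 10.0]])
scaled = apply_scaler([2.0, 10.0], scaler)
label = one_against_all({"Screaming": -0.3, "Gunshot": 0.8, "Other": 0.1})
```

The important design point is that `fit_scaler` only ever sees training data. Test features are transformed with the saved parameters, so they may fall slightly outside [-1, 1].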

The parameters $C$ (the error penalty) and $\gamma$ (the RBF kernel width) are tuned using a grid search with cross-validation. A given test signal is first split into segments using a scene change detection based on acoustic features. Once the test signal is split into homogeneous segments (in the acoustic-feature sense), the features of each segment are scaled using the scaling parameters saved during training. Finally, each segment is labeled using the trained model of each class.
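The parameter search can be sketched as an exhaustive loop over candidate pairs. This is a generic outline, not the procedure of the study: `evaluate` is a hypothetical callback that would train an SVM with the given pair and return its mean cross-validation accuracy, and the fold construction and the candidate values are assumptions.

```python
import itertools


def k_fold_indices(n_samples, k):
    """Split sample indices into k (train, validation) index pairs."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    pairs = []
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        pairs.append((train, folds[i]))
    return pairs


def grid_search(evaluate, c_values, gamma_values):
    """Return the (C, gamma) pair with the best score, and that score."""
    best_params, best_score = None, float("-inf")
    for c, gamma in itertools.product(c_values, gamma_values):
        score = evaluate(c, gamma)
        if score > best_score:
            best_params, best_score = (c, gamma), score
    return best_params, best_score


# Hypothetical stand-in for a real cross-validation score.
def toy_score(c, gamma):
    return -((c - 4) ** 2 + (gamma - 0.5) ** 2)


best_params, best_score = grid_search(toy_score, [1, 2, 4, 8], [0.25, 0.5, 1.0])
```

With the toy score above, the search returns the pair (4, 0.5). In practice the candidate values are often spaced exponentially, since the useful range of both parameters spans several orders of magnitude.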


References

[1] C. Cortes and V. Vapnik. Support-Vector Networks. Machine Learning, 1995.

[2] J. Milgram et al. “One Against One” or “One Against All”: Which One is Better for Handwriting Recognition with SVMs? 2006.
