In this paper, we present a method for analyzing the content of an audio signal using an artificial intelligence technique: Support Vector Machines (SVM). The objective is to detect the different events occurring in an unknown audio signal for information-retrieval purposes. In particular, we present the detection of violent events in a video.
There are two types of data mining, depending on whether the aim is to describe or to predict. In the specific case of audio data mining, on the one hand there is a descriptive approach, which consists of grouping a set of audio signals into the most perceptually similar clusters; this is unsupervised classification. On the other hand, there is a predictive approach, which consists of building a model from a training database so that any new audio signal can be automatically classified with the resulting model; this is supervised classification. The present paper deals with supervised classification.
There are various supervised classification algorithms, such as decision trees, neural networks, etc. However, we chose Support Vector Machines (SVM) [1], which, according to the literature, give good results in real-world applications.
Firstly, we will describe the database, or corpus. In a second section, we will present the features used to describe the stimuli of the corpus. The third part of the paper will be devoted to a brief overview of the theory behind the SVM algorithm. Finally, we will present the results of our study before drawing conclusions from this work.
The objective is to detect screaming and gunshot sounds in an unknown video signal. We therefore built a database composed of three categories (or classes) of audio signals: Screaming, Gunshot (and Explosion), and Other. Each group contains approximately the same number of stimuli in order to avoid bias. The last category (Other) contains speech, music, and other environmental samples. All the stimuli were then resampled to 16 kHz before extracting time- and frequency-based features.
A total of 40 audio features, or Low-Level Descriptors (LLDs), were extracted (cf. Table 1): 12 MFCC coefficients, the 12 first derivatives (Delta) of the MFCC, the 12 second derivatives (Delta-Delta, or accelerations) of the MFCC, Loudness, Intensity, Zero-Crossing Rate, and a Voicing index. Five statistics (cf. Table 1) were then applied to each of these 40 features. Finally, 200 features (40 × 5) were used to describe the database stimuli.
Acoustical features (LLDs):
  12 MFCC coefficients
  12 Delta MFCC
  12 Delta-Delta MFCC
  Loudness, Intensity, Zero-Crossing Rate, Voicing index
Statistics:
  Variation = Maximum value − Minimum value

Table 1: List of acoustical features and statistics
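The statistics step above can be sketched as follows: per-frame LLD trajectories are collapsed into one fixed-size vector by applying five frame-level statistics to each of the 40 descriptors. Only "Variation" is given in Table 1; the other four statistics used here (mean, standard deviation, maximum, minimum) are assumptions for illustration, as is the `summarize_llds` helper.

```python
import numpy as np

def summarize_llds(llds):
    """Collapse per-frame LLD trajectories into one fixed-size vector.

    llds: array of shape (n_features, n_frames), e.g. (40, T).
    Returns a vector of shape (n_features * 5,), i.e. 200 for 40 LLDs.
    The five statistics are assumed here (only 'Variation' appears in
    Table 1): mean, standard deviation, maximum, minimum, and
    Variation = maximum - minimum.
    """
    stats = [
        llds.mean(axis=1),
        llds.std(axis=1),
        llds.max(axis=1),
        llds.min(axis=1),
        llds.max(axis=1) - llds.min(axis=1),  # Variation (Table 1)
    ]
    return np.concatenate(stats)

rng = np.random.default_rng(0)
frame_llds = rng.normal(size=(40, 120))   # 40 LLDs over 120 frames
features = summarize_llds(frame_llds)     # 200-dimensional vector
```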
The purpose of SVM is either classification or regression. In this study we focus on the classification case. Let us consider the following set of pairs:

$\{(x_i, y_i)\}_{i=1,\dots,M}, \quad x_i \in \mathbb{R}^N, \; y_i \in \{-1, +1\}$
where the $x_i$ are $N$-dimensional elements of the training database and the $y_i$ correspond to their respective labels. More explicitly, in our case $x_i$ is the vector containing the 200 features of a signal in one of the classes and $y_i$ its label, namely Gunshot, Screaming, or Other. The training pairs $(x_i, y_i)$ lying closest to the separating hyperplane are called support vectors. The objective is to find a hyperplane that separates the training data into two groups: $\{x_i : y_i = +1\}$ and $\{x_i : y_i = -1\}$.
The equation of a hyperplane is:

$\langle w, x \rangle + b = 0 \qquad (1)$
The symbol $\langle \cdot, \cdot \rangle$ denotes the dot product, here and in all the following equations. Let us suppose that the training data are separable; a discriminant function can then be defined by:

$f(x) = \langle w, x \rangle + b \qquad (2)$
Thus, for any training vector $x_i$:

$y_i = \operatorname{sign}(f(x_i)) \qquad (3)$
The hyperplanes whose equations are $\langle w, x \rangle + b = +1$ and $\langle w, x \rangle + b = -1$ are called margins. One can show that the distance between these two margins is $2 / \|w\|$.
On the one hand, to minimize classification errors, the distance between the margins must be maximized. This is equivalent to minimizing $\frac{1}{2}\|w\|^2$ (it is actually easier to minimize a quadratic function than $\|w\|$ itself).
On the other hand, from equation (3), any training vector $x_i$ is correctly classified if and only if:

$y_i (\langle w, x_i \rangle + b) \geq 1 \qquad (4)$
This leads to the following quadratic optimization problem:

$\min_{w,\, b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (\langle w, x_i \rangle + b) \geq 1, \quad i = 1, \dots, M \qquad (5)$
$\phi$ is a function inducing the kernel function defined as $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$. This function maps the training data into a space of higher dimensionality in order to increase the chance of finding a hyperplane that better separates the training data. The usual kernel functions are the linear, polynomial, and sigmoid kernels, and the RBF (Radial Basis Function). It is this last one that we used in our study; its equation is:

$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right), \quad \gamma > 0 \qquad (6)$
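As a quick sanity check of the RBF kernel's behavior, the minimal sketch below evaluates it directly: the kernel equals 1 for identical points and decays toward 0 as the points move apart. The `gamma` value is an arbitrary choice for illustration.

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=0.5):
    """RBF kernel: K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

x = np.array([1.0, 2.0])
y = np.array([1.0, 3.0])
k_same = rbf_kernel(x, x)   # identical points: kernel value is 1
k_near = rbf_kernel(x, y)   # ||x - y||^2 = 1, so exp(-0.5) ≈ 0.607
```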
In order to overcome the problem of non-separable data, we introduce slack variables $\xi_i \geq 0$ and a penalty parameter $C > 0$ to relax the constraints. Consequently, equation (5) becomes:

$\min_{w,\, b,\, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{M} \xi_i \quad \text{subject to} \quad y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \qquad (7)$
During the training phase, the 200 features described above are first extracted for each stimulus. The features are then scaled, and the scaling parameters are saved so that they can be reused during the prediction phase. Each class (Screaming and Gunshot) is then modeled from the scaled features. Since there are more than two classes, we chose the One-Against-All strategy [2]. This technique turns each label into a binary-class problem. The final decision is obtained by comparing the individual scores.
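The final comparison step of the One-Against-All strategy can be sketched as follows: each binary classifier scores the test vector, and the label whose classifier is most confident wins. The score values below are hypothetical, stand-ins for the decision-function outputs of the three trained SVMs.

```python
import numpy as np

# One-against-all: one binary SVM per class, each trained to separate
# that class from the other two. At prediction time, the label whose
# classifier gives the largest decision-function score wins.
classes = ["Screaming", "Gunshot", "Other"]
scores = np.array([-0.40, 1.20, 0.15])   # hypothetical SVM scores
predicted = classes[int(np.argmax(scores))]
```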
The parameters $C$ and $\gamma$ are refined thanks to a grid search and a cross-validation technique. A given test signal is first of all split into segments using a scene-change detection based on acoustic features. Once the test signal is split into homogeneous segments (in the acoustic-feature sense), the features of each segment are scaled using the scaling parameters obtained from the training features. Finally, each segment is labeled according to the training model of each class.
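The scaling step above hinges on reusing the training-time parameters unchanged at prediction time. A minimal sketch, assuming min-max scaling to [0, 1] (the paper does not name the scaling method) and illustrative data values:

```python
import numpy as np

# Training features: rows are stimuli, columns are features.
train = np.array([[0.0, 10.0],
                  [2.0, 30.0],
                  [4.0, 20.0]])

# Scaling parameters computed on the training set and saved;
# they are NOT recomputed on test data.
f_min = train.min(axis=0)
f_range = train.max(axis=0) - f_min

def scale(x):
    """Apply the saved training-set scaling to a feature vector."""
    return (x - f_min) / f_range

segment = np.array([1.0, 25.0])   # features of one test segment
scaled_segment = scale(segment)   # -> [0.25, 0.75]
```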
[1] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[2] J. Milgram et al., "'One Against One' or 'One Against All': Which One is Better for Handwriting Recognition with SVMs?," 2006.