Audio Analysis Machine Learning Publications Signal Processing Video Analysis

Random Forest Classifier and Bag of Audio Words concept applied to audio scene recognition

Bag of Audio Words (BoAW) is a concept inspired by the text mining research area. The idea is to represent any audio signal as a document of words. In this parallelism, each word corresponds to an acoustic feature.  The concept was successfully applied to image processing where the bag of visual words is generated using an unsupervised classifier like k-means. Here we will describe how to design a Bag of Words for the speech/audio signal case. Since the final goal is to build an audio/speech pattern recognition system, we will used as supervised classifier the Random Forest  (RF) classifier, which is well adapted to large data sets with a very high number of features. Moreover, it has some good robustness properties to guard against overfitting.

Bag of Words concept

Here, we apply the BoAW concept for the purpose of audio scene recognition. The objective is to apply a supervised learning technique to a set of labelled training data. However, to do this, we must follow three main steps.  The first is the extraction of acoustic features (MFCC, PLP, etc.). The second step consists of quantifying the MFCC matrices using a codebook. And the third step is to design the predictive model by applying a chosen supervised learning technique to these “quantified” features.

In the audio signal case, feature extraction involves transposing the temporal raw signal towards a space which is more correlated to human perception. For the sake of simplicity, we will consider as features the MFCC features. Each temporal signal is transformed into an MFCC matrix composed of  rows and  columns where the columns represent the features and the rows are overlapped temporal frames of the original signal.

Since we need some features to feed the supervised classifier further, MFCC matrices must be transformed into vectors while keeping essential information. Usually we concatenate some statistical information about the matrix rows (such as “mean”, “percentile” and “variance”) etc. However, instead of this process, the core idea of BoAW is to quantify the matrix using a fixed sized codebook. To do this, we apply an unsupervised classifier: “k-means”.

Codebook design

Let us suppose that the training data set is composed of  samples whose MFCC matrices are respectively denoted . In our case, the element  is the  cepstral coefficient of the  frame of the sample indexed by . Firstly, all the  matrices are concatenated and the resulting matrix is denoted. The objective of k-means is to cluster a given set of points in  clusters  so that the within-cluster sum of squares is minimal.

Denoting by  the respective mean of the points (center) of each cluster  the goal of k-means is to find the set of clusters  such that:

After the k-means process, each of the MFCC matrices is quantified using as codebook the set of clusters . We can notice that the variable  is an optimization parameter.

All the  matrices are represented by their vector  whose elements are the index of the cluster to which each of the frames belongs. Then, they are represented by their histogram . Finally each sample  of the training data set is associated to the pair  where  is the label of .

Supervised Classification Phase

To build the supervised model we decided to use the well-known “Random Forest Classifier” invented by Breiman in [1]. This technique is based on two main concepts: “Bagging” (Bootstrap Aggregating) and the random subspaces method [3].

The bagging method uses the “bootstrap” resampling technique presented by Efron in [2]. Given an initial training data set, the method consists of generating by sampling with replacement some new training sets   with a size which is generally equal to.

The bagging technique can be applied to all types of classifier, however it has been demonstrated experimentally that its efficiency is more relevant for unstable classifiers.

 A classifier is said to be unstable if small variations in the training data set lead to some large difference in classification. The advantage of such a technique is to reduce the variance while preserving the bias.

The technique of random subspace is applied to random selection of features at each node of each tree. It aims to randomly select a subset of features that will be used to search for the optimal feature for data portioning at each node. This technique is very effective, especially in cases where the number of features  is higher than in the samples (which can lead to the problem of curse of dimensionality). This number is an optimization parameter.

This second technique aims at making all the trees as independent as possible. In the classic algorithm of RF, the CART (Classification And Regression Tree) is usually used [3]. This is a binary tree based on the Gini index criterion (an impurity criterion) for the data partition at each node.

If the initial learning database is the sample  where the variables  have  as size. Let us denote  the number of bootstraps (number of trees); then the RF algorithm is as follows:

For  ranging from 1 to :

1. Generate a bootstrap sample  from the initial training set  with an identical size.

2.  Build a CART tree using as training data set the bootstrap sample . At each node of  select  random features from the initial set of  features to search for the optimal feature according to the Gini index.

Let us denote  the classification result of an element  by the tree , the class of a new element  is predicted as follows: 

The parameters of optimization are among others the number of trees  and the number  of features selected randomly at each node. The optimal parameters can be obtained via a cross-validation process. An advantage provided by the bagging technique is that we do not need to split the training data set into two parts (training and test sets). In fact, the bagging process is done by sampling with replacement, so statistically approximately 1/3 of the sampling data is not used for the design of each tree. The data set which does not participate in the tree design at each node is called “Out-Of-Bag” (OOB). The cross-validation can therefore be monitored by analyzing the OOB classification error.


[1] L. Breiman. Random forests. Machine Learning, 45:5-32, 2001.

[2] B. Efron and R. Tibshirani. An introduction to the bootstrap. Chapman & Hall, 1993.

[3] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees, Wadsworth, Belmont, CA. Republished by CRC Press. 1984.

[4] Tin Kam Ho. “The Random Subspace Method for Constructing Decision Forests”. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1998.

You may also like
Auction Theory
Signal Processing Applied To Video Mining: Video Boundaries Detection