Bag of Audio Words (BoAW) is a concept inspired by the text mining research area. The idea is to represent any audio signal as a document of words. In this parallelism, each word corresponds to an acoustic feature. The concept was successfully applied to image processing where the bag of visual words is generated using an unsupervised classifier like *k*-means. Here we will describe how to design a Bag of Words for the speech/audio signal case. Since the final goal is to build an audio/speech pattern recognition system, we will used as supervised classifier the Random Forest (RF) classifier, which is well adapted to large data sets with a very high number of features. Moreover, it has some good robustness properties to guard against overfitting.

Here, we apply the BoAW concept for the purpose of audio scene recognition. The objective is to apply a supervised learning technique to a set of labelled training data. However, to do this, we must follow three main steps. The first is the extraction of acoustic features (MFCC, PLP, etc.). The second step consists of quantifying the MFCC matrices using a codebook. And the third step is to design the predictive model by applying a chosen supervised learning technique to these “quantified” features.

In the audio signal case, feature extraction involves transposing the temporal raw signal towards a space which is more correlated to human perception. For the sake of simplicity, we will consider as features the MFCC features. Each temporal signal is transformed into an MFCC matrix composed of rows and columns where the columns represent the features and the rows are overlapped temporal frames of the original signal.

Since we need some features to feed the supervised classifier further, MFCC matrices must be transformed into vectors while keeping essential information. Usually we concatenate some statistical information about the matrix rows (such as “mean”, “percentile” and “variance”) etc. However, instead of this process, the core idea of BoAW is to quantify the matrix using a fixed sized codebook. To do this, we apply an unsupervised classifier: “*k*-means”.

Let us suppose that the training data set is composed of samples whose MFCC matrices are respectively denoted . In our case, the element is the cepstral coefficient of the frame of the sample indexed by . Firstly, all the matrices are concatenated and the resulting matrix is denoted. The objective of *k*-means is to cluster a given set of points in clusters so that the within-cluster sum of squares is minimal.

Denoting by the respective mean of the points (center) of each cluster the goal of *k*-means is to find the set of clusters such that:

After the *k*-means process, each of the MFCC matrices is quantified using as codebook the set of clusters . We can notice that the variable is an optimization parameter.

All the matrices are represented by their vector whose elements are the index of the cluster to which each of the frames belongs. Then, they are represented by their histogram . Finally each sample of the training data set is associated to the pair where is the label of .

To build the supervised model we decided to use the well-known “Random Forest Classifier” invented by Breiman in [1]. This technique is based on two main concepts: “Bagging” (*B*ootstrap *Agg*regat*ing*) and the random subspaces method [3].

The bagging method uses the “bootstrap” resampling technique presented by Efron in [2]. Given an initial training data set, the method consists of generating by sampling with replacement some new training sets with a size which is generally equal to.

The bagging technique can be applied to all types of classifier, however it has been demonstrated experimentally that its efficiency is more relevant for unstable classifiers.

A classifier is said to be unstable if small variations in the training data set lead to some large difference in classification. The advantage of such a technique is to reduce the variance while preserving the bias.

The technique of random subspace is applied to random selection of features at each node of each tree. It aims to randomly select a subset of features that will be used to search for the optimal feature for data portioning at each node. This technique is very effective, especially in cases where the number of features is higher than in the samples (which can lead to the problem of curse of dimensionality). This number is an optimization parameter.

This second technique aims at making all the trees as independent as possible. In the classic algorithm of RF, the CART (Classification And Regression Tree) is usually used [3]. This is a binary tree based on the Gini index criterion (an impurity criterion) for the data partition at each node.

If the initial learning database is the sample where the variables have as size. Let us denote the number of bootstraps (number of trees); then the RF algorithm is as follows:

For ranging from 1 to :

1. Generate a bootstrap sample from the initial training set with an identical size.

2. Build a CART tree using as training data set the bootstrap sample . At each node of select random features from the initial set of features to search for the optimal feature according to the Gini index.

Let us denote the classification result of an element by the tree , the class of a new element is predicted as follows:

The parameters of optimization are among others the number of trees and the number of features selected randomly at each node. The optimal parameters can be obtained via a cross-validation process. An advantage provided by the bagging technique is that we do not need to split the training data set into two parts (training and test sets). In fact, the bagging process is done by sampling with replacement, so statistically approximately 1/3 of the sampling data is not used for the design of each tree. The data set which does not participate in the tree design at each node is called “Out-Of-Bag” (OOB). The cross-validation can therefore be monitored by analyzing the OOB classification error.

[1] L. Breiman. Random forests. Machine Learning, 45:5-32, 2001.

[2] B. Efron and R. Tibshirani. An introduction to the bootstrap. Chapman & Hall, 1993.

[3] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees, Wadsworth, Belmont, CA. Republished by CRC Press. 1984.

[4] Tin Kam Ho. “The Random Subspace Method for Constructing Decision Forests”. *IEEE Transactions on Pattern Analysis and Machine Intelligence*. 1998.

Acoustic FeaturesAcoustic ModelsArtificial IntelligenceASRaudio featuresAudio FingerprintsAudio signal processingAudio time and frequency indicatorsBigDataBigTableBlind Audio Source SeparationCassandraComputation VisionComputer VisionEdge DetectionGMMGrammarHadoopHMMImage ProcessingInformation RetrievalKeyword spottingLanguage ModellingLatent Dirichlet AllocationLatent Semantic AnalysisLexiconMAPMapReduceMFCCNoise ReductionPhone RecognitionPLPProbalistic Latent Semantic AnalysisShifted Delta CepstralSimilarity MeasuresSpeech EnhancementSpeech ProcessingSupervised automatic learningSVMSVM and GMM classificationsText MiningUBMvideo boundariesvideo featuresWiener Filter