
Mood is a subjective term describing the emotional state of a human being. It can be expressed in textual form (e.g. Twitter …), a topic already addressed in our paper on sentiment analysis. Mood can also be recognized by analyzing facial expressions and/or the nature of the voice. The speech-based Automatic Emotion Recognition (AER) systems discussed here have several applications. In call centers, detecting a caller's emotion can help in taking appropriate decisions. In online video advertising, detecting emotion from the speech signal of a video can help fine-tune user targeting. Emotion detected from speech can also be combined with facial expressions and textual information to improve accuracy. Here we focus on Automatic Emotion Recognition based solely on the analysis of human speech. The system presented is based on a recent machine learning technique, the Deep Belief Network (DBN), an improvement on classical neural networks. We describe the DBN and the database of emotional speech used to build such an AER system.

Neural Network

A neural network is a technique from the field of Artificial Intelligence (AI) that aims to mimic the behavior of biological neurons. Classical neural networks have either one or two hidden layers. Recent advances in AI research have introduced models with more than two hidden layers: these are “Deep Learning” systems. To understand how deep learning works, we present step by step the evolution of neural networks in AI research. The basic unit of a neural network is the artificial neuron, designed on the principle of the biological human neuron, which is composed of four main parts:

- The dendrites are the inputs of the neuron; they receive signals coming from other neurons.
- The axon represents the output of the neuron.
- The nucleus activates the output according to the input signals.
- The synapses are the connection points with other neurons and with muscular and nervous fibers.

The perceptron, introduced in the 1950s by Frank Rosenblatt, tries to mimic the biological system described above. Denoting by $x_1, \dots, x_n$ the input signals of a basic perceptron, the output state $y$ is defined as follows:

$$y = f\left(\sum_{i=1}^{n} w_i x_i + w_0\right)$$

where the function $f$ is called the “activation function”, the $w_i$ are the weights applied to the inputs, and the weight $w_0$ is the neuron “bias”. The objective in building a neuron model is to find the optimal weights $w = (w_0, w_1, \dots, w_n)$ which minimize the loss function. This is usually solved using stochastic gradient descent or one of its variants.

| Biological neuron | Artificial neuron |
|---|---|
| Dendrite | Input to artificial neuron |
| Axon | Output from artificial neuron |
| Synapses | Weights |
| Nucleus | Activation function |
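The perceptron described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the activation is a simple threshold, the training data is the logical AND (a linearly separable toy problem, not taken from the article), and the weights are fitted with the classic perceptron learning rule rather than a gradient-based optimizer. The names `predict` and `train_perceptron` are hypothetical.

```python
import random

def step(a):
    """Threshold activation: 1 if the weighted sum is non-negative, else 0."""
    return 1 if a >= 0 else 0

def predict(weights, bias, x):
    """Perceptron output y = f(sum_i w_i * x_i + w_0)."""
    a = sum(w * xi for w, xi in zip(weights, x)) + bias
    return step(a)

def train_perceptron(samples, n_inputs, lr=1.0, epochs=100, seed=0):
    """Perceptron learning rule, visiting the samples in random order.

    Assumption: lr=1.0 keeps the arithmetic exact on this toy problem;
    starting from zero weights, the rate only rescales the solution.
    """
    rng = random.Random(seed)
    weights = [0.0] * n_inputs
    bias = 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, target in data:
            error = target - predict(weights, bias, x)  # -1, 0 or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Toy, linearly separable problem: logical AND.
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_DATA, n_inputs=2)
predictions = [predict(weights, bias, x) for x, _ in AND_DATA]
print(predictions)
```

Because AND is linearly separable, the perceptron convergence theorem guarantees that this rule stops making mistakes after a bounded number of updates.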

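According to the nature of $f$, different types of neuron can be defined. The most common activation functions are:

- Identity function: $f(x) = x$
- Sigmoid function: $f(x) = \frac{1}{1 + e^{-x}}$
- Radial function (for example the Gaussian): $f(x) = e^{-x^2/2}$
- Tanh function: $f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$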
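As a rough illustration, the four activation functions listed above could be written as follows. The Gaussian form chosen for the radial function is an assumption (several radial functions exist), and the function names are illustrative only.

```python
import math

def identity(x):
    """Identity activation: the output equals the weighted input."""
    return x

def sigmoid(x):
    """Logistic sigmoid, squashing the input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gaussian_rbf(x):
    """Radial activation; the Gaussian form is assumed here."""
    return math.exp(-x * x / 2.0)

def tanh(x):
    """Hyperbolic tangent, squashing the input into (-1, 1)."""
    return math.tanh(x)

for name, f in [("identity", identity), ("sigmoid", sigmoid),
                ("radial", gaussian_rbf), ("tanh", tanh)]:
    print(name, [round(f(x), 3) for x in (-2.0, 0.0, 2.0)])
```

Note that tanh is a rescaled sigmoid: $\tanh(x) = 2\,\sigma(2x) - 1$, which is why the two often behave similarly in practice.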

One of the drawbacks of a single perceptron is that it can only solve linearly separable problems. To overcome this weakness, feed-forward multilayer perceptrons (MLP) have been introduced. The layers between the input and output layers are called hidden layers. The units of layer $l$ are all connected to those of layer $l+1$. When the number of hidden layers is high, we speak of a ‘deep neural network’.
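To make the point about linear separability concrete, here is a minimal sketch of a multilayer perceptron with one hidden layer that computes XOR, a function no single perceptron can represent. The weights are set by hand for illustration (they are not learned, and they are not from the article), and a threshold activation is assumed.

```python
def step(a):
    """Threshold activation: 1 if the weighted sum is non-negative, else 0."""
    return 1 if a >= 0 else 0

def layer(inputs, weights, biases):
    """Fully connected layer: every input unit feeds every output unit."""
    return [step(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Hidden layer (hand-set): unit 0 fires for OR, unit 1 fires for AND.
HIDDEN_W = [[1, 1], [1, 1]]
HIDDEN_B = [-0.5, -1.5]
# Output unit: fires when OR is on and AND is off, i.e. XOR.
OUT_W = [[1, -2]]
OUT_B = [-0.5]

def xor_mlp(x1, x2):
    """Forward pass through the 2-2-1 network."""
    hidden = layer([x1, x2], HIDDEN_W, HIDDEN_B)
    return layer(hidden, OUT_W, OUT_B)[0]

outputs = [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)
```

The hidden layer re-encodes the inputs so that the output unit faces a linearly separable problem, which is the basic reason extra layers add expressive power.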

For a long time, deep architectures were rarely used by researchers because several studies had shown that training them from a random initialization was difficult [1]. The Deep Belief Network (DBN) is a type of deep neural network introduced by Hinton et al. in 2006 [2].

A DBN consists of a stack of Restricted Boltzmann Machines (RBM) [3], where the output of the hidden layer of each RBM is used as the input of the next RBM. Each layer of the network tries to model the distribution of its input using a Restricted Boltzmann Machine. This RBM-based pre-training phase replaces the random weight initialization of classical neural networks. The Boltzmann Machine (BM) is a generative neural network composed of two layers: hidden and visible. The RBM is a particular case of the BM in which connections between units of the same layer (visible-visible and hidden-hidden connections) are not allowed. The RBM is usually used as an unsupervised feature detector when the training samples are not labeled; however, it can also be used in a supervised context.
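The stacking idea can be sketched as a greedy, layer-by-layer loop: train one RBM, push the data through it, and use the resulting hidden activations as training data for the next RBM. The code below is a simplified illustration, not the authors' implementation. The helper `train_rbm` is hypothetical and uses a reduced one-step contrastive divergence (the training rule discussed further on in this article), and the toy data and layer sizes are arbitrary assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def up(v, W, c):
    """Hidden activation probabilities of one RBM for a visible vector v."""
    return [sigmoid(c[j] + sum(W[i][j] * v[i] for i in range(len(v))))
            for j in range(len(c))]

def down(h, W, b):
    """Visible activation probabilities of one RBM for a hidden vector h."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def train_rbm(data, n_hidden, rng, lr=0.1, epochs=50):
    """Train one RBM with a simplified CD-1 step (hypothetical helper)."""
    n_visible = len(data[0])
    W = [[rng.gauss(0, 0.01) for _ in range(n_hidden)]
         for _ in range(n_visible)]
    b, c = [0.0] * n_visible, [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            ph0 = up(v0, W, c)
            h0 = [1 if rng.random() < p else 0 for p in ph0]
            v1 = down(h0, W, b)  # mean-field reconstruction, no sampling
            ph1 = up(v1, W, c)
            for i in range(n_visible):
                for j in range(n_hidden):
                    W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
                b[i] += lr * (v0[i] - v1[i])
            for j in range(n_hidden):
                c[j] += lr * (ph0[j] - ph1[j])
    return W, b, c

def pretrain_dbn(data, layer_sizes, seed=0):
    """Greedy layer-wise pre-training: each RBM models the layer below."""
    rng = random.Random(seed)
    layers, current = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(current, n_hidden, rng)
        layers.append((W, b, c))
        # Hidden probabilities become the "visible" data of the next RBM.
        current = [up(v, W, c) for v in current]
    return layers

def transform(v, layers):
    """Propagate one input vector up through the whole stack."""
    for W, _, c in layers:
        v = up(v, W, c)
    return v

toy_data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]  # arbitrary toy patterns
dbn = pretrain_dbn(toy_data, layer_sizes=[4, 2])
top = transform(toy_data[0], dbn)
print(len(dbn), len(top))
```

After this unsupervised phase, the learned weights would typically initialize a feed-forward network that is then fine-tuned with labels; that supervised step is left out of the sketch.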

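Given an RBM with binary states, an energy function can be defined for each pair of visible and hidden state vectors $(v, h)$ as follows:

$$E(v, h) = -b^\top v - c^\top h - v^\top W h$$

where $W$ is the matrix of symmetric weights connecting the visible and hidden units. The vectors $b$ and $c$ are composed of the coefficients that connect the bias to the visible and hidden units respectively. If we denote by $Z$ the partition function:

$$Z = \sum_{v, h} e^{-E(v, h)}$$

The joint probability in an RBM is defined as follows:

$$P(v, h) = \frac{e^{-E(v, h)}}{Z}$$

The distributions of a visible and a hidden vector are:

$$P(v) = \frac{1}{Z} \sum_{h} e^{-E(v, h)}, \qquad P(h) = \frac{1}{Z} \sum_{v} e^{-E(v, h)}$$

Due to the lack of connections between units of the same layer, the visible units are conditionally independent given the hidden variables and vice-versa. Consequently

$$P(v \mid h) = \prod_{i} P(v_i \mid h)$$

and

$$P(h \mid v) = \prod_{j} P(h_j \mid v)$$

In the case of binary units, we can easily demonstrate that the simplified expressions for $P(h_j = 1 \mid v)$ and $P(v_i = 1 \mid h)$ are:

$$P(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_{i} w_{ij} v_i\Big), \qquad P(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_{j} w_{ij} h_j\Big)$$

where $\sigma(x) = 1 / (1 + e^{-x})$ is the sigmoid function and $w_{ij}$ is the entry of $W$ linking visible unit $i$ to hidden unit $j$.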
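For a very small RBM these quantities can be computed exactly by enumerating every state, which gives a way to check the sigmoid expression numerically. The sketch below assumes the standard binary RBM energy $E(v, h) = -b^\top v - c^\top h - v^\top W h$; the weights and biases are arbitrary toy values, and the function names are illustrative.

```python
import math
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def energy(v, h, W, b, c):
    """E(v, h) = -b.v - c.h - v^T W h for binary vectors v and h."""
    e = -sum(bi * vi for bi, vi in zip(b, v))
    e -= sum(cj * hj for cj, hj in zip(c, h))
    e -= sum(v[i] * W[i][j] * h[j]
             for i in range(len(v)) for j in range(len(h)))
    return e

def partition(W, b, c):
    """Z: sum of exp(-E) over all states (only feasible for tiny RBMs)."""
    return sum(math.exp(-energy(v, h, W, b, c))
               for v in product((0, 1), repeat=len(b))
               for h in product((0, 1), repeat=len(c)))

def joint(v, h, W, b, c, Z):
    """P(v, h) = exp(-E(v, h)) / Z."""
    return math.exp(-energy(v, h, W, b, c)) / Z

def p_h_given_v(j, v, W, c):
    """Closed form P(h_j = 1 | v) = sigmoid(c_j + sum_i w_ij v_i)."""
    return sigmoid(c[j] + sum(W[i][j] * v[i] for i in range(len(v))))

# Arbitrary toy parameters: 3 visible units, 2 hidden units.
W = [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]]
b = [0.1, -0.2, 0.0]
c = [0.3, -0.1]
Z = partition(W, b, c)

# Compare the closed form with a brute-force conditional from the joint.
v = (1, 0, 1)
hidden_states = list(product((0, 1), repeat=len(c)))
numerator = sum(joint(v, h, W, b, c, Z) for h in hidden_states if h[0] == 1)
denominator = sum(joint(v, h, W, b, c, Z) for h in hidden_states)
brute_force = numerator / denominator
closed_form = p_h_given_v(0, v, W, c)
print(brute_force, closed_form)
```

The enumeration grows as $2^{n_v + n_h}$, which is why $Z$ is intractable for realistic layer sizes and why training relies on sampling-based approximations.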

During the training phase of an RBM, the objective is to maximize the log-likelihood of the training data set, which leads to the following updating rule for the weight $w_{ij}$ between visible unit $v_i$ and hidden unit $h_j$:

$$\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right)$$

where $\epsilon$ is the learning rate and $\langle \cdot \rangle$ represents the expectation symbol. While the first term of the above equation is easily tractable, the second is intractable and is usually approximated using contrastive divergence (CD) [3]. According to Yoshua Bengio [4], CD is a recipe for training undirected graphical models (a class of probabilistic models used in machine learning). It relies on an approximation of the gradient (a good direction of change for the parameters) of the log-likelihood based on a short Markov chain (a way to sample from probabilistic models). Hence, in the above updating equation the term $\langle \cdot \rangle_{\text{model}}$ is replaced by $\langle \cdot \rangle_{k}$, which represents the expectation with respect to the distribution of samples obtained by running the Gibbs sampler, initialized at the data, for $k$ full steps (usually $k = 1$).
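A minimal sketch of one CD-$k$ update for a binary RBM is given below, assuming per-sample updates and the usual companion updates for the two bias vectors (the article only states the weight rule). The learning rate, layer sizes, toy patterns and all function names are illustrative assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    """P(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(c[j] + sum(W[i][j] * v[i] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    """P(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def sample(probs, rng):
    """Draw binary states from independent Bernoulli probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]

def cd_k_update(v0, W, b, c, rng, lr=0.1, k=1):
    """One CD-k update (k >= 1) for a single training vector, in place."""
    ph0 = hidden_probs(v0, W, c)      # data-driven (positive) statistics
    h = sample(ph0, rng)
    for _ in range(k):                # k full Gibbs steps started at the data
        vk = sample(visible_probs(h, W, b), rng)
        phk = hidden_probs(vk, W, c)
        h = sample(phk, rng)
    for i in range(len(b)):
        for j in range(len(c)):
            # <v_i h_j>_data - <v_i h_j>_k, estimated from one chain
            W[i][j] += lr * (v0[i] * ph0[j] - vk[i] * phk[j])
        b[i] += lr * (v0[i] - vk[i])
    for j in range(len(c)):
        c[j] += lr * (ph0[j] - phk[j])

rng = random.Random(0)
n_visible, n_hidden = 6, 3
W = [[rng.gauss(0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
b = [0.0] * n_visible
c = [0.0] * n_hidden
data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]  # arbitrary toy patterns
for _ in range(100):
    for v0 in data:
        cd_k_update(v0, W, b, c, rng)
print([round(p, 2) for p in hidden_probs(data[0], W, c)])
```

Using hidden probabilities rather than sampled states in the statistics is a common variance-reduction choice; real implementations usually also work on mini-batches rather than single vectors.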

Other algorithms that often perform better, such as Persistent Contrastive Divergence (PCD), also known as Stochastic Maximum Likelihood (SML), can be used to solve the optimization problem.

To design a speech-based AER system built on a DBN, several databases are available. A significant and free database in the German language is the Berlin one, which contains 535 German utterances covering seven types of emotion: anger, joy, sadness, fear, disgust, boredom, and neutral.

To feed the DBN system, we extract spectral acoustic features such as MFCC, together with prosodic features such as pitch, energy, zero crossing rate, etc. The input layer must have a number of units equal to the total number of features. The number of hidden layers and the number of units in each of them are difficult to set a priori. Cross-validation is usually helpful to determine a suitable architecture. Alternatively, Hinton gives some tips for choosing these hyperparameters (number of hidden layers and their units) in [5].
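Two of the simpler features mentioned above, short-time energy and zero crossing rate, can be computed frame by frame as sketched below. MFCC and pitch need more machinery and are left out. The synthetic 200 Hz tone, the 8 kHz sample rate and the frame and hop lengths are assumptions chosen only for illustration.

```python
import math

def frames(signal, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def feature_vector(frame):
    """One feature vector per frame: [energy, zero crossing rate]."""
    return [short_time_energy(frame), zero_crossing_rate(frame)]

SAMPLE_RATE = 8000  # Hz, assumed
tone = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE // 10)]  # 0.1 s synthetic 200 Hz tone

features = [feature_vector(f) for f in frames(tone, frame_len=200, hop=100)]
n_input_units = len(features[0])  # size of the network's input layer
print(len(features), n_input_units)
```

For a pure tone the zero crossing rate is close to twice the frequency divided by the sample rate, so it acts as a crude frequency indicator; on real speech it mainly helps separate voiced from unvoiced segments.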

[1] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why Does Unsupervised Pre-training Help Deep Learning? *Journal of Machine Learning Research*, 11(Feb):625-660, 2010.

[2] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A Fast Learning Algorithm for Deep Belief Nets. *Neural Computation*, 18:1527-1554, 2006.

[3] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey E. Hinton. Deep Belief Networks for Phone Recognition. In *NIPS Workshop on Deep Learning for Speech Recognition and Related Applications*, 2009.

[4] https://www.quora.com/What-is-contrastive-divergence

[5] Geoffrey Hinton. A Practical Guide to Training Restricted Boltzmann Machines, Version 1. August 2010.
