As one of the main mediums of human communication, speech carries not only basic linguistic information but also a wealth of emotional information. Emotion helps people understand real expressions and underlying intentions. Speech emotion recognition (SER) has many applications in human-computer interaction, since it enables machines to understand emotional states as humans do. For example, SER can be used in call centers to monitor customers' emotional states, which reflect service quality; this information can help improve the service level and reduce the workload of manual evaluation.
Emotion is conventionally represented by a set of discrete categories, such as happiness, sadness and anger, assigned over utterances. In speech emotion recognition, emotional databases are built on the assumption that every utterance belongs to exactly one emotional category. As a result, most researchers treat speech emotion recognition as a typical supervised learning task: given an emotional database, classification models are trained to predict the emotional label of each utterance. Many conventional machine learning methods have thus been applied successfully to SER. Hidden Markov models (HMMs) and Gaussian mixture models (GMMs), which emphasize the temporal structure of the speech signal and had achieved strong performance in speech recognition, were also applied to SER. Support vector machines (SVMs), which excel at modeling small data sets, usually achieved better performance than alternative models. Inspired by the success of deep learning on various tasks, numerous research efforts have been devoted to building effective SER models with deep neural networks (DNNs), leading to impressive results.
However, speech emotion recognition still faces many challenges: the diversity of speakers, genders, languages and cultures influences system performance, and differences in recording conditions harm system stability. While automatic systems have been shown to outperform naive human listeners on speech emotion classification, existing SER systems remain less mature than those for speech and image classification. One serious problem is the shortage of emotional data, which limits the robustness of the models.
Supervised classification methods estimate the emotional class by learning the differences between categories. A sufficiently large amount of labelled emotional speech data is necessary to estimate accurate decision boundaries. However, acquiring labelled data requires expert knowledge and is highly time consuming. Even worse, the boundaries between emotions are ambiguous and subjective, since different people express and perceive emotions differently; there is thus no definitive standard for assigning emotional labels. Due to these difficulties, speech emotion databases are limited in size and cannot cover the diversity of conditions encountered in practice.
Considering the scarcity of speech emotion data, it is beneficial to take full advantage of the information in unlabeled data. Unsupervised learning is one option: it extracts robust feature representations from the data automatically, without depending on label information. Such representations capture the intrinsic structure of the data and provide stronger modeling and generalization ability for training better classifiers. Several unsupervised feature learning approaches have been explored to generate salient emotional feature representations for speech emotion recognition, such as autoencoders (AE) and denoising autoencoders (DAE), whose objective is to obtain intermediate representations that can rebuild the input data as faithfully as possible. More sophisticated methods, such as variational autoencoders (VAE) and generative adversarial networks (GAN), have achieved better performance in SER. They emphasize modeling the distribution of the data, in an explicit form (e.g., a normal distribution) for VAE and an implicit form for GAN, rather than the data itself.
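The reconstruction objective behind AE and DAE can be sketched in a few lines. The following is a minimal, hypothetical single-layer denoising autoencoder pass; the tanh nonlinearity, noise level and mean-squared-error cost are illustrative assumptions, not choices taken from this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def dae_reconstruction_loss(x, W_enc, W_dec, noise_std=0.1):
    """One pass of a single-layer denoising autoencoder: corrupt the
    input, encode it, decode it, and score how well the clean input is
    rebuilt. All choices here (tanh, MSE, noise level) are illustrative."""
    x_noisy = x + noise_std * rng.standard_normal(x.shape)  # corrupt input
    h = np.tanh(x_noisy @ W_enc)                            # latent features
    x_hat = h @ W_dec                                       # reconstruction
    return float(np.mean((x - x_hat) ** 2))                 # MSE cost

# toy usage: 8 utterance-level feature vectors of dimension 16
x = rng.standard_normal((8, 16))
W_enc = 0.1 * rng.standard_normal((16, 4))
W_dec = 0.1 * rng.standard_normal((4, 16))
loss = dae_reconstruction_loss(x, W_enc, W_dec)
```

Minimizing this cost forces the latent features h to retain enough information to rebuild the input, which is exactly the property that later conflicts with a purely discriminative objective.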
The feature representations learned by unsupervised models are usually used as inputs to supervised classification models to train speech emotion recognition systems. Nevertheless, this approach has an underlying problem. The unsupervised stage acts as a feature extractor whose objective is to recover the input signals perfectly, meaning it preserves as much information as possible, whereas we only need the emotionally relevant information. The subsequent supervised stage, on the other hand, concentrates solely on information useful for classification, so extra information that might be complementary for SER is dropped. Therefore, representations learned in the unsupervised stage do not necessarily support the supervised classification task: the objectives of the two stages are inconsistent because they are trained separately.
Deep semi-supervised learning addresses this problem. Semi-supervised learning combines unsupervised feature representation learning with supervised model training. The key is that the two parts are trained simultaneously, so that the representations obtained from unsupervised learning better match the supervised model. Typical structures, such as semi-supervised variational autoencoders and ladder networks, have achieved competitive performance with fewer labelled training samples in other areas.
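The simultaneous training amounts to minimizing a single joint objective: a supervised loss on the labelled samples plus a weighted unsupervised reconstruction cost. A minimal sketch, where the weighting lam = 0.5 and the toy posteriors are illustrative assumptions:

```python
import numpy as np

def cross_entropy(probs, labels):
    # supervised term: negative log-likelihood of the true class
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

def reconstruction_cost(x, x_hat):
    # unsupervised auxiliary term: squared reconstruction error
    return float(np.mean((x - x_hat) ** 2))

def semi_supervised_loss(probs, labels, x, x_hat, lam=0.5):
    """Joint objective minimized in a single training step, so the
    learned features must serve classification while staying
    informative enough to reconstruct the input.
    lam = 0.5 is an illustrative weighting, not the paper's value."""
    return cross_entropy(probs, labels) + lam * reconstruction_cost(x, x_hat)

# toy usage: 3 labelled samples with 4-class posteriors
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.6, 0.1, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
labels = np.array([0, 1, 3])
x = np.ones((3, 5))
x_hat = 0.9 * np.ones((3, 5))
loss = semi_supervised_loss(probs, labels, x, x_hat)
```

Because both terms flow through the same shared encoder during backpropagation, the inconsistency between the two separately trained stages described above disappears.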
Benefiting from the unsupervised component, semi-supervised learning can learn strong feature representations from many unlabeled examples to improve the performance of supervised tasks. Given the scarcity of labelled emotional data and the abundance of unlabelled speech, it is natural to apply semi-supervised learning to speech emotion recognition. Moreover, the auxiliary unsupervised task also acts as a regularizer in the semi-supervised model. Such regularization is essential for building SER systems that generalize across conditions: conventional models perform poorly when the training and testing databases differ. By optimizing models for both primary and auxiliary tasks, the feature representations become more general and avoid overfitting to a particular domain, which makes unsupervised auxiliary tasks an appealing way to regularize the network.
A classic semi-supervised structure is an autoencoder that adds an unsupervised reconstruction objective to a supervised network; the autoencoder can be replaced by other structures such as DAE and VAE, and more layers can be stacked. A more advanced structure is the semi-supervised ladder network. Similar to a DAE, every layer of a ladder network is trained to reconstruct its corrupted input. In addition, the ladder network adds lateral connections between each layer of the encoder and decoder, which distinguishes it from a DAE. Figuratively, these rungs are the origin of the term "ladder", and they reflect the deep multilayer structure of the network. The attraction of hierarchical models is their ability to learn latent variables from low layers to high layers: low layers represent specific information, while high layers generate abstract features that are invariant and relevant for classification. This allows modeling more complex nonlinear structures than conventional methods.
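The per-layer reconstruction with lateral connections can be sketched as follows. This is a deliberately simplified, hypothetical forward pass: the averaging combinator and the transposed-weight projection stand in for the learned combinator and decoder weights of a real ladder network, and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def ladder_forward(x, weights, noise_std=0.1):
    """Minimal ladder-network sketch: one corrupted and one clean
    encoder pass, then a top-down decoder in which every layer combines
    the signal from the layer above with a lateral connection from the
    corrupted encoder, paying a reconstruction cost against the clean
    activations. The 0.5*(z + u) combinator and W.T projection are
    simplifying stand-ins for the learned components."""
    clean = [x]
    corrupted = [x + noise_std * rng.standard_normal(x.shape)]
    for W in weights:  # bottom-up encoder passes (clean and corrupted)
        clean.append(np.tanh(clean[-1] @ W))
        corrupted.append(np.tanh(corrupted[-1] @ W))
    costs, u = [], corrupted[-1]
    for l in range(len(weights), -1, -1):  # top-down decoder pass
        z_hat = 0.5 * (corrupted[l] + u)   # lateral + vertical combine
        costs.append(float(np.mean((clean[l] - z_hat) ** 2)))
        if l > 0:
            u = z_hat @ weights[l - 1].T   # project down one layer
    return costs  # one reconstruction cost per layer, top to bottom

# toy usage: a three-layer ladder over 16-dimensional inputs
x = rng.standard_normal((8, 16))
weights = [0.1 * rng.standard_normal((16, 8)),
           0.1 * rng.standard_normal((8, 4))]
layer_costs = ladder_forward(x, weights)
```

Summing these per-layer costs (typically with layer-specific weights) gives the unsupervised part of the ladder network's objective; the topmost clean activations additionally feed a supervised classifier.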
Most unsupervised methods aim to learn intermediate feature representations that may not support the underlying emotion classification task. This paper proposes to employ unsupervised reconstruction of the inputs as an auxiliary task to regularize the network while optimizing the performance of an emotion classification system. We achieve this goal efficiently with a semi-supervised ladder network. The unsupervised auxiliary task not only yields powerful discriminative representations of the input features, but also acts as a regularizer for the primary supervised emotion task. The core contributions of this paper can be summarized as follows:
1) We utilize semi-supervised learning with a ladder network for speech emotion recognition, and emphasize the importance of the unsupervised reconstruction and skip connection modules. In addition, we show that higher layers of the ladder network yield more discriminative features.
2) We show the benefit of semi-supervised ladder networks, demonstrating that promising results can be obtained with only a small number of labelled samples.
3) We compare the ladder network with DAE and VAE methods for emotion recognition from speech, showing the superior performance of the ladder network. Moreover, a convolutional neural network structure for the encoder and decoder encodes emotional characteristics better.
Semi-supervised Ladder Networks for Speech Emotion Recognition
Jian-Hua Tao, Jian Huang, Ya Li, Zheng Lian, Ming-Yue Niu
As a major component of speech signal processing, speech emotion recognition has become increasingly important to understanding human communication. Benefiting from deep learning, researchers have proposed various unsupervised models to extract effective emotional features and supervised models to train emotion recognition systems. In this paper, we utilize semi-supervised ladder networks for speech emotion recognition. The model is trained by minimizing the supervised loss and an auxiliary unsupervised cost function. The added unsupervised auxiliary task provides powerful discriminative representations of the input features, and also acts as a regularizer for the supervised emotion task. We also compare the ladder network with other classical autoencoder structures. Experiments conducted on the interactive emotional dyadic motion capture (IEMOCAP) database show that the proposed method achieves superior performance with a small amount of labelled data and outperforms the other methods.
Keywords: Speech emotion recognition, ladder network, semi-supervised learning, autoencoder, regularization.