The challenge
Low recognition accuracy in Speech Emotion Recognition (SER)
Speech Emotion Recognition (SER) is an emerging area of research with a growing number of practical applications. Since speech is a primary channel for expressing emotion and affect, further developments in SER technology are redefining human interactions.
Speech-based diagnostic systems are also being developed for the diagnosis of depression and distress, and for monitoring the mood states of bipolar patients.
Other applications include media retrieval, the process of retrieving information from multimedia sources such as text, audio and video, while the smart car industry and forensic sciences also aim to improve their systems by utilising SER techniques.
Despite the emerging importance of SER, recognition accuracy remains low, and substantial improvements are required to make commercial applications viable. A key underlying reason for this low accuracy is the scarcity of emotion datasets. Lack of data is a challenge when developing any robust machine learning model.
Human emotion in speech is also complex to model, as speech depends on speaker characteristics such as gender, age, culture and dialect, among others.
Our response
A multi-task learning framework
Data61, in collaboration with partner researchers, has developed a framework based on multi-task learning, a subfield of machine learning in which multiple learning tasks are solved simultaneously.
Within the framework, auxiliary tasks such as gender identification and speaker recognition help the model learn rich and robust representations of the input, as labelled data for these tasks is plentiful.
By utilising this data during the training phase, the model learns from the similarities and differences across tasks. This improves the accuracy, and ultimately the performance, of SER, for which only limited labelled data is currently available.
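As a rough illustration, a multi-task model of this kind can be thought of as a shared encoder feeding one classification head per task. The PyTorch sketch below is a minimal, hypothetical example: the 1-D CNN encoder, layer sizes, loss weights and label counts are our own assumptions for illustration, not the architecture used in the study.

```python
import torch
import torch.nn as nn

class MultiTaskSER(nn.Module):
    """Shared encoder with one classification head per task (illustrative)."""

    def __init__(self, n_emotions=4, n_speakers=10, n_mels=40):
        super().__init__()
        # Shared representation: a small 1-D CNN over mel-spectrogram frames.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> fixed-size embedding
            nn.Flatten(),
        )
        # Primary task: emotion classification (scarce labels).
        self.emotion_head = nn.Linear(64, n_emotions)
        # Auxiliary tasks: labels are cheap and plentiful.
        self.gender_head = nn.Linear(64, 2)
        self.speaker_head = nn.Linear(64, n_speakers)

    def forward(self, x):
        # x: (batch, n_mels, time)
        z = self.encoder(x)
        return self.emotion_head(z), self.gender_head(z), self.speaker_head(z)


# Training step: sum the per-task cross-entropy losses so the shared
# encoder is shaped by all tasks at once (weights are arbitrary here).
model = MultiTaskSER()
x = torch.randn(8, 40, 100)            # dummy batch of spectrograms
y_emo = torch.randint(0, 4, (8,))
y_gen = torch.randint(0, 2, (8,))
y_spk = torch.randint(0, 10, (8,))

emo_logits, gen_logits, spk_logits = model(x)
loss = (nn.functional.cross_entropy(emo_logits, y_emo)
        + 0.5 * nn.functional.cross_entropy(gen_logits, y_gen)
        + 0.5 * nn.functional.cross_entropy(spk_logits, y_spk))
loss.backward()
```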
To get the most out of multi-task learning, an adversarial autoencoder (AAE) was incorporated into the framework. An AAE is an unsupervised learning model with a strong capability to learn powerful, discriminative features.
By combining the AAE with the supervised classification networks (emotion, speaker and gender classification), semi-supervised learning can be enabled for the AAE.
Jointly optimising the supervised multi-task objectives and the unsupervised AAE objective can lead to more discriminative SER models through semi-supervised learning.
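To make the joint objective concrete, the sketch below shows one plausible way to combine the unsupervised AAE terms (reconstruction plus an adversarial term on the latent code) with the supervised losses, computing the emotion loss only on the labelled subset. The decoder, discriminator, dimensions and equal weighting are illustrative assumptions and do not reproduce the published model; training of the discriminator against samples from the prior is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim = 64
decoder = nn.Linear(latent_dim, 40 * 100)          # reconstructs a flattened spectrogram
discriminator = nn.Sequential(                      # judges whether a code matches the prior
    nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))

def semi_supervised_loss(z, x, emo_logits, y_emo, labelled_mask, aux_losses):
    """Combine AAE (unsupervised) and multi-task (supervised) terms.

    z            : latent codes from the shared encoder
    x            : input spectrograms, shape (batch, 40, 100)
    labelled_mask: True where an emotion label exists; unlabelled utterances
                   still contribute to the unsupervised and auxiliary terms.
    """
    # Reconstruction term: every utterance, labelled or not.
    recon = F.mse_loss(decoder(z), x.flatten(1))

    # Adversarial term: encourage the encoder to produce codes the
    # discriminator mistakes for samples from the prior.
    adv = F.binary_cross_entropy_with_logits(
        discriminator(z), torch.ones(z.size(0), 1))

    # Emotion term: only the labelled subset.
    sup = torch.tensor(0.0)
    if labelled_mask.any():
        sup = F.cross_entropy(emo_logits[labelled_mask], y_emo[labelled_mask])

    return recon + adv + sup + sum(aux_losses)

# Example usage with dummy tensors (the shared encoder from the earlier
# sketch would normally produce z and the task logits).
x = torch.randn(8, 40, 100)
z = torch.randn(8, latent_dim)
emo_logits = torch.randn(8, 4)
y_emo = torch.randint(0, 4, (8,))
labelled = torch.tensor([True, True, False, False, True, False, True, False])
loss = semi_supervised_loss(z, x, emo_logits, y_emo, labelled, aux_losses=[])
loss.backward()
```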
The results
Overcoming the challenge of the limited availability of emotion datasets
On two popular, publicly available emotion datasets, our framework outperforms comparable state-of-the-art SER approaches that use similar methodologies and/or implementation strategies: supervised single- and multi-task methods based on Convolutional Neural Networks (CNNs), and single- and multi-task semi-supervised autoencoders.
We have observed this for both categorical and dimensional emotion classification, as well as for cross-corpus SER.
The proposed approach can overcome the challenge of the limited availability of emotion datasets, which is a significant contribution towards developing robust machine learning models for SER.
Future work will focus on tighter coupling between the generation of data and modelling a richer selection of speaker states and traits, whilst simultaneously aiming for a "holistic" speaker analysis.
We are also planning to integrate reinforcement learning into the framework in conjunction with a dialogue manager, a specialised system that acts as an interface between users and an application, using spoken language as the primary means of communication.
Reinforcement learning provides a framework in which an agent learns by interacting with its environment, making it well suited to real-life situations such as managing the state and flow of a conversation.
For more information see Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition.