Human emotions are expressed and recognized through multiple interacting signals such as facial expressions and both the prosody and the semantic content of utterances. While state-of-the-art speech interfaces lack the ability to understand and respond to emotions, we envision a future where conversational agents can understand and interpret the user's emotional state and respond appropriately. We conducted a simulator experiment to collect a multimodal dataset of human speech utterances in a car environment comprising three modalities: the audio signal of a spoken interaction, the visual signal of the driver's face, and the transcribed content of the utterances. While most existing approaches are limited to one modality, the goal of this project is to advance the use of multiple signals for emotion recognition in speech events.
First, because both commercially available software for real-time emotion recognition from uttered text and large text datasets from the automotive domain are lacking, we used a neural transfer learning approach for emotion recognition from text that utilizes existing resources from other domains. We find that transfer learning enables models based on out-of-domain corpora to perform well. This method contributes up to 10 percentage points in F1, with a micro-average F1 of up to 76 across the emotions joy, annoyance, and insecurity.
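For readers unfamiliar with the metric, the following is a minimal sketch of how a micro-average F1 over the three emotions can be computed: true positives, false positives, and false negatives are pooled across classes before precision and recall are taken. The function name `micro_f1`, the `no_emotion` label, and the toy labels are illustrative assumptions, not the evaluation code or data of this project.

```python
def micro_f1(gold, pred, labels):
    """Micro-average F1: pool TP/FP/FN over all target labels, then compute F1."""
    tp = fp = fn = 0
    for label in labels:
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1          # predicted the label and it was correct
            elif p == label:
                fp += 1          # predicted the label, gold was something else
            elif g == label:
                fn += 1          # gold was the label, prediction missed it
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical labels, not from the collected dataset).
EMOTIONS = ["joy", "annoyance", "insecurity"]
gold = ["joy", "annoyance", "insecurity", "joy", "no_emotion"]
pred = ["joy", "joy", "insecurity", "annoyance", "no_emotion"]
score = micro_f1(gold, pred, EMOTIONS)
```

Because the counts are pooled, frequent emotions influence the score more than rare ones, which is the usual reason to report it alongside per-class results.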
To improve recognition performance, we investigate how information from the different channels (prosody, uttered text, and face) can be combined. We examine fusion approaches at different levels, as well as different machine learning algorithms, aiming to find the most suitable approach for real-time multimodal emotion recognition in speech events.
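As a rough illustration of what fusion at different levels can mean, the sketch below contrasts feature-level (early) fusion, where per-modality feature vectors are concatenated before classification, with decision-level (late) fusion, where per-modality class probabilities are combined by a weighted average. The function names, the weighting scheme, and the example probabilities are assumptions made for illustration only; they do not describe the fusion methods actually evaluated in this project.

```python
EMOTIONS = ["joy", "annoyance", "insecurity"]


def early_fusion(feature_vectors):
    """Feature-level fusion: concatenate per-modality feature vectors."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused  # would be passed to a single classifier


def late_fusion(probs_by_modality, weights=None):
    """Decision-level fusion: weighted average of per-modality class probabilities."""
    modalities = list(probs_by_modality)
    if weights is None:
        weights = {m: 1.0 for m in modalities}  # assumption: equal trust in each channel
    total = sum(weights[m] for m in modalities)
    fused = [
        sum(weights[m] * probs_by_modality[m][i] for m in modalities) / total
        for i in range(len(EMOTIONS))
    ]
    best = max(range(len(EMOTIONS)), key=fused.__getitem__)
    return EMOTIONS[best], fused


# Hypothetical classifier outputs for one utterance, one distribution per channel.
probs = {
    "prosody": [0.2, 0.5, 0.3],
    "text": [0.6, 0.3, 0.1],
    "face": [0.5, 0.2, 0.3],
}
label, fused_probs = late_fusion(probs)
joint_features = early_fusion([[1, 2], [3], [4, 5]])
```

Late fusion keeps each channel's model independent, so a missing modality (for example, an occluded face) can be dropped by setting its weight to zero, whereas early fusion lets a single model learn interactions between channels.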