Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition?
- Authors
- Abhinav Shukla, Stavros Petridis, Maja Pantic
- Number of authors
- 3
- Title
- Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition?
- Year of publication
- 2021
- Reference (APA)
- Shukla, A., Petridis, S., & Pantic, M. (2021). Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition? IEEE Transactions on Affective Computing, 14(1), 406‑420. https://doi.org/10.1109/TAFFC.2021.3062406
- Abstract
- Self-supervised learning has attracted plenty of recent research interest. However, most works for self-supervision in speech are typically unimodal and there has been limited work that studies the interaction between audio and visual modalities for cross-modal self-supervision. This work (1) investigates visual self-supervision via face reconstruction to guide the learning of audio representations; (2) proposes an audio-only self-supervision approach for speech representation learning; (3) shows that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features that are more robust in noisy conditions; (4) shows that self-supervised pretraining can outperform fully supervised training and is especially useful to prevent overfitting on smaller sized datasets. We evaluate our learned audio representations for discrete emotion recognition, continuous affect recognition and automatic speech recognition. We outperform existing self-supervised methods for all tested downstream tasks. Our results demonstrate the potential of visual self-supervision for audio feature learning and suggest that joint visual and audio self-supervision leads to more informative audio representations for speech and emotion recognition.
- Keywords
- Self-supervised learning, Representation learning, Generative modeling, Audiovisual speech, Emotion recognition, Speech recognition, Cross-modal self-supervision
- URL
- https://research.facebook.com/file/4493784684073249/Does-Visual-Self-Supervision-Improve-Learning-of-Speech-Representations-for-Emotion-Recognition.pdf
- DOI
- https://doi.org/10.1109/TAFFC.2021.3062406
- Article accessibility
- Open access
- Field
- Artificial Intelligence, Machine Learning
- Content type (theoretical / applied / methodological)
- Methodological, applied
- Method
-
Self-supervised learning (SSL). The method proposed in this article is a multi-task approach that combines the audio and visual modalities for self-supervised speech representation learning. Visual self-supervision via face reconstruction is used to guide the learning of audio representations. The authors evaluate their approach on several downstream tasks: discrete emotion recognition, continuous affect recognition, and automatic speech recognition.
- Use cases
-
The three self-supervised methods compared against are CPC, APC, and PASE.
- Article objectives
-
The objectives of the article are to investigate the benefits of combining the audio and visual modalities for self-supervised speech representation learning, and to propose a multi-task approach that outperforms existing self-supervised methods on several tasks.
"we investigate self-supervised learning for audio representations."
"we examine the state-of-the-art in self-supervised audio feature learning which we use as baselines. We then propose a novel visual self-supervised method and a novel audio-only self-supervised method for learning audio features. We also show how visual self-supervision helps encode emotional information into the audio features."
- Research question(s) / hypotheses / conclusion
- The research question is "Can combining audio and visual modalities for self-supervised learning in speech representation improve performance on tasks such as discrete emotion recognition, continuous affect recognition, and automatic speech recognition?"
- The hypothesis is that the proposed multi-task approach that combines audio and visual modalities for self-supervised learning in speech representation will outperform existing self-supervised methods for several tasks.
-
The authors conclude that their multi-task approach, which combines audio and visual self-supervision, outperforms existing self-supervised methods on all tested tasks: discrete emotion recognition, continuous affect recognition, and automatic speech recognition.
"proposed visual self-supervision is superior when compared to the proposed audio-only self-supervision. The results on both discrete and continuous affect recognition offer evidence that the learned representation is good for emotion."
"The models trained using a combination of audio and visual self-supervision are able to encode complementary information from each modality to yield the best possible representations among all tested methods in this work."
- Theoretical framework / cited authors
- The theoretical framework of the article is based on previous work in self-supervised learning and multimodal learning. The main authors cited include Hjelm et al., 2018; Owens et al., 2018; and Arandjelovic and Zisserman, 2017.
- Key concepts
- Discrete Emotion Recognition, Continuous Affect Recognition, Automatic Speech Recognition.
- Data collected (source type)
-
The CREMA-D dataset contains 91 actors who utter 12 sentences multiple times each, with a different level of intensity for each of 6 basic emotion labels (anger, fear, disgust, neutral, happy, sad).
The RAVDESS dataset contains 1440 samples from 24 different actors who acted out two sentences with 8 different basic emotions (anger, calm, sad, neutral, happy, disgusted, surprised, fear) at two different intensity levels.
The RECOLA dataset contains dyadic conversations in French between pairs of participants working to solve a collaborative task over video conference. The annotated part of the dataset consists of 5-minute-long clips (with audio from one speaker only) with continuous valence and arousal annotations from 6 annotators.
The SEWA dataset [58] contains dyadic conversations over video conference between pairs of participants discussing an advertisement they have just watched. The audio clips are typically 3 minutes long and have continuous valence and arousal annotations.
The IEMOCAP dataset [59] contains dyadic conversations between 10 speakers for a total of 12 hours of audiovisual data. The discrete emotion labels comprise 8 categories (anger, happiness, sadness, neutral, excitement, frustration, fear, surprise); however, only the first 4 categories (anger, happiness, sadness, neutral) are used in the experiments.
The SPC (Speech Commands) dataset contains 64,727 total utterances of 30 different words by 1,881 speakers. We use SPC as a speech recognition evaluation dataset.
The LRW dataset is a large, in-the-wild audiovisual speech dataset of 500 different isolated words, primarily from BBC recordings, and is thus appropriate for training the proposed methods.
- Definition of emotions
- Categorical emotions
- Scale of experimentation (data volume)
-
1440 samples from 24 different actors
12 hours of audiovisual data
64,727 total utterances of 30 different words by 1,881 speakers
500 different isolated words
- Associated technologies
- Self-supervised learning, Audio-visual modalities, Speech representation, Emotion recognition, Affect recognition, Automatic speech recognition.
- Mention of ethics
- No
- Communicative purpose
- To propose "a multi-task approach that outperforms existing self-supervised methods for discrete emotion recognition, continuous affect recognition, and automatic speech recognition."