SS-VAERR: Self-Supervised Apparent Emotional Reaction Recognition from Video
- Authors
- Marija Jegorova, Stavros Petridis, Maja Pantic
- Number of authors
- 3
- Title
- SS-VAERR: Self-Supervised Apparent Emotional Reaction Recognition from Video
- Year of publication
- 2023
- Reference (APA)
- Jegorova, M., Petridis, S., & Pantic, M. (2023). SS-VAERR: Self-Supervised Apparent Emotional Reaction Recognition from Video. 2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG), 1-8. https://doi.org/10.1109/FG57933.2023.10042638
- Abstract
- This work focuses on the apparent emotional reaction recognition (AERR) from the video-only input, conducted in a self-supervised fashion. The network is first pre-trained on different self-supervised pretext tasks and later fine-tuned on the downstream target task. Self-supervised learning facilitates the use of pre-trained architectures and larger datasets that might be deemed unfit for the target task and yet might be useful to learn informative representations and hence provide useful initializations for further fine-tuning on smaller more suitable data. Our presented contribution is two-fold: (1) an analysis of different state-of-the-art (SOTA) pretext tasks for the video-only apparent emotional reaction recognition architecture, and (2) an analysis of various combinations of the regression and classification losses that are likely to improve the performance further. Together these two contributions result in the current state-of-the-art performance for the video-only spontaneous apparent emotional reaction recognition with continuous annotations.
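For continuous valence-arousal prediction, the regression term in such regression/classification loss combinations is commonly based on the Concordance Correlation Coefficient (CCC). The sketch below mixes a (1 - CCC) term with a simple classification surrogate over discretized emotion bins; the mixture weight `alpha`, the binning, and the 0/1 surrogate are illustrative assumptions, not the paper's published recipe.

```python
import numpy as np

def ccc(pred, gold):
    """Concordance Correlation Coefficient between two 1-D series."""
    mp, mg = pred.mean(), gold.mean()
    cov = ((pred - mp) * (gold - mg)).mean()
    return 2 * cov / (pred.var() + gold.var() + (mp - mg) ** 2)

def combined_loss(pred, gold, n_bins=3, alpha=0.5):
    """Illustrative mix of a CCC regression loss (1 - CCC) with a
    0/1 classification surrogate over discretized emotion bins."""
    reg = 1.0 - ccc(pred, gold)
    edges = np.linspace(-1, 1, n_bins + 1)[1:-1]  # interior bin boundaries in [-1, 1]
    cls = float((np.digitize(pred, edges) != np.digitize(gold, edges)).mean())
    return alpha * reg + (1 - alpha) * cls

t = np.linspace(-1, 1, 100)
print(combined_loss(t, t))   # perfect prediction -> 0.0
print(combined_loss(t, -t))  # anti-correlated prediction -> high loss
```

A CCC of 1 (perfect agreement) zeroes the regression term, so the loss vanishes only when both the continuous values and their discretized classes match.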
- URL
- https://research.facebook.com/file/1055065362555485/SS-VAERR-Self-Supervised-Apparent-Emotional-Reaction-Recognition-from-Video.pdf
- DOI
- https://doi.org/10.1109/FG57933.2023.10042638
- Article accessibility
- Open access
- Field
- Computer vision, machine learning
- Content type (theoretical / applied / methodological)
- Theoretical, methodological
- Method
-
Self-supervised learning (SSL). Three suitable pretext methods are examined: LiRA [36], BYOL [13], and DINO [14].
Datasets are converted into gray-scale videos and cropped around the face to 96×96 based on landmark detection; more specifically, the RetinaFace face detector [43] and the Face Alignment Network (FAN) [44] are used to detect 68 facial landmarks and crop the face accordingly.
- Use case
- N/A
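The gray-scale conversion and landmark-based 96×96 face crop described under Method can be sketched as below. The 68 landmarks are assumed to be precomputed (standing in for RetinaFace/FAN output), and the crop margin and nearest-neighbour resize are illustrative choices, not the paper's exact pipeline.

```python
import numpy as np

def crop_face(frame_rgb, landmarks, size=96, margin=0.3):
    """Crop a square region around 68 facial landmarks, convert it to
    gray-scale, and resize to size x size (nearest-neighbour)."""
    xs, ys = landmarks[:, 0], landmarks[:, 1]
    cx, cy = xs.mean(), ys.mean()
    # half-width of the square crop, padded by `margin`
    half = (1 + margin) * max(xs.max() - xs.min(), ys.max() - ys.min()) / 2
    x0, y0 = int(max(cx - half, 0)), int(max(cy - half, 0))
    x1 = int(min(cx + half, frame_rgb.shape[1]))
    y1 = int(min(cy + half, frame_rgb.shape[0]))
    gray = frame_rgb[y0:y1, x0:x1].mean(axis=2)  # naive channel-average gray-scale
    ri = np.linspace(0, gray.shape[0] - 1, size).astype(int)
    ci = np.linspace(0, gray.shape[1] - 1, size).astype(int)
    return gray[np.ix_(ri, ci)]

rng = np.random.default_rng(0)
frame = rng.random((256, 256, 3))          # stand-in video frame
lm = np.stack([rng.uniform(80, 180, 68),   # stand-in 68 (x, y) landmarks
               rng.uniform(90, 190, 68)], axis=1)
out = crop_face(frame, lm)
print(out.shape)  # (96, 96)
```

In practice the detector runs per frame (or on a keyframe) and the same crop is applied across the whole clip to produce the 96×96 gray-scale video input.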
- Objectives of the article
-
"(1) a review of several pretext tasks for apparent emotional reaction recognition from video for their downstream performance across several spontaneous emotion datasets; (2) analysis of the impact of the combined regression and classification losses, data augmentations, and downstream learning parameters; (3) adding up to the first to our knowledge Self-Supervised Visual Apparent Emotional Reaction Recognition method for spontaneous emotions with continuous annotations, SS-VAERR."
- Research question(s) / Hypotheses / Conclusion
-
The research question is how to improve video-only spontaneous AERR with continuous annotations using self-supervised learning and fine-tuning on downstream tasks.
"we have presented the first to our knowledge self-supervised technique for the video-only natural apparent emotional reactions recognition, yielding the current state-of-the-art (or closely comparable) results for video-only natural AERR."
The hypothesis is that self-supervised learning and fine-tuning on downstream tasks can improve video-only spontaneous AERR with continuous annotations.
"we argue that the facial apparent emotional reactions recognition is highly data-specific. [...] video tends to be a better indicator for the video-aided recognition, and arousal tends to be better detected from audio modality. This makes the video-only AERR particularly challenging in terms of identifying the correct levels of arousal, and explains valence-arousal discrepancy for several results in this paper."
The conclusions are that the proposed method achieves state-of-the-art performance for video-only spontaneous AERR with continuous annotations, and that the choice of pretext task and the combination of losses affect downstream performance.
"The self-supervised setting alone helps beating (or at least reaching comparable results with) the current state-of-the-art without even touching upon the loss function design."
- Theoretical framework / Cited authors
- The theoretical framework of the article includes self-supervised learning and apparent emotional reaction recognition. The main authors cited include E. Sanchez, M. K. Tellamekala, M. Hu, Q. Chu, J. Kossaifi, and R. Walecki.
- Key concepts
- Emotional reaction recognition
- Data collected (source type)
-
The Lip Reading Sentences 3 dataset (LRS3) [42], containing thousands of spoken sentences from TED and TEDx videos.
The SEWA database consists of videos of volunteers watching adverts chosen to elicit apparent emotional reactions, and later discussing what they have seen.
RECOLA is a database of multi-domain data recordings of native French-speaking participants completing a collaborative task in pairs during a video-conference call, collected in France.
- Definition of emotions
- Continuous (dimensional) annotations of valence and arousal
- Scale of experimentation (volume of accounts)
- Thousands
- Associated technologies
-
RetinaFace detector, Face Alignment Network (FAN), machine learning, deep learning, computer vision
- Mention of ethics
- No
- Communicative purpose
- As a result, we achieve the current state-of-the-art performance for video-only spontaneous AERR with continuous annotations.
- Comments
- Supplementary video content: https://www.facebook.com/watch/?v=3040865399547468