3D Human Sensing, Action and Emotion Recognition in Robot Assisted Therapy of Children With Autism
- Authors
- Marinoiu, Elisabeta; Zanfir, Mihai; Olaru, Vlad; Sminchisescu, Cristian
- Number of Authors
- 4
- Title
- 3D Human Sensing, Action and Emotion Recognition in Robot Assisted Therapy of Children With Autism
- Year of Publication
- 2018
- Reference (APA)
- Marinoiu, E., Zanfir, M., Olaru, V., & Sminchisescu, C. (2018). 3D Human Sensing, Action and Emotion Recognition in Robot Assisted Therapy of Children With Autism. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, 2158‑2167. https://doi.org/10.1109/CVPR.2018.00230
- Keywords
- ND
- URL
- https://openaccess.thecvf.com/content_cvpr_2018/html/Marinoiu_3D_Human_Sensing_CVPR_2018_paper.html
- Article Accessibility
- Open access
- Field
- Machine Intelligence
- Machine Perception
- Content Type (theoretical / applied / methodological)
- Applied
- Method
-
We analyze a large scale video dataset containing child-therapist interactions and subtle behavioral annotations.
We adapt state-of-the-art 3d human pose estimation models to this setting, making it possible to reliably track and reconstruct both the child and the therapist from RGB data, at performance levels comparable with those of an industrial-grade Kinect system.
We establish several action and emotion recognition baselines, including systems based on child representations, and models that jointly capture the child and the therapist.
- Use Case
- Robot-assisted autism treatment
- Objectives of the Article
-
In this paper, we introduce fine-grained action classification and emotion prediction tasks defined on non-staged videos, recorded during robot-assisted therapy sessions of children with autism.
Our long-term goal is to automatically interpret and react to a child’s actions in the challenging setting of a therapy session. In order to understand the child, we rely on high-level features associated with her/his 3d pose and shape.
- Research Question(s)/Hypotheses/Conclusion
- Research question(s): We have introduced large-scale fine-grained action and emotion recognition tasks defined on non-staged videos recorded during robot-assisted therapy sessions of children with autism. The tasks are challenging due to the large number of sequences (over 3,700), long videos (10-15 minutes each), the large number of highly variable actions (37 child action classes, 19 therapist actions), and because children are only partially visible and observed under non-standard camera viewpoints. Age variance and unpredictable behavior add to the challenges.
- Hypothesis(es): We investigated how state-of-the-art RGB 3d human pose reconstruction methods combining feedforward and feedback components can be adapted to the problem, and evaluated multiple action and emotion recognition baselines based on 2d and 3d representations of the child and therapist.
- Conclusion(s): Our results indicate that, when properly adapted, current 2d and 3d reconstruction methods from RGB data are competitive with industrial-grade RGB-D Kinect systems. With action recognition baselines in the 40-50% performance range, the large-scale data we introduce represents a challenge in modeling behavior, with impact in both computer vision and child-robot interaction with applications to autism.
- Theoretical Framework/Authors
- Treating autism with predictable systems (Gizzonio et al., 2014; Ramdoss et al., 2011; Moore, McGrath, & Thorpe, 2000)
- Interaction approaches based on humanoid robots (Esteban et al., 2017; Farr, Yuill, & Raffle, 2010; Pop et al., 2013; Salvador, Silver, & Mahoor, 2015; Wainer et al., 2014)
- Facial expressions and emotion understanding (Kossaifi et al., 2017; Nicolaou, Gunes, & Pantic, 2010; Mollahosseini, Hasani, & Mahoor, 2017)
- Body language and emotion understanding (Lhommet & Marsella, 2015)
- Automatically detecting, classifying and interpreting human actions from body pose features (Liu et al., 2016; Ke et al., 2017; Huang et al., 2016; Du, Wang, & Wang, 2015; Zanfir, Leordeanu, & Sminchisescu, 2013)
- Estimating 2d and 3d human pose from RGB or RGB-D sensors (Cao et al., 2017; Popa, Zanfir, & Sminchisescu, 2017; Loper et al., 2015; Bogo et al., 2016; Martinez et al., 2017; Pavlakos et al., 2017; Zhou et al., 2016)
- Key Concepts
- Detection, classification and interpretation of human action and emotions
- Data Collected (type, source)
-
The DE-ENIGMA dataset contains multi-modal recordings of therapy sessions of children with autism. [...]
A selection of recordings from multiple therapy sessions was annotated. The selection of children covers a variety of gestures and interactions typical of therapy sessions. The annotation of the therapy videos relies on an extensive web-based tool we developed that can (i) select temporal extents and (ii) assign them a class label.
A video selection from [DE-ENIGMA] was also annotated with continuous emotions in a valence-arousal space by 5 specialized therapists. The valence axis specifies whether the emotion is positive or negative, whereas arousal controls its intensity.
- Definition of Emotions
- No definition
- Collaboration with 5 specialized therapists to classify emotions
- Positive and negative labeling
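The valence-arousal representation described above can be illustrated as a pair of continuous coordinates per annotation. The sketch below is a minimal, hypothetical example (the names, value ranges, and aggregation rule are assumptions, not the paper's actual code or data): each of five annotators contributes a (valence, arousal) pair for a frame, and a simple average produces a consensus point.

```python
from statistics import mean

# Hypothetical per-frame annotations: annotator id -> (valence, arousal).
# valence in [-1, 1]: negative vs. positive emotion;
# arousal in [-1, 1]: intensity of the emotion. Values are illustrative.
annotations = {
    "t1": (0.6, 0.4),
    "t2": (0.5, 0.5),
    "t3": (0.7, 0.3),
    "t4": (0.4, 0.6),
    "t5": (0.6, 0.5),
}

def consensus(frame_annotations):
    """Average the per-annotator (valence, arousal) pairs for one frame."""
    valences = [v for v, _ in frame_annotations.values()]
    arousals = [a for _, a in frame_annotations.values()]
    return mean(valences), mean(arousals)

valence, arousal = consensus(annotations)
print(round(valence, 2), round(arousal, 2))  # prints: 0.56 0.46
```

Averaging is only one possible aggregation; inter-annotator agreement measures would be needed before treating such a consensus as ground truth.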
- Scale of Experimentation (volume of accounts)
-
A selection of recordings from multiple therapy sessions of 7 children was annotated with 37 action classes.
We have annotated a total of 3757 sequences, with an average duration of 2.1 seconds.
The experiments presented in this paper use a subset of 2031 annotated sequences spanning 24 classes common to all children. [...] Among the annotated sequences, around a third (749 out of 2,031) are interacting sequences.
- Associated Technologies
- 2D and 3D ("3d skeleton data")
- RGBD sensors such as Kinect
- Humanoid robots
- Convolutional Neural Networks
- Recurrent Neural Networks (hierarchical bidirectional recurrent network baseline, HBRNN)
- Multitask deep neural network (DMHS)
- Mention of Ethics
- ND
- Communicative Purpose
- With action recognition baselines in the 40-50% performance range, the large-scale data we introduce represents a challenge in modeling behavior, with impact in both computer vision and child-robot interaction with applications to autism.
- Abstract
- We introduce new, fine-grained action and emotion recognition tasks defined on non-staged videos, recorded during robot-assisted therapy sessions of children with autism. The tasks present several challenges: a large dataset with long videos, a large number of highly variable actions, children that are only partially visible, have different ages and may show unpredictable behaviour, as well as non-standard camera viewpoints. We investigate how state-of-the-art 3d human pose reconstruction methods perform on the newly introduced tasks and propose extensions to adapt them to deal with these challenges. We also analyze multiple approaches in action and emotion recognition from 3d human pose data, establish several baselines, and discuss results and their implications in the context of child-robot interaction.