Eyemotion: Classifying Facial Expressions in VR Using Eye-Tracking Cameras
- Authors
- Hickson, Steven; Dufour, Nick; Sud, Avneesh; Kwatra, Vivek; Essa, Irfan
- Number of authors
- 5
- Title
- Eyemotion: Classifying Facial Expressions in VR Using Eye-Tracking Cameras
- Year of publication
- 2019
- Reference (APA)
- Hickson, S., Dufour, N., Sud, A., Kwatra, V., & Essa, I. (2019). Eyemotion: Classifying Facial Expressions in VR Using Eye-Tracking Cameras. 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), 1626-1635. https://doi.org/10.1109/WACV.2019.00178
- Keywords
- N/A
- URL
- https://ieeexplore.ieee.org/document/9052937
- doi
- https://doi.org/10.1109/WACV.2019.00178
- Article accessibility
- Open access
- Field
- Machine Intelligence
- Machine Perception
- Content type (theoretical / applicative / methodological)
- Applicative
- Method
-
We propose a new approach aimed at classification of facial action units (AUs) [12] and ‘emotive’ expressions using only internally mounted infrared cameras within the HMD.
We are motivated by the recent availability of commercial HMDs with eye-tracking cameras [1], which use infrared cameras. Our model classifies user expressions using only limited periocular eye image data, which is further limited by the large amount of intra-class variation among users.
Recently convolutional neural networks (CNNs) [20, 15, 35] have performed very well on image classification tasks and are pervasive in machine learning and computer vision.
Our approach, based on deep learning, outperforms normal human accuracy and even advanced (trained-user) human accuracy for categorizing select facial expressions from our dataset of only IR eye images.
- Use case
- HMD headsets with integrated infrared cameras
- Article objectives
- We propose to recognize and convey facial expressions from inside a VR HMD.
- Research question(s)/Hypotheses/Conclusion
- Research question(s): Virtual reality equipment using head-mounted displays (HMD) makes natural expressions difficult to recognize as half the face is occluded. Thus for VR systems to provide rich social interaction, faithfully representing these expressions in some manner is absolutely critical.
- Hypothesis(es): We propose a new approach aimed at classification of facial action units (AUs) [12] and 'emotive' expressions using only internally mounted infrared cameras within the HMD. We are motivated by the recent availability of commercial HMDs with eye-tracking cameras [1], which use infrared cameras (Fig. 1B). These are used for tracking [29], but in our work we use the same input images for expression classification.
- Conclusion(s): Our primary contributions are: (1) Demonstrating that the information required to classify a variety of facial expressions is reliably present in IR eye images captured by a commercial HMD sensor, and that this information can be decoded using a CNN-based method. (2) A novel personalization technique to improve CNN accuracy on new users. Across experiments, personalization resulted in a 4% accuracy improvement on average, and was statistically significant for a set of basic 'emotive' expressions (p = 0.018) and AUs (p = 0.001) (Section 4.2). (3) The collection of a unique dataset (Section 3) of eye images paired with expression labels, collected with two separate commercial HMDs each with 23 different users. (4) We show our method can be used to generate expressive avatars in real-time, which can function as an expressive surrogate for users engaged in VR environments (Section 5.2).
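The conclusions report a personalization step that improves accuracy on new users by about 4% on average, but this record does not spell out the mechanism. Purely as an illustration, under the assumption that personalization means calibrating each input against the user's own neutral (AU0) images, a minimal sketch could look like:

```python
import numpy as np

def personalize(eye_img, user_neutral_imgs):
    """Illustrative sketch (not the paper's documented method): subtract the
    user's mean neutral-expression image so the classifier sees
    person-relative deviations rather than absolute appearance."""
    neutral_mean = np.mean(user_neutral_imgs, axis=0)
    return eye_img - neutral_mean
```

Any per-user calibration of this kind requires a short enrollment phase (a few labeled neutral frames) before classification begins.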
- Theoretical framework/Authors
- Expression classification from visual data (Pantic & Rothkrantz, 2000; Fasel & Luettin, 2003; Bettadapura, 2012; Sariyanidi, Gunes, & Cavallaro, 2015; Saatci & Town, 2006; Tian, Kanade, & Cohn, 2001)
- Expression classification with alternate sensors (Scheirer, Fernandez, & Picard, 1999; Masai et al., 2016; Dhall et al., 2016; Suzuki et al., 2016)
- Gaze tracking in VR (Burgos-Artizzu et al., 2015; Li et al., 2015; Olszewski et al., 2016; Thies et al., 2016; Zhao et al., 2016)
- Key concepts
- Classification of facial expressions
- Sentiment analysis
- Data collected (type, source)
-
We collected a subset of facial action units that influence the upper face, and could be reliably performed by multiple subjects. We also distinguish between left and right AUs, where applicable. These are Neutral (AU0), Left Brow Raise (AU1+2L), Right Brow Raise (AU1+2R), Brow Lower (AU4), Upper Lid Raise (AU5), Squint (AU44), Both Eyes Closed (AU43), Left Wink (AU46L), Right Wink (AU46R), and Cheek Raise (AU6).
We also collect ‘emotive’ expressions for basic emotions as defined by [12], which are Neutral, Anger, Surprise, and Happiness.
We collected these data by asking users to form an expression, giving them an example from an exemplar video. While this may not result in spontaneous expressions [41], it provides explicit labels for each expression. To provide a realistic exemplar, we first recorded videos of trained actors performing each expression for the participant to use as a reference. During the collection process, for each expression, we provide to the participant the name of the expression, a looped clip of an actor performing the expression, and a live video of the participant in order for them to practice the expression. [...] This continues for all expressions and AUs (these are the images in column 1 of Fig. 2). We then have them put on the HMD and repeat the process twice more, taking the headset off and putting it back on to account for slippage and variation in fit. Each of these headset repetitions constitutes a 'session.'
- Definition of emotions
- No definition
- Refers to the emotions defined by Ekman and Friesen.
- Scale of experimentation (number of accounts)
-
We collected data with two separate HMDs [...].
Data were collected from 23 different participants per HMD (46 in total), spanning different genders, ethnicities, and hair colors.
Of the 46 participants: 16 were female; 16 were aged 35 or over from an age range of 18 to 64 with a median age of 30; 11 participants had non-brown eyes and 4 had non-brown or black hair. 25 of our participants were nonwhite, with 9 Asian, 7 east Indian or south Asian, 4 two or more races, 3 Hispanic or Latino, 2 African American, and 2 preferring not to say.
Approximately 50,000 eye image pairs were collected per HMD (about 2,000 per participant).
- Associated technologies
- HMDs with near-IR (880 nm) cameras
- Convolutional neural networks (CNN) to learn an embedding describing expressions and emotions using infrared eye images (variant of the widespread Inception architecture [37] using the TensorFlow library)
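The paper's model is a TensorFlow variant of the Inception architecture; as a toy illustration only (layer shapes and weights here are invented, not the paper's), the basic convolution-pool-softmax pipeline a CNN classifier builds on can be sketched in NumPy:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 2D 'valid' convolution of a grayscale image with one kernel."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify(eye_img, kernel, class_weights):
    """Toy forward pass: conv -> ReLU -> global average pool -> softmax."""
    feat = np.maximum(conv2d_valid(eye_img, kernel), 0)  # ReLU activation
    pooled = feat.mean()                                 # global average pool
    logits = class_weights * pooled                      # tiny "dense" layer
    return softmax(logits)                               # class probabilities
```

A real Inception-style network stacks many such convolutions with learned kernels; this sketch only shows the data flow from an eye image to a probability distribution over expression classes.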
- Mention of ethics
- N/A
- Communicative purpose
-
Facial expressions are essential for interpersonal communication and social interaction. They provide a means for conveying thought and emotion through visual cues that may not be easy to articulate verbally. However, virtual reality (VR) equipment using head-mounted displays (HMD) makes natural expressions difficult to recognize as half the face is occluded. Thus for VR systems to provide rich social interaction, faithfully representing these expressions in some manner is absolutely critical.
We have demonstrated, using consumer-grade eye-tracking cameras that are already being included in VR headsets, a means to preserve and transmit social information among users engaged in VR.
- Abstract
- One of the main challenges of social interaction in virtual reality settings is that head-mounted displays occlude a large portion of the face, blocking facial expressions and thereby restricting social engagement cues among users. We present an algorithm to automatically infer expressions by analyzing only a partially occluded face while the user is engaged in a virtual reality experience. Specifically, we show that images of the user's eyes captured from an IR gaze-tracking camera within a VR headset are sufficient to infer a subset of facial expressions without the use of any fixed external camera. Using these inferences, we can generate dynamic avatars in real-time which function as an expressive surrogate for the user. We propose a novel data collection pipeline as well as a novel approach for increasing CNN accuracy via personalization. Our results show a mean accuracy of 74% (F1 of 0.73) among 5 'emotive' expressions and a mean accuracy of 70% (F1 of 0.68) among 10 distinct facial action units, outperforming human raters.
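The abstract reports mean accuracy (74%) and F1 (0.73) over the expression classes. For reference, a minimal sketch of how such metrics are computed (the labels below are illustrative examples, not the paper's data):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    f1s = []
    for c in set(y_true) | set(y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

The paper does not state which averaging it uses; macro averaging is shown here because it treats each expression class equally regardless of class frequency.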
- Site pages
- Content
Part of Eyemotion: Classifying Facial Expressions in VR Using Eye-Tracking Cameras