Speech Sentiment Analysis via Pre-Trained Features from End-to-End ASR Models
- Author(s)
- Lu, Zhiyun; Cao, Liangliang; Zhang, Yu; Chiu, Chung-Cheng; Fan, James
- Number of authors
- 5
- Title
- Speech Sentiment Analysis via Pre-Trained Features from End-to-End ASR Models
- Year of publication
- 2020
- Reference (APA)
- Lu, Z., Cao, L., Zhang, Y., Chiu, C.-C., & Fan, J. (2020). Speech Sentiment Analysis via Pre-Trained Features from End-to-End ASR Models. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7149‑7153. https://doi.org/10.1109/ICASSP40776.2020.9052937
- Keywords
- Speech sentiment analysis, ASR pretraining, End-to-end ASR model
- URL
- https://ieeexplore.ieee.org/document/9052937
- doi
- https://doi.org/10.1109/ICASSP40776.2020.9052937
- Article accessibility
- Open access
- Field
- Speech Processing
- Content type (theoretical / applicative / methodological)
- Applicative
- Method
- We propose to use pre-trained features from an end-to-end (e2e) ASR model to solve sentiment analysis; the best-performing sentiment decoder is an RNN with self-attention.
- Use case
- ND
- Objectives of the article
- In this paper, we propose to use pre-trained features from end-to-end ASR models to solve speech sentiment analysis as a down-stream task.
- Research question(s)/Hypotheses/Conclusion
- Research question(s): The key challenge in speech sentiment analysis is how to learn a good representation that captures the emotional signals and remains invariant under different speakers, acoustic conditions, and other natural speech variations. In this work, we introduce a new direction to tackle the challenge.
- Hypothesis(es): We propose to use end-to-end (e2e) automatic speech recognition (ASR) as pre-training, and to solve speech sentiment as a down-stream task. This approach is partially motivated by the success of pre-training in solving tasks with limited labeled data in both computer vision and language. Moreover, the e2e model combines both the acoustic and language models of traditional ASR, and can thus seamlessly integrate the acoustic and text features into one representation. We hypothesize that the ASR pre-trained representation works well on sentiment analysis.
- Conclusion(s): We evaluate the performance of pre-trained ASR features on both IEMOCAP and SWBD-sentiment. On IEMOCAP, we improve the state-of-the-art sentiment analysis accuracy from 66.6% to 71.7%. On SWBD-sentiment, we achieve 70.10% accuracy on the test set, outperforming strong baselines.
- Theoretical framework/Authors
- Speech sentiment analysis (Li et al., 2019; Li et al., 2018; Wu et al., 2019; Xie et al., 2019; Tzirakis, Zhang, & Schuller, 2018)
- Speech and text sentiment analysis (Kim & Shin, 2019; Prabhavalkar et al., 2017; Cho et al., 2018; Gu et al., 2018)
- End-to-end automatic speech recognition (Prabhavalkar et al., 2017; Rao, Sak, & Prabhavalkar, 2017; Chiu et al., 2018; He et al., 2019)
- Key concepts
- Speech sentiment analysis
- Data collected (type, source)
-
We use two datasets, IEMOCAP and SWBD-sentiment, in the experiments. IEMOCAP [18] is a well-benchmarked speech emotion recognition dataset. Following the protocol in [1, 7, 9, 10], we experiment on a subset of the data, which contains 4 emotion classes {happy+excited, neutral, sad, angry}, with {1708, 1084, 1636, 1103} utterances respectively.
To further investigate the speech sentiment task, we annotate a subset of Switchboard telephone conversations [17] with three sentiment labels, i.e. negative, neutral, and positive, and create the SWBD-sentiment dataset.
- Definition of emotions
- No definition
- Use of sentiment categories/groups
- Negative, neutral, positive labeling
- Scale of experimentation (volume of accounts)
-
IEMOCAP: contains approximately 12 hours of audiovisual recordings of both scripted and improvised interactions performed by actors.
SWBD-sentiment: over 140 hours of speech, containing approximately 49.5k utterances.
- Associated technologies
- Automatic Speech Recognition (ASR)
- End-to-end Automatic Speech Recognition
- SpecAugment
- Mention of ethics
- ND
- Communicative purpose
-
Speech sentiment analysis is an important problem for interactive intelligence systems, with broad applications in many industries, e.g., customer service, health care, and education.
Moreover, we create a large-scale speech sentiment dataset, SWBD-sentiment, to facilitate future research in this field. Our future work includes experimenting with unsupervised learnt speech features, as well as applying end-to-end ASR features to other down-stream tasks such as diarization and speaker identification.
- Summary
- In this paper, we propose to use pre-trained features from end-to-end ASR models to solve speech sentiment analysis as a down-stream task. We show that end-to-end ASR features, which integrate both acoustic and text information from speech, achieve promising results. We use an RNN with self-attention as the sentiment classifier, which also provides an easy visualization through attention weights to help interpret model predictions. We use the well-benchmarked IEMOCAP dataset and a new large-scale speech sentiment dataset, SWBD-sentiment, for evaluation. Our approach improves the state-of-the-art accuracy on IEMOCAP from 66.6% to 71.7%, and achieves an accuracy of 70.10% on SWBD-sentiment with more than 49,500 utterances.
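The Method field above names an RNN with self-attention as the sentiment decoder over pre-trained ASR features. As a minimal sketch, the pure-Python snippet below illustrates only the attention-weighted pooling step of such a classifier: frame-level features are scored, normalized over time with a softmax, and averaged. The frame features, the scoring vector `w_score`, and all values are hypothetical toy data, not the paper's implementation.

```python
import math
import random

random.seed(0)

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_pool(features, w_score):
    # Score each frame by a dot product with a (learned) scoring vector,
    # normalize the scores over time with softmax, and return the
    # attention-weighted average of the frame features.
    scores = [sum(w * f for w, f in zip(w_score, frame)) for frame in features]
    alphas = softmax(scores)
    dim = len(features[0])
    pooled = [sum(a * frame[d] for a, frame in zip(alphas, features))
              for d in range(dim)]
    return pooled, alphas

# Toy stand-in for ASR encoder outputs: T=4 frames of dimension 3
# (hypothetical values; a real model would produce these from audio).
features = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(4)]
w_score = [0.5, -0.2, 0.1]  # hypothetical scoring vector

pooled, alphas = attention_pool(features, w_score)
```

Because the weights `alphas` sum to one, frames with high weight are the ones the classifier attends to, which is what enables the attention-weight visualization of model predictions mentioned in the summary.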