Textless Speech Emotion Conversion using Discrete & Decomposed Representations
- Author(s)
- Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi
- Number of authors
- 10
- Title
- Textless Speech Emotion Conversion using Discrete & Decomposed Representations
- Year of publication
- 2022
- Reference (APA)
- Kreuk, F., Polyak, A., Copet, J., Kharitonov, E., Nguyen, T.-A., Rivière, M., Hsu, W.-N., Mohamed, A., Dupoux, E., & Adi, Y. (2022). Textless Speech Emotion Conversion using Discrete & Decomposed Representations. arXiv. https://doi.org/10.48550/arXiv.2111.07402
- Abstract
- Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion. First, we modify the speech content by translating the phonetic-content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is vastly superior to current approaches and even beats text-based systems in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method. Samples are available online.
- URL
- https://research.facebook.com/file/509033718007989/Textless-Speech-Emotion-Conversion-using-Discrete-and-Decomposed-Representations.pdf
- DOI
- https://doi.org/10.48550/arXiv.2111.07402
- Article accessibility
- Open access
- Field
- Natural Language Processing & Speech
- Content type (theoretical / applied / methodological)
- Methodological, applied
- Method
- The speech signal is decomposed into discrete learned representations: phonetic-content units, prosodic features, speaker identity, and emotion. Emotion conversion is then cast as a translation task: the phonetic-content units are translated to the target emotion, the prosodic features are predicted from the translated units, and a neural vocoder synthesizes the waveform from the predicted representations. This design allows the method to model non-verbal vocalizations (e.g. laughter insertion, yawning removal) while preserving the lexical content and speaker identity.
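The three-stage pipeline described above can be sketched schematically. This is a toy illustration only: the function bodies, the unit-translation table, and the emotion labels are hypothetical stand-ins, where the real system uses HuBERT-derived discrete units, a sequence-to-sequence Transformer for unit translation, an F0 prediction network, and a neural vocoder.

```python
# Toy sketch of the decomposed-representation pipeline (all names hypothetical).
from dataclasses import dataclass

@dataclass
class DecomposedSpeech:
    content_units: list   # discrete phonetic-content units (real: HuBERT + k-means)
    f0: list              # prosodic features (real: fundamental-frequency contour)
    speaker: str          # speaker identity (real: a learned embedding)
    emotion: str          # emotion label

# Stand-in "translation" table mapping source units to target-emotion units;
# the paper uses a learned sequence-to-sequence Transformer instead.
UNIT_TRANSLATION = {"Amused": {7: 7, 12: 3, 5: 5}}

def translate_units(units, target_emotion):
    table = UNIT_TRANSLATION[target_emotion]
    return [table.get(u, u) for u in units]

def predict_prosody(units, target_emotion):
    # Stand-in for the F0 predictor conditioned on units and target emotion.
    base = 180.0 if target_emotion == "Amused" else 120.0
    return [base + u for u in units]

def vocode(rep):
    # Stand-in for the neural vocoder that renders a waveform from the
    # predicted representations (units + prosody + speaker + emotion).
    return [u * f for u, f in zip(rep.content_units, rep.f0)]

def convert_emotion(src, target_emotion):
    units = translate_units(src.content_units, target_emotion)   # stage 1
    f0 = predict_prosody(units, target_emotion)                  # stage 2
    return vocode(DecomposedSpeech(units, f0, src.speaker, target_emotion))  # stage 3

src = DecomposedSpeech([7, 12, 5], [110.0, 115.0, 112.0], "spk0", "Neutral")
wav = convert_emotion(src, "Amused")
```

Because stage 1 operates on a discrete unit sequence, it can insert or delete units, which is what lets the approach add laughter or remove yawns rather than merely reshaping spectral parameters.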
- Use case(s)
- N/A
- Objectives of the article
- The study presents a novel approach to modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity, and rigorously evaluates the proposed method against current approaches, finding it superior in terms of perceived emotion and audio quality.
- Research question(s) / Hypotheses / Conclusions
- Research question: how can the perceived emotion of a speech utterance be modified while preserving the lexical content and speaker identity?
- Hypothesis: modifying the learned representations of a speech signal can effectively change the perceived emotion of the speech while preserving the lexical content and speaker identity.
- Conclusions: decomposing the speech signal into content, prosody, speaker, and emotion representations, and modifying these representations, effectively converts the perceived emotion. The proposed method outperforms prior approaches, and modeling non-verbal vocalizations such as laughter or sighs contributes substantially to the perceived emotion. The approach could be useful in applications such as films or video games, where characters need to express different emotions.
- Theoretical framework / Authors
- Speech processing, natural language processing : Vaswani et al. (2017); Polyak et al. (2021); Kharitonov et al. (2021a); Schuller et al. (2013); Fan et al. (2014); Wang et al. (2018)
- Key concepts
- Speech emotion conversion, Non-verbal communication cues, Spoken language translation
- Data collected (type, source)
- Dataset of emotional speech recordings, which were labeled with six different emotions (Neutral, Happy, Sad, Angry, Disgusted, and Sleepy)
- Definition of emotions
- Categorical emotions
- Scale of experimentation (data volume)
- 7000 speech utterances based on transcripts from the CMU Arctic Database
- Associated technologies
- HuBERT: a pre-trained neural network model for speech processing, used as a feature extractor in the emotion conversion model.
- Transformer: a neural network architecture that uses self-attention to process sequential data, used as the main architecture of the emotion conversion model.
- PyTorch: a popular open-source machine learning framework, used to implement the emotion conversion model and train it on the emotional speech dataset.
- Mention of ethics
- No
- Communicative purpose
- "We demonstrated how the proposed system is able to model expressive non-verbal vocalizations as well as generate high-quality expressive speech. We conclude with an ablation study and analysis of the different components composing our system. This study serves as the foundation for improving speech emotion conversion and building general textless expressive speech generation models."