A Ground-Truth-Free Framework for Validating Emotions in Generative AI Speech Synthesis
Date
2026
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Institute of Electrical and Electronics Engineers Inc.
Access Rights
info:eu-repo/semantics/openAccess
Abstract
Evaluating emotional expressivity in synthetic speech is challenging due to the absence of ground-truth affective labels and the reliance on costly human perceptual studies. This paper introduces a prototype-based framework that integrates affect-specialized Emotion2Vec embeddings with general-purpose acoustic and linguistic representations from WavLM to enable scalable and system-agnostic evaluation. Embeddings are projected into a shared latent space where each emotion category is represented by a learnable prototype, supporting both categorical classification and a continuous similarity-based metric, the Emotion Adherence Score (EAS). While categorical performance varied across systems, EAS remained consistently high, highlighting its robustness in capturing graded affective fidelity. On a 1,400-utterance corpus spanning four heterogeneous TTS systems, the proposed method achieved substantial improvements over a strong embedding baseline, increasing accuracy from 51.43% to 77.50% and macro-F1 from 0.5109 to 0.7736. A human listening study further supported EAS, which showed a moderate positive correlation with perceptual ratings. Overall, the proposed framework provides a principled and scalable approach for benchmarking emotional expressivity in TTS, bridging categorical and continuous perspectives and reducing reliance on ground-truth labels and large-scale listening tests.
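The abstract describes classifying utterances against learnable emotion prototypes in a shared latent space, with EAS as a continuous similarity to the intended emotion's prototype. The paper's exact projection network, prototype training, and EAS normalization are not given here, so the following is only a minimal sketch of that idea, assuming cosine similarity to frozen prototype vectors and a simple [-1, 1] to [0, 1] rescaling for the score; the prototype matrix is a random stand-in for learned parameters.

```python
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry"]

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Stand-in for learned per-emotion prototypes in the shared latent space
# (in the paper these would be trained jointly with the projection).
rng = np.random.default_rng(0)
prototypes = l2_normalize(rng.normal(size=(len(EMOTIONS), 16)))

def classify_and_eas(embedding, target_emotion):
    """Return (predicted emotion, EAS) for one projected utterance embedding.

    Categorical prediction: nearest prototype by cosine similarity.
    EAS (sketch): similarity to the *intended* emotion's prototype,
    rescaled from [-1, 1] to [0, 1].
    """
    z = l2_normalize(np.asarray(embedding, dtype=float))
    sims = prototypes @ z                      # cosine similarity to each prototype
    predicted = EMOTIONS[int(np.argmax(sims))]
    eas = float((sims[EMOTIONS.index(target_emotion)] + 1.0) / 2.0)
    return predicted, eas
```

By construction, an embedding lying exactly on a prototype is classified as that emotion with an EAS of 1.0; embeddings between prototypes still receive a graded score, which is what lets the metric capture partial affective fidelity where hard classification cannot.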
Description
Keywords
Emotion Adherence Score, Emotional Text-to-Speech, Prototype-based Learning, Self-supervised Speech Representations, Speech Emotion Recognition
Source
IEEE Access
WoS Q Value
Scopus Q Value
Q1