A Ground-Truth-Free Framework for Validating Emotions in Generative AI Speech Synthesis

Author: Özcan, Ahmet Remzi
Dates: 2026-02-08; 2026-02-08; 2026
DOI: https://doi.org/10.1109/ACCESS.2026.3656800
Handle: https://hdl.handle.net/20.500.12885/5294

Abstract: Evaluating emotional expressivity in synthetic speech is challenging due to the absence of ground-truth affective labels and the reliance on costly human perceptual studies. This paper introduces a prototype-based framework that integrates affect-specialized Emotion2Vec embeddings with general-purpose acoustic and linguistic representations from WavLM to enable scalable and system-agnostic evaluation. Embeddings are projected into a shared latent space where each emotion category is represented by a learnable prototype, supporting both categorical classification and a continuous similarity-based metric, the Emotion Adherence Score (EAS). While categorical performance varied across systems, EAS remained consistently high, highlighting its robustness in capturing graded affective fidelity. On a 1,400-utterance corpus spanning four heterogeneous TTS systems, the proposed method achieved substantial improvements over a strong embedding baseline, increasing accuracy from 51.43% to 77.50% and macro-F1 from 0.5109 to 0.7736. Human ratings further supported EAS, showing a moderate positive correlation with human judgments. Overall, the proposed framework provides a principled and scalable approach for benchmarking emotional expressivity in TTS, bridging categorical and continuous perspectives and reducing reliance on ground-truth labels and large-scale listening tests. © 2013 IEEE.

Language: en
Access rights: info:eu-repo/semantics/openAccess
Keywords: Emotion Adherence Score; Emotional Text-to-Speech; Prototype-based Learning; Self-supervised Speech Representations; Speech Emotion Recognition
Type: Article
Scopus ID: 2-s2.0-105028225336
Quartile: Q1
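
Illustrative sketch (not part of the record): the abstract describes scoring an utterance against per-emotion prototypes in a shared latent space, yielding both a categorical label and a continuous adherence score. The abstract does not give the EAS formula, so the version below is an assumption: it uses cosine similarity to the intended emotion's prototype, rescaled to [0, 1]. The prototypes here are fixed toy vectors rather than learned ones, the fusion of Emotion2Vec and WavLM embeddings is not shown, and the names `cosine` and `score_utterance` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score_utterance(embedding, prototypes, intended):
    """Return (predicted_label, adherence) for one utterance embedding.

    Assumption: adherence is the cosine similarity to the intended
    emotion's prototype, mapped from [-1, 1] to [0, 1]. The published
    EAS may be defined differently.
    """
    sims = {label: cosine(embedding, proto) for label, proto in prototypes.items()}
    predicted = max(sims, key=sims.get)          # categorical view
    adherence = (sims[intended] + 1.0) / 2.0     # continuous view
    return predicted, adherence


# Toy 2-D prototypes, purely for illustration.
toy_prototypes = {"happy": [1.0, 0.0], "sad": [0.0, 1.0]}
label, adherence = score_utterance([0.9, 0.1], toy_prototypes, intended="happy")
```

Under these assumptions, an utterance can be misclassified yet still receive a fairly high adherence value when its embedding lies close to the intended prototype, which is one way a continuous score can stay stable while categorical accuracy varies.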