A Ground-Truth-Free Framework for Validating Emotions in Generative AI Speech Synthesis
| dc.contributor.author | Özcan, Ahmet Remzi | |
| dc.date.accessioned | 2026-02-08T15:11:11Z | |
| dc.date.available | 2026-02-08T15:11:11Z | |
| dc.date.issued | 2026 | |
| dc.department | Bursa Teknik Üniversitesi | |
| dc.description.abstract | Evaluating emotional expressivity in synthetic speech is challenging due to the absence of ground-truth affective labels and the reliance on costly human perceptual studies. This paper introduces a prototype-based framework that integrates affect-specialized Emotion2Vec embeddings with general-purpose acoustic and linguistic representations from WavLM to enable scalable and system-agnostic evaluation. Embeddings are projected into a shared latent space where each emotion category is represented by a learnable prototype, supporting both categorical classification and a continuous similarity-based metric, the Emotion Adherence Score (EAS). While categorical performance varied across systems, EAS remained consistently high, highlighting its robustness in capturing graded affective fidelity. On a 1,400-utterance corpus spanning four heterogeneous TTS systems, the proposed method achieved substantial improvements over a strong embedding baseline, increasing accuracy from 51.43% to 77.50% and macro-F1 from 0.5109 to 0.7736. A perceptual study further supported EAS, which showed a moderate positive correlation with listener judgments. Overall, the proposed framework provides a principled and scalable approach for benchmarking emotional expressivity in TTS, bridging categorical and continuous perspectives and reducing reliance on ground-truth labels and large-scale listening tests. | |
| dc.identifier.doi | 10.1109/ACCESS.2026.3656800 | |
| dc.identifier.scopus | 2-s2.0-105028225336 | |
| dc.identifier.scopusquality | Q1 | |
| dc.identifier.uri | https://doi.org/10.1109/ACCESS.2026.3656800 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.12885/5294 | |
| dc.indekslendigikaynak | Scopus | |
| dc.language.iso | en | |
| dc.publisher | Institute of Electrical and Electronics Engineers Inc. | |
| dc.relation.ispartof | IEEE Access | |
| dc.relation.publicationcategory | Article - International Peer-Reviewed Journal - Institutional Faculty Member | |
| dc.rights | info:eu-repo/semantics/openAccess | |
| dc.snmz | Scopus_KA_20260207 | |
| dc.subject | Emotion Adherence Score | |
| dc.subject | Emotional Text-to-Speech | |
| dc.subject | Prototype-based Learning | |
| dc.subject | Self-supervised Speech Representations | |
| dc.subject | Speech Emotion Recognition | |
| dc.title | A Ground-Truth-Free Framework for Validating Emotions in Generative AI Speech Synthesis | |
| dc.type | Article |
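
The abstract's prototype-based scoring can be illustrated with a minimal sketch. This is not the paper's implementation: the prototype vectors, the exact projection into the shared latent space, and the precise EAS formula are not given in this record, so everything below is a hypothetical stand-in that assumes EAS is cosine similarity between an utterance embedding and the learnable prototype of the intended emotion, with categorical classification taken as the nearest prototype.

```python
import numpy as np

# Hypothetical emotion inventory; the paper's categories may differ.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

# Stand-in for learned prototypes in a shared latent space
# (in the paper these would be trained jointly with the projection).
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(len(EMOTIONS), 8))


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def classify(embedding: np.ndarray) -> str:
    """Categorical prediction: emotion of the nearest prototype."""
    sims = [cosine(embedding, p) for p in prototypes]
    return EMOTIONS[int(np.argmax(sims))]


def emotion_adherence_score(embedding: np.ndarray, target: str) -> float:
    """Continuous score in [-1, 1]: similarity to the intended
    emotion's prototype (assumed EAS definition)."""
    return cosine(embedding, prototypes[EMOTIONS.index(target)])


# Example: a synthetic-utterance embedding lying near the "happy" prototype
utt = prototypes[1] + 0.1 * rng.normal(size=8)
print(classify(utt))  # expected to be "happy" given the small perturbation
print(round(emotion_adherence_score(utt, "happy"), 3))
```

The point of the continuous score is visible here: two utterances can both be classified "happy" while receiving different EAS values, which is how the framework captures graded affective fidelity rather than a hard label alone.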