Open Voice Cloning Leaderboard

The Open Voice Cloning Leaderboard ranks and evaluates voice cloning models across diverse datasets, including emotional speech.
It also provides an in-depth analysis of how different acoustic features shape the final results.

Each score is the cosine similarity between the speaker embeddings of the original and cloned samples, where the embeddings are generated by the WavLM model. Higher values indicate a closer match to the original speaker.
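As a minimal sketch of the metric itself, the snippet below computes the cosine similarity between two vectors in pure Python. The two short vectors are hypothetical placeholders, not real WavLM output; in the actual evaluation the inputs would be speaker embeddings extracted from the original and cloned audio, and the function name `cosine_similarity` is an illustrative choice rather than part of the leaderboard's software.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for speaker embeddings (assumption: real
# embeddings would come from a WavLM-based speaker model and be much longer).
original_embedding = [0.2, -0.1, 0.7, 0.4]
cloned_embedding = [0.25, -0.05, 0.6, 0.5]

score = cosine_similarity(original_embedding, cloned_embedding)
print(f"similarity: {score:.4f}")
```

The result lies in the range [-1, 1]: identical directions give 1, orthogonal vectors give 0, and opposite directions give -1.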


Rank	Model	Average
1	coqui/XTTS-v2	0.8356	0.8881	0.806	0.8539	0.8135	0.8167
2	microsoft/speecht5_vc	0.8298	0.9099	0.7618	0.8265	0.7987	0.8521
3	Plachtaa/VALL-E-X	0.7862	0.901	0.7412	0.7382	0.7674	0.7832
4	WhisperSpeech/WhisperSpeech	0.7837	0.9014	0.7284	0.6972	0.7725	0.8188
5	OuteAI/OuteTTS-0.2-500M	0.7499	0.8836	0.7359	0.7696	0.5394	0.8207
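The first score column in each row is numerically consistent with the unweighted mean of the remaining score columns. This relationship is inferred from the numbers in the table, not stated by the source. A short check using the coqui/XTTS-v2 row (variable names are illustrative):

```python
from statistics import mean

# Scores copied from the coqui/XTTS-v2 row of the table above:
# the first value is the leading score column, the rest are the other columns.
reported_average = 0.8356
other_scores = [0.8881, 0.806, 0.8539, 0.8135, 0.8167]

# Assumption: the leading column is a plain (unweighted) mean, rounded to 4 places.
computed_average = round(mean(other_scores), 4)

print(computed_average, reported_average)
```

Some rows differ from the recomputed mean by 0.0001, presumably because the published averages were computed from unrounded scores.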


Rank	Model	Average
1	coqui/XTTS-v2	0.8098	0.7962	0.8459	0.8201	0.777
2	microsoft/speecht5_vc	0.7923	0.7834	0.8128	0.7483	0.8247
3	Plachtaa/VALL-E-X	0.7623	0.7488	0.7683	0.7645	0.7678
4	WhisperSpeech/WhisperSpeech	0.7462	0.7305	0.661	0.7899	0.8034
5	OuteAI/OuteTTS-0.2-500M	0.7197	0.7613	0.7815	0.5696	0.7664


Rank	Model	Average
1	WhisperSpeech/WhisperSpeech	0.5943	0.5863	0.3993	0.5315	0.6963	0.7579
2	Plachtaa/VALL-E-X	0.5833	0.5818	0.5619	0.4485	0.6516	0.6725
3	OuteAI/OuteTTS-0.2-500M	0.5659	0.6094	0.5131	0.526	0.4617	0.7195
4	coqui/XTTS-v2	0.541	0.5287	0.4238	0.3917	0.658	0.703
5	microsoft/speecht5_vc	0.5274	0.5278	0.396	0.309	0.5758	0.8283

@misc{christop2025clonevalopenvoicecloning,
    title={{ClonEval: An Open Voice Cloning Benchmark}}, 
    author={Iwona Christop and Tomasz Kuczyński and Marek Kubis},
    year={2025},
    eprint={2504.20581},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2504.20581}, 
}

@article{crema-d,
    author={Cao, Houwei and Cooper, David G. and Keutmann, Michael K. and Gur, Ruben C. and Nenkova, Ani and Verma, Ragini},
    journal={IEEE Transactions on Affective Computing},
    title={{CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset}},
    year={2014},
    volume={5},
    number={4},
    pages={377--390},
    doi={10.1109/TAFFC.2014.2336244},
}

@inproceedings{librispeech2015,
    author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
    booktitle={2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
    title={{Librispeech: An ASR corpus based on public domain audio books}}, 
    year={2015},
    pages={5206--5210},
    keywords={Resource description framework;Genomics;Bioinformatics;Blogs;Information services;Electronic publishing;Speech Recognition;Corpus;LibriVox},
    doi={10.1109/ICASSP.2015.7178964}
}

@article{ravdess,
    doi={10.1371/journal.pone.0196391},
    author={Livingstone, Steven R. AND Russo, Frank A.},
    journal={PLOS ONE},
    publisher={Public Library of Science},
    title={{The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English}},
    year={2018},
    month=may,
    volume={13},
    URL={https://doi.org/10.1371/journal.pone.0196391},
    pages={1--35},
    number={5},
}

@inbook{savee,
    author={Haq, S. and Jackson, P. J. B.},
    booktitle={{Machine Audition: Principles, Algorithms and Systems}},
    title={{Multimodal Emotion Recognition}},
    publisher={IGI Global},
    address={Hershey PA},
    year={2010},
    month=aug,
    editor={Wang, W.},
    pages={398--423},
}

@misc{tess,
    author={Pichora-Fuller, M. Kathleen and Dupuis, Kate},
    publisher={Borealis},
    title={{Toronto emotional speech set (TESS)}},
    year={2020},
    version={DRAFT VERSION},
    doi={10.5683/SP2/E8H2MF},
    URL={https://doi.org/10.5683/SP2/E8H2MF},
}
