Open Voice Cloning Leaderboard
The Open Voice Cloning Leaderboard ranks and evaluates voice cloning models across diverse datasets, including emotional speech.
It also provides an in-depth analysis of how different acoustic features shape the final results.
The results represent the cosine similarity between the speaker embeddings of the original and cloned samples, generated by the WavLM model.
1 | 0.8356 | 0.8881 | 0.7618 | 0.8539 | 0.8135 | 0.8167 |
The results represent the cosine similarity between the speaker embeddings of the original and cloned samples, generated by the WavLM model. The values can be filtered by dataset or emotional state.
1 | 0.8098 | 0.7962 | 0.8459 | 0.8201 | 0.8247 |
1 | 0.8098 | 0.7962 | 0.8459 | 0.8201 | 0.777 |
2 | 0.7923 | 0.7834 | 0.8128 | 0.7483 | 0.8247 |
3 | 0.7623 | 0.7488 | 0.7683 | 0.7645 | 0.7678 |
4 | 0.7462 | 0.7305 | 0.661 | 0.7899 | 0.8034 |
5 | 0.7197 | 0.7613 | 0.7815 | 0.5696 | 0.7664 |
The results represent the cosine similarity between the values of selected acoustic features of the original and cloned samples. The values can be filtered by dataset or emotional state.
1 | 0.5943 | 0.5863 | 0.3993 | 0.5315 | 0.6963 | 0.7579 |
2 | 0.5833 | 0.5818 | 0.5619 | 0.4485 | 0.6516 | 0.6725 |
3 | 0.5659 | 0.6094 | 0.5131 | 0.526 | 0.4617 | 0.7195 |
4 | 0.541 | 0.5287 | 0.4238 | 0.3917 | 0.658 | 0.703 |
5 | 0.5274 | 0.5278 | 0.396 | 0.309 | 0.5758 | 0.8283 |
📝 About
The Open Voice Cloning Leaderboard is part of the ClonEval benchmark. In addition to the leaderboard, the benchmark includes:
- a deterministic evaluation protocol that sets defaults for data, metrics, and models to be used in the voice cloning assessment process,
- an open-source software library that can be used to evaluate voice cloning models in a reproducible manner.
Evaluation Procedure
The evaluation procedure has two stages. In the first stage, samples are generated with the voice cloning model. The model must accept two inputs: a reference recording of the voice to be cloned and the text of the utterance to synthesize.
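The generation stage can be pictured with a minimal sketch. Everything named here is hypothetical and chosen only for illustration: the `Audio` alias (raw audio as a list of floats), the `SamplePair` container, the `generate_pairs` helper, and the `echo_model` placeholder are not part of the ClonEval library, whose actual interface may differ.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Hypothetical representation: raw audio as a list of float samples.
Audio = List[float]

# Assumption: a cloning model is any callable that maps
# (reference audio, utterance text) to generated audio.
CloningModel = Callable[[Audio, str], Audio]


@dataclass
class SamplePair:
    """A reference recording together with the sample cloned from it."""
    reference: Audio
    generated: Audio
    text: str


def generate_pairs(
    model: CloningModel, items: Iterable[Tuple[Audio, str]]
) -> List[SamplePair]:
    """Run the model once per (reference audio, text) item."""
    return [
        SamplePair(reference=ref, generated=model(ref, text), text=text)
        for ref, text in items
    ]


def echo_model(reference: Audio, text: str) -> Audio:
    """Placeholder model that returns a copy of the reference audio."""
    return list(reference)
```

Keeping each reference next to its generated sample makes the later pairwise comparison straightforward.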
In the second stage, the generated samples are evaluated. Speaker embeddings are obtained with the WavLM model, and for each pair of samples (reference and generated) the cosine similarity between their speaker embeddings is calculated. For fine-grained error analysis, we also extract acoustic features from each sample with Librosa and calculate the cosine similarity between the feature values of the reference and generated samples. The similarity values obtained on all samples from a given dataset are averaged to obtain the final evaluation result.
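The scoring step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: embeddings (or acoustic feature vectors) are assumed to be already extracted and are represented as plain lists of floats, the WavLM and Librosa extraction steps are not shown, and the function names `cosine_similarity` and `dataset_score` are hypothetical rather than taken from the ClonEval library.

```python
import math
from statistics import mean
from typing import Iterable, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat an all-zero vector as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def dataset_score(pairs: Iterable[Tuple[Vector, Vector]]) -> float:
    """Average the per-pair similarities over one dataset.

    Each pair is (reference vector, generated vector).
    Raises statistics.StatisticsError if there are no pairs.
    """
    return mean(cosine_similarity(ref, gen) for ref, gen in pairs)
```

For example, `dataset_score([([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])])` averages one identical pair (similarity 1.0) and one orthogonal pair (similarity 0.0), giving 0.5.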
Software Library
The code for the evaluation procedure is available in the GitHub repository.
✉️✨ Submit Your Model Here! ✨✉️
Help us improve the leaderboard by submitting your voice cloning model.
📌 How to Submit Your Model:
✉️ Step 1: Send an email to cloneval@csi.wmi.amu.edu.pl.
🔗 Step 2: Include the link to your voice cloning model.
🏆 Step 3: Once evaluated, your model will join the leaderboard.
Thanks for sharing your work with us and making this project even better!