• KSII Transactions on Internet and Information Systems
    Monthly Online Journal (eISSN: 1976-7277)

VQCPC+: An Effective Zero-Shot Cross-Lingual Voice Conversion Method

Vol. 19, No. 12, December 31, 2025
DOI: 10.3837/tiis.2025.12.015

Abstract

In recent years, voice conversion has attracted extensive research attention, yet zero-shot and cross-lingual voice conversion still have shortcomings. The difficulty in zero-shot conversion is that speaker embedding features are hard to extract accurately because the solution space is unknown. The main problem in cross-lingual conversion is insufficient disentanglement ability. Both lead to poor similarity and quality of the converted speech. In this paper, we propose the VQCPC+ model to improve zero-shot cross-lingual voice conversion. The vector quantization (VQ) encoder provides a large latent vector space in which to search for the optimal solution. Contrastive predictive coding (CPC) constrains this latent vector space in a targeted way by predicting time series, thereby helping VQ extract features more accurately. Further, we introduce signal preprocessing to physically reduce the degree of entanglement between speaker features and speech content. We use the WORLD vocoder, which applies low-pass filters in different frequency bands to the original signal, to neutralize the pitch. In addition to WORLD, we use vocal tract length perturbation (VTLP), which randomly distorts the spectrum to remove the influence of channel differences, to blur the timbre. Subjective experiments show that the average similarity and quality of the converted speech increase by 11.9% and 8.6%, respectively. Objective experiments show that the average Mel-Cepstral Distance decreases by 4.7%. The evaluation of disentanglement ability shows that the proposed model separates the content more accurately.
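For readers unfamiliar with the objective metric named in the abstract, the following is a minimal sketch of the commonly used Mel-Cepstral Distance formula, MCD = (10 / ln 10) · sqrt(2 · Σ_d (c_d − c′_d)²), averaged over frames. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `mel_cepstral_distance` is hypothetical, the two sequences are assumed to be already time-aligned, and excluding the 0th (energy) coefficient is a common convention that the abstract does not confirm.

```python
import numpy as np


def mel_cepstral_distance(ref, conv):
    """Mean Mel-Cepstral Distance in dB between two aligned sequences.

    ref, conv: array-likes of shape (frames, dims) holding mel-cepstral
    coefficients. Column 0 (energy) is excluded, which is an assumed
    convention rather than something stated in the source.
    """
    ref = np.asarray(ref, dtype=float)
    conv = np.asarray(conv, dtype=float)
    if ref.ndim != 2 or ref.shape != conv.shape:
        raise ValueError("inputs must be 2-D, time-aligned, and equal in shape")

    # Per-frame squared difference over coefficients 1..D-1.
    diff = ref[:, 1:] - conv[:, 1:]
    sq_sum = np.sum(diff ** 2, axis=1)

    # Standard scaling constant converts the Euclidean distance to dB.
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * sq_sum)
    return float(np.mean(per_frame))
```

Lower values mean the converted cepstra are closer to the reference, which is why the abstract reports a decrease in this metric as an improvement.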



Cite this article

[IEEE Style]
N. Zhai, Y. Feng, L. Song, J. Li, "VQCPC+: An Effective Zero-Shot Cross-Lingual Voice Conversion Method," KSII Transactions on Internet and Information Systems, vol. 19, no. 12, pp. 4481-4502, 2025. DOI: 10.3837/tiis.2025.12.015.

[ACM Style]
Naibo Zhai, Yanru Feng, Lipeng Song, and Jing Li. 2025. VQCPC+: An Effective Zero-Shot Cross-Lingual Voice Conversion Method. KSII Transactions on Internet and Information Systems, 19, 12, (2025), 4481-4502. DOI: 10.3837/tiis.2025.12.015.

[BibTeX Style]
@article{tiis:105409, title="VQCPC+: An Effective Zero-Shot Cross-Lingual Voice Conversion Method", author="Naibo Zhai and Yanru Feng and Lipeng Song and Jing Li", journal="KSII Transactions on Internet and Information Systems", DOI={10.3837/tiis.2025.12.015}, volume={19}, number={12}, year="2025", month={December}, pages={4481-4502}}