NVC-Net: End-to-End Adversarial Voice Conversion
Voice conversion (VC) has gained increasing popularity in many speech synthesis applications. The idea is to change the voice identity from one speaker to another while keeping the linguistic content unchanged. Many VC approaches rely on a vocoder to reconstruct speech from acoustic features, so the speech quality depends heavily on that vocoder. In this paper, we propose NVC-Net, an end-to-end adversarial network that performs VC directly on the raw audio waveform. By disentangling the speaker identity from the speech content, NVC-Net is able to perform traditional non-parallel many-to-many VC as well as zero-shot VC from a short utterance of an unseen target speaker. Importantly, NVC-Net is non-autoregressive and fully convolutional, which makes inference fast. Objective and subjective evaluations on VC tasks show that NVC-Net obtains competitive results with significantly fewer parameters.
Related Publications
AutoTTS: End-to-End Text-to-Speech Synthesis through Differentiable Duration Modeling
Bac Nguyen, Fabien Cardinaux, Stefan Uhlich
Parallel text-to-speech (TTS) models have recently enabled fast and highly-natural speech synthesis. However, […]
Improving Self-Supervised Learning for Audio Representations by Feature Diversity and Decorrelation
Bac Nguyen, Stefan Uhlich, Fabien Cardinaux
Self-supervised learning (SSL) has recently shown remarkable results in closing the gap between supervised and […]