VQMIVC - Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion

Overview

VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion (Interspeech 2021)


Run VQMIVC on Replicate

Integrated into Hugging Face Spaces with Gradio. See the Gradio Web Demo.

Pre-trained models: google-drive or here | Paper demo

This paper proposes a speech representation disentanglement framework for one-shot/any-to-any voice conversion, which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference. Vector quantization with contrastive predictive coding (VQCPC) is used for content encoding and mutual information (MI) is introduced as the correlation metric during training, to achieve proper disentanglement of content, speaker and pitch representations, by reducing their inter-dependencies in an unsupervised manner.
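As a rough illustration of the vector quantization step mentioned above, the sketch below maps each frame-level feature vector to its nearest codeword. It is a toy, pure-Python example: the function name `quantize`, the 2-dimensional codebook, and the frame values are all made up for illustration and are not taken from the repository's content encoder.

```python
def quantize(frames, codebook):
    """Return, for each frame, the index of the nearest codeword (squared L2 distance)."""
    indices = []
    for frame in frames:
        dists = [sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


# Hypothetical toy codebook with three 2-dimensional codewords.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
# Hypothetical frame-level features, one vector per frame.
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

codes = quantize(frames, codebook)
print(codes)
```

Replacing continuous frames with discrete codeword indices is what lets the content branch discard fine-grained speaker and pitch detail; the real model does this on learned embeddings rather than hand-written vectors.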

📢 Update

Many thanks to ericguizzo & AK391!

  1. A Replicate demo is available online, so you can try our pre-trained models there. Have fun!
  2. VQMIVC can now be trained and tested inside a Docker environment via Cog.
  3. A Gradio Web Demo is also available as another online demo.

TODO

  • Add more details on how to use Cog for development

Requirements

Python 3.6 is used. Installing apex is optional and speeds up training. Other requirements are listed in 'requirements.txt':

pip install -r requirements.txt

Quick start with pre-trained models

ParallelWaveGAN is used as the vocoder, so please install ParallelWaveGAN first. You can then try the pre-trained models:

python convert_example.py -s {source-wav} -r {reference-wav} -c {converted-wavs-save-path} -m {model-path} 

For example:

python convert_example.py -s test_wavs/p225_038.wav -r test_wavs/p334_047.wav -c converted -m checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/VQMIVC-model.ckpt-500.pt 

The converted wav is saved in the 'converted' directory.
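If you want to script the conversion, the command above can be assembled in Python. This is a minimal sketch: the helper name `build_convert_command` is hypothetical, and only the four flags shown in this README (`-s`, `-r`, `-c`, `-m`) are used.

```python
import shlex


def build_convert_command(source_wav, reference_wav, out_dir, model_path):
    """Assemble the convert_example.py call documented in this README."""
    return [
        "python", "convert_example.py",
        "-s", source_wav,
        "-r", reference_wav,
        "-c", out_dir,
        "-m", model_path,
    ]


cmd = build_convert_command(
    "test_wavs/p225_038.wav",
    "test_wavs/p334_047.wav",
    "converted",
    "checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/VQMIVC-model.ckpt-500.pt",
)
# Print the command; to actually run it, pass `cmd` to subprocess.run(cmd, check=True).
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the arguments as a list avoids shell-quoting problems when paths contain spaces.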

Training and inference:

  • Step1. Data preparation & preprocessing
  1. Put VCTK corpus under directory: 'Dataset/'

  2. Training/testing speakers split & feature (mel+lf0) extraction:

     python preprocess.py
    
  • Step2. Model training:
  1. Training with mutual information minimization (MIM):

     python train.py use_CSMI=True use_CPMI=True use_PSMI=True
    
  2. Training without MIM:

     python train.py use_CSMI=False use_CPMI=False use_PSMI=False 
    
  • Step3. Model testing:
  1. Put PWG vocoder under directory: 'vocoder/'

  2. Inference with model trained with MIM:

     python convert.py checkpoint=checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/model.ckpt-500.pt
    
  3. Inference with model trained without MIM:

     python convert.py checkpoint=checkpoints/useCSMIFalse_useCPMIFalse_usePSMIFalse_useAmpTrue/model.ckpt-500.pt
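The checkpoint directories in the commands above appear to encode the training flags in their names. The sketch below reproduces that naming pattern so the right `checkpoint=` path can be derived from the flags used for training. The helper name `checkpoint_dir` is hypothetical, and the `useAmp` default of `True` is an assumption inferred only from the two paths shown above.

```python
def checkpoint_dir(use_csmi, use_cpmi, use_psmi, use_amp=True, root="checkpoints"):
    """Build a directory name following the pattern seen in this README's example paths."""
    return (
        f"{root}/useCSMI{use_csmi}_useCPMI{use_cpmi}"
        f"_usePSMI{use_psmi}_useAmp{use_amp}"
    )


with_mim = checkpoint_dir(True, True, True)
without_mim = checkpoint_dir(False, False, False)
print(f"python convert.py checkpoint={with_mim}/model.ckpt-500.pt")
print(f"python convert.py checkpoint={without_mim}/model.ckpt-500.pt")
```

If your training run wrote checkpoints elsewhere, check the actual directory on disk rather than relying on this pattern.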
    

Citation

If the code is used in your research, please Star our repo and cite our paper:

@inproceedings{wang21n_interspeech,
  author={Disong Wang and Liqun Deng and Yu Ting Yeung and Xiao Chen and Xunying Liu and Helen Meng},
  title={{VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1344--1348},
  doi={10.21437/Interspeech.2021-283}
}

Acknowledgements:

  • The content encoder is borrowed from VectorQuantizedCPC, which also inspired the within-utterance negative sampling for CPC;
  • The speaker encoder is borrowed from AdaIN-VC;
  • The decoder is modified from AutoVC;
  • Estimation of mutual information is modified from CLUB;
  • Speech feature extraction is based on espnet and Pyworld.
Comments
  • Vocoder issue during inference

    Hi Sir,

    First of all, thank you for sharing.

    I have run into an issue during inference, shown below:

    Traceback (most recent call last):
      File "convert.py", line 201, in <module>
        convert(config)
      File "convert.py", line 194, in convert
        subprocess.call(cmd)
      File "/home/tts/xxxx/softWare/miniConda/miniconda3/envs/ft_tts/lib/python3.6/subprocess.py", line 287, in call
        with Popen(*popenargs, **kwargs) as p:
      File "/home/tts/xxxx/softWare/miniConda/miniconda3/envs/ft_tts/lib/python3.6/subprocess.py", line 729, in __init__
        restore_signals, start_new_session)
      File "/home/tts/xxxx/softWare/miniConda/miniconda3/envs/ft_tts/lib/python3.6/subprocess.py", line 1364, in _execute_child
        raise child_exception_type(errno_num, err_msg, err_filename)
    PermissionError: [Errno 13] Permission denied: 'parallel-wavegan-decode'

    What can I do to solve this problem? I have already put the pre-trained vocoder in the vocoder directory.

    (tts) [[email protected] VQMIVC]$ ll vocoder/
    total 4
    lrwxrwxrwx 1 xxxx xxxx 53 Jun 25 10:50 checkpoint-3000000steps.pkl -> ../pretrain_model/vocoder/checkpoint-3000000steps.pkl
    lrwxrwxrwx 1 xxxx xxxx 36 Jun 25 10:50 config.yml -> ../pretrain_model/vocoder/config.yml
    -rw-r--r-- 1 xxxx xxxx 39 Jun 24 17:53 README.md
    lrwxrwxrwx 1 xxxx xxxx 34 Jun 25 10:50 stats.h5 -> ../pretrain_model/vocoder/stats.h5

    opened by TaoTaoFu 14
  • How to solve this problem?

    Dear PhD Wang: When I run convert.py, I hit the error below and cannot solve it. Could you give me some suggestions? Thank you very much!

    Error:

    Traceback (most recent call last):
      File "convert.py", line 168, in convert
        '--feats-scp', f'{str(out_dir)}/feats.1.scp', '--outdir', str(out_dir)])
      File "/home/liyp/anaconda3/envs/xll/lib/python3.6/subprocess.py", line 287, in call
        with Popen(*popenargs, **kwargs) as p:
      File "/home/liyp/anaconda3/envs/xll/lib/python3.6/subprocess.py", line 729, in __init__
        restore_signals, start_new_session)
      File "/home/liyp/anaconda3/envs/xll/lib/python3.6/subprocess.py", line 1364, in _execute_child
        raise child_exception_type(errno_num, err_msg, err_filename)
    FileNotFoundError: [Errno 2] No such file or directory: 'parallel-wavegan-decode': 'parallel-wavegan-decode'

    opened by Hu-chengyang 9
  • preprocess issue


    After downloading the VCTK corpus and copying the files under /Dataset (and creating a directory '/Dataset/VCTK-Corpus/' to hold the file speaker-info.txt), I ran preprocess.py and got the following result. How can I fix this?

    (voice-clone) C:\Python\VQMIVC>python preprocess.py
    all_spks: ['257', '294', '304', '297', '226', '282', '247', '330', '361', '252', '293', '306', '340', '231', '268', '283', '243', '334', '315', '269', '285', '310', '230', '311', '374', '307', '286', '323', '245', '227', '239', '240', '363', '284', '251', '318', '246', '265', '244', '228', '333', '276', '255', '225', '308', '260', '339', '312', '336', '347', '345', '258', '335', '270', '376', '237', '316', '326', '364', '273', '263', '259', '267', '292', '232', '229', '254', '264', '287', '278', '236', '317', '272', '233', '234', '248', '249', '305', '299', '281', '302', '329', '262', '351', '288', '298', '250', '343', '256', '300', '275', '341', '279', '277', '271', '241', '303', '274', '313', '266', '301', '253', '261', '314', '295', '360', '362', '238']
    len(spk_wavs): 0
    len(spk_wavs): 0
    len(spk_wavs): 0
    . . .
    len(spk_wavs): 0
    len(spk_wavs): 0
    len(spk_wavs): 0
    0 0 0
    extract log-mel...
    0it [00:00, ?it/s]
    normalize log-mel...
    Traceback (most recent call last):
      File "preprocess.py", line 141, in <module>
        mels = np.concatenate(mels, 0)
      File "<array_function internals>", line 6, in concatenate
    ValueError: need at least one array to concatenate

    opened by Chuk101 8
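The log above shows every speaker with zero wavs, so the final `np.concatenate` receives an empty list. A quick way to narrow this down is to count which audio files actually exist under the dataset directory before running preprocessing. The sketch below is a hypothetical helper (`count_audio_files` is not part of the repository); the directory name 'Dataset' comes from the README, while the extension list is an assumption based on another report that the script looks for *.flac files.

```python
from collections import Counter
from pathlib import Path


def count_audio_files(dataset_root, extensions=(".flac", ".wav")):
    """Count audio files per extension under dataset_root (searched recursively)."""
    root = Path(dataset_root)
    counts = Counter()
    if not root.is_dir():
        return counts  # Missing directory: nothing to count.
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in extensions:
            counts[path.suffix.lower()] += 1
    return counts


counts = count_audio_files("Dataset")
if not counts:
    print("No audio files found under 'Dataset'; check the corpus location.")
else:
    print(dict(counts))
```

If only `.wav` files are reported while the script expects `.flac` (or the reverse), the file pattern or the corpus version is the likely cause of the empty speaker lists.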
  • Question About Batch Size, number of Epochs and Learning Rate


    Hi @Wendison, I've already trained some models (with VCTK subsets and external speakers) and noticed that a bigger batch size doesn't necessarily result in better audio quality for the same 500 epochs. In some cases, audio quality can be worse (for male references). My question is:

    Do you have any report or experiments with different Batch Sizes, Number of Epochs (Why 500 and not 600 or more), and Learning Rates for different batch sizes?

    If not, what advice could you provide regarding the Batch Size and the number of Epochs? The bigger the better?

    For complex data like this there should be an improvement on bigger batches, but learning rate or number of epochs should be tuned.

    Thank You.

    opened by jlmarrugom 6
  • Question about the model

    I tried to train the model again. After training finished, I used my own trained model for voice conversion, but I got nothing. Could you give me some advice? I have followed everything in the README.

    opened by Mike66666 4
  • Add Docker environment & web demo


    Hey @Wendison! 👋

    I really liked your implementation and it works very well with any kind of voice! Really funny :)

    This pull request makes it possible to run your model inside a Docker environment, which makes it easier for other people to run it. We're using an open source tool called Cog to make this process easier.

    This also means we can make a web page where other people can try out your model! View it here: https://replicate.ai/wendison/vqmivc

    Claim your page here so you can edit it, and we'll feature it on our website and tweet about it too.

    In case you're wondering who I am, I'm from Replicate, where we're trying to make machine learning reproducible. We got frustrated that we couldn't run all the really interesting ML work being done. So, we're going round implementing models we like. 😊

    opened by ericguizzo 4
  • The CPCLoss


    I read the related papers, but still do not understand the CPC loss computation.

        labels = torch.zeros(
            self.n_speakers_per_batch * self.n_utterances_per_speaker, length,
            dtype=torch.long, device=z.device
        )
    
        loss = F.cross_entropy(f, labels)
    

    Can someone explain it to me? Why are labels of zeros and cross_entropy used here?

    opened by Liujingxiu23 4
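One common reading of all-zero labels in a contrastive (InfoNCE-style) loss is that, for each prediction, the score of the true future frame is placed at index 0 and the scores of the negative samples follow it. Cross-entropy with label 0 then equals the negative log-probability of picking the positive among all candidates. The sketch below illustrates that idea in plain Python; it assumes the positive-at-index-0 layout, which should be checked against how `f` is built in the encoder code, and `cross_entropy` here is a small stand-in rather than `F.cross_entropy` itself.

```python
import math


def cross_entropy(scores, label):
    """Compute -log softmax(scores)[label] with the usual max-shift for stability."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[label]


# Assumed layout: index 0 holds the positive score, the rest are negatives.
positive_scored_high = [5.0, 0.1, -0.3, 0.2]
negative_scored_high = [0.1, 5.0, -0.3, 0.2]

loss_good = cross_entropy(positive_scored_high, 0)
loss_bad = cross_entropy(negative_scored_high, 0)
print(loss_good, loss_bad)
```

The loss is small when the positive outscores the negatives and large otherwise, so a constant label of 0 is enough: the "class" being predicted is simply the position of the positive sample.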
  • NameError: name 'amp' is not defined. File "train.py", line 407, in train_model

    I am getting the error below.

    File "train.py", line 407, in train_model optimizer, optimizer_cs_mi_net, optimizer_ps_mi_net, optimizer_cp_mi_net, scheduler, amp, epoch, checkpoint_dir, cfg) NameError: name 'amp' is not defined

    opened by geni120 3
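Since apex is listed as optional in the requirements, a `NameError` on `amp` usually means the name was never bound because apex is not installed. A common defensive pattern is to bind `amp` to `None` when the import fails and to branch on that later. The sketch below shows the pattern only; it is not the repository's train.py, and the helper `amp_is_available` is hypothetical.

```python
try:
    # apex is optional; it is only needed for mixed-precision training.
    from apex import amp
except ImportError:
    amp = None  # Keep the name defined so later references do not raise NameError.


def amp_is_available():
    """Return True when apex.amp was imported successfully."""
    return amp is not None


if amp_is_available():
    print("apex found: mixed-precision training can be enabled")
else:
    print("apex not found: train in full precision or install apex")
```

With this guard, code that passes `amp` around (for example when saving checkpoints) receives `None` instead of failing, and can skip the mixed-precision state.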
  • Improper converted audio when source = reference


    Hi, I tried using python convert_example.py -s test_wavs/jane3.wav -r test_wavs/jane3.wav -c converted -m checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/VQMIVC-model.ckpt-500.pt to check the results when the source and reference audio are the same, but the output is mostly silent. Am I missing something? To reproduce the results, the audio files and vocoder are uploaded here:

    Source and reference: https://drive.google.com/file/d/1bPAQ9UaKJF1gNNCtkeDmySxLv_uXW1HN/view?usp=sharing Converted: https://drive.google.com/file/d/1TmxjpHx3WY3nKRwy5lz04LWfKAo69qwW/view?usp=sharing CC: @Wendison

    opened by vishalbhavani 3
  • What is "parallel-wavegan-decode" in cmd = ['parallel-wavegan-decode', '--checkpoint',...]? Is it a folder?

    Thanks for your code, but I have a problem. In the code cmd = ['parallel-wavegan-decode', '--checkpoint',...], is 'parallel-wavegan-decode' a folder? If so, what does this folder contain? My system told me it couldn't be found.

    opened by DIO385 2
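`parallel-wavegan-decode` is not a folder: it is a command-line program that becomes available once the ParallelWaveGAN package is installed in the active environment, and convert.py launches it through `subprocess`. Both the "No such file or directory" and "Permission denied" errors reported above point at that command not being runnable from the current PATH. The sketch below is a small diagnostic under that assumption; the helper name `diagnose_command` is hypothetical.

```python
import os
import shutil


def diagnose_command(name):
    """Report whether `name` is runnable, present but not executable, or missing."""
    exe = shutil.which(name)
    if exe is not None:
        return f"ok: {exe}"
    # shutil.which skips non-executable files, so look for those separately.
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if directory and os.path.isfile(candidate):
            return f"found but not executable: {candidate}"
    return f"not found on PATH: {name}"


print(diagnose_command("parallel-wavegan-decode"))
```

If the command is reported as missing, install ParallelWaveGAN into the same environment that runs convert.py; if it is present but not executable, check the file's permissions.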
  • Where can I get the silence trimmed VCTK corpus?

    Hi,

    Thank you for sharing your code! I wonder where I can get the silence trimmed VCTK corpus. The VCTK dataset I have only contains *.wav files, while your preprocess.py script seems to expect all audio files in *.flac format, so I cannot run the script.

    opened by Aria-K-Alethia 2
  • Voice conversion does not happen after fine-tuning the pretrained model

    Hi @Wendison

    Thank you so much for this great work.

    I fine-tuned (resumed from) the pretrained model (use_CSMI=True use_CPMI=True use_PSMI=True) with the indicTTS dataset (20 speakers, each with 1 hour of audio).

    The model was trained for 1000 epochs.

    Quality gets better for the target speaker, but the source speaker's modulation is not converted.

    Can you please give your suggestions?

    Thanks

    opened by MuruganR96 0
  • Training for Indian Multi-Speaker/Multi-lingual VC


    Hi @Wendison, thank you so much for your excellent work. Very nice paper.

    Your replies on the issues below motivated me to go further.

    https://github.com/Wendison/VQMIVC/issues/14#issuecomment-937900528

    https://github.com/Wendison/VQMIVC/issues/17#issuecomment-971136691

    I am trying to train on Common Voice Indian English multi-speaker data together with VCTK. I need a few suggestions from you.

    Steps:

    1. I add the Common Voice Indian English multi-speaker data (40 speakers, each with 30 minutes of audio) to the 109 VCTK speakers and start training with use_CSMI=True use_CPMI=True use_PSMI=True.

    2. After the model is trained to good accuracy, I will fine-tune it on other Indian regional languages from Common Voice (Tamil, Hindi, Urdu, etc.).

    Is this approach good?

    @Wendison, I would appreciate your suggestions. Thanks.

    opened by MuruganR96 0
  • What do z_dim and c_dim stand for?


    Dear PhD: Could you tell me what z_dim: 64 and c_dim: 256 in config/model/default stand for? And what does n_embeddings: 512 in config/model/default stand for? Thank you very much.

    opened by Hu-chengyang 4
  • Training Loss Abnormal


    @andreasjansson @Wendison Hello, sorry to bother you! I'm new to voice models. I have trained the model on the VCTK-Corpus-0.92.zip dataset with "python3 train.py use_CSMI=True use_CPMI=True use_PSMI=True" on an NVIDIA V100S. After 65 epochs, the training losses are as follows: (image) Could you give me some advice? Thank you very much!

    opened by Haoyanlong 3
  • lf0 question about convert phase


    Hi, I wonder why you normalize the f0 series before feeding it to the f0 encoder in convert.py, given that this kind of f0 normalization isn't used in the preprocessing phase.

    opened by powei-C 3
  • How to solve this problem?


    Dear PhD: I am trying to train a vocoder. I have installed ParallelWaveGAN and ran run.sh, but it failed with this traceback:

    Traceback (most recent call last):
      File "/home/liyp/anaconda3/envs/xll/bin/parallel-wavegan-preprocess", line 11, in <module>
        load_entry_point('parallel-wavegan', 'console_scripts', 'parallel-wavegan-preprocess')()
      File "/data2/hcy/VQMIVC-main/vocoder/ParallelWaveGAN/parallel_wavegan/bin/preprocess.py", line 186, in main
        ), f"{utt_id} seems to have a different sampling rate."

    I found that the sampling rate there is 24000 Hz, whereas VQMIVC uses 16000 Hz. Could you tell me how to modify the sampling rate?

    opened by Hu-chengyang 3
Owner
Disong Wang
PhD student @ CUHK, focus on voice conversion & speech synthesis.