Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning

Last update: Jan 02, 2023

Overview

Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning

English | 中文

❗ Now we provide inferencing code and pre-training models. You could generate any text sounds you want.

⭐ The model training only uses the corpus of neutral emotion, and does not use any strongly emotional speech.

⭐ There are still great challenges in out-of-domain style transfer. Limited by the training corpus, it is difficult for the speaker-embedding or unsupervised style learning (like GST) methods to imitate the unseen data.

⭐ With the help of Unet network and AdaIN layer, our proposed algorithm has powerful speaker and style transfer capabilities.

Infer code or Colab notebook

Demo results

Paper link

😄 The authors are preparing simple, clear, and well-documented training process of Unet-TTS based on Aishell3. It contains:

MFA-based duration alignment
Multi-speaker TTS with speaker_embedding-Instance-Normalization, and this model provides pre-training Content Encoder.
Unet-TTS training
One-shot Voice cloning inference
C++ inference

Stay tuned!

Install Requirements

Install the appropriate TensorFlow and tensorflow-addons versions according to CUDA version.
The default is TensorFlow 2.6 and tensorflow-addons 0.14.0.

pip install TensorFlowTTS

Usage

see file UnetTTS_syn.py or notebook

CUDA_VISIBLE_DEVICES=0 python UnetTTS_syn.py

from UnetTTS_syn import UnetTTS

models_and_params = {"duration_param": "train/configs/unetts_duration.yaml",
                    "duration_model": "models/duration4k.h5",
                    "acous_param": "train/configs/unetts_acous.yaml",
                    "acous_model": "models/acous12k.h5",
                    "vocoder_param": "train/configs/multiband_melgan.yaml",
                    "vocoder_model": "models/vocoder800k.h5"}

feats_yaml = "train/configs/unetts_preprocess.yaml"

text2id_mapper = "models/unetts_mapper.json"

Tts_handel = UnetTTS(models_and_params, text2id_mapper, feats_yaml)

#text: input text
#src_audio: reference audio
#dur_stat: phoneme duration statistis to contraol speed rate
syn_audio, _, _ = Tts_handel.one_shot_TTS(text, src_audio, dur_stat)

Reference

https://github.com/TensorSpeech/TensorFlowTTS

https://github.com/CorentinJ/Real-Time-Voice-Cloning

Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning

Related tags

Overview

Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning

Install Requirements

Usage

Reference

Owner

Chinese NewsTitle Generation Project by GPT2.带有超级详细注释的中文GPT2新闻标题生成项目。

Code to use Augmented Shapiro Wilks Stopping, as well as code for the paper "Statistically Signifigant Stopping of Neural Network Training"

A CRM department in a local bank works on classify their lost customers with their past datas. So they want predict with these method that average loss balance and passive duration for future.

Azure Text-to-speech service for Home Assistant

Control the classic General Instrument SP0256-AL2 speech chip and AY-3-8910 sound generator with a Raspberry Pi and this Python library.

This converter will create the exact measure for your cappuccino recipe from the grandiose Rafaella Ballerini!

Scikit-learn style model finetuning for NLP

Reproducing the Linear Multihead Attention introduced in Linformer paper (Linformer: Self-Attention with Linear Complexity)

An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hundreds of billions of parameters or larger.

Mycroft Core, the Mycroft Artificial Intelligence platform.

Implementation of COCO-LM, Correcting and Contrasting Text Sequences for Language Model Pretraining, in Pytorch

Kestrel Threat Hunting Language

Translate U is capable of translating the text present in an image from one language to the other.

Code for ACL 2020 paper "Rigid Formats Controlled Text Generation"

This is a Prototype of an Ai ChatBot "Tea and Coffee Supplier" using python.

تولید اسم های رندوم فینگیلیش

Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP

DaCy: The State of the Art Danish NLP pipeline using SpaCy

easySpeech is an open-source Python wrapper for google speech to text API that doesn't require PyAudio(So you especially windows user don't have to deal with the errors while installing PyAudio) and also works with hugging face transformers

This repository details the steps in creating a Part of Speech tagger using Trigram Hidden Markov Models and the Viterbi Algorithm without using external libraries.