Official PyTorch implementation of SyntaSpeech (IJCAI 2022)

Last update: Nov 24, 2022

Overview

SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech

This repository is the official PyTorch implementation of our IJCAI-2022 paper, in which we propose SyntaSpeech for syntax-aware non-autoregressive Text-to-Speech.

SyntaSpeech builds on PortaSpeech (NeurIPS 2021) and adds three new features:

We propose a Syntactic Graph Builder (Sec. 3.1) and a Syntactic Graph Encoder (Sec. 3.2), which prove effective at extracting syntactic features that improve the prosody modeling and duration accuracy of the TTS model.
We introduce Multi-Length Adversarial Training (Sec. 3.3), which replaces the flow-based post-net in PortaSpeech, speeding up inference and improving the naturalness of the audio.
We support three datasets: LJSpeech (single-speaker English dataset), Biaobei (single-speaker Chinese dataset), and LibriTTS (multi-speaker English dataset).

Environments

conda create -n synta python=3.7
conda activate synta
pip install -U pip
pip install Cython numpy==1.19.1
pip install torch==1.9.0 
pip install -r requirements.txt
# install dgl for graph neural networks; dgl-cu102 supports the RTX 2080, dgl-cu113 supports the RTX 3090
pip install dgl-cu102 dglgo -f https://data.dgl.ai/wheels/repo.html 
sudo apt install -y sox libsox-fmt-mp3
bash mfa_usr/install_mfa.sh # install forced alignment tools
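
After running the commands above, a quick import check can confirm that the environment is usable before starting a long training job. The following is a minimal sketch, not part of the repo: the helper name `missing_modules` is hypothetical, and the module names `torch`, `dgl`, `numpy`, and `Cython` are assumed to be the import names of the packages installed above.

```python
import importlib.util
import sys

def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Assumed import names for the packages installed in the steps above.
required = ["torch", "dgl", "numpy", "Cython"]

print("python version:", ".".join(str(v) for v in sys.version_info[:3]))
missing = missing_modules(required)
if missing:
    print("missing modules:", ", ".join(missing))
else:
    print("all required modules were found")
```

If a module is reported as missing, re-running the matching `pip install` line inside the activated `synta` environment is the first thing to try.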

Run SyntaSpeech!

Please follow the steps below to run this repo.

1. Preparation

Data Preparation

You can directly use our binarized datasets for LJSpeech and Biaobei. Download them and unzip them into the data/binary/ folder.

For LibriTTS, you can download the raw dataset and process it with our data_gen modules. Detailed instructions can be found in docs/prepare_data.

Vocoder Preparation

We provide pre-trained vocoders for the three datasets: HiFi-GAN for LJSpeech and Biaobei, and ParallelWaveGAN for LibriTTS. Download and unzip them into the checkpoints/ folder.

2. Training Example

Then you can train SyntaSpeech on any of the three datasets.

cd <the root_dir of your SyntaSpeech folder>
export PYTHONPATH=./
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset # training in LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset # training in Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/libritts/synta.yaml --exp_name libritts_synta --reset # training in LibriTTS
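
The training and inference commands differ only in the config path, the experiment name, and the --infer flag, so they can be generated from one small helper when scripting several runs. This is a sketch under assumptions: `build_run_command` and `build_env` are hypothetical helper names, and only the flags shown in the commands of this README are used.

```python
import os
import shlex

def build_run_command(config, exp_name, infer=False, reset=True):
    """Build the argument list for tasks/run.py using the flags from the README."""
    cmd = ["python", "tasks/run.py", "--config", config, "--exp_name", exp_name]
    if reset:
        cmd.append("--reset")
    if infer:
        cmd.append("--infer")
    return cmd

def build_env(gpu_ids="0"):
    """Copy the current environment and set the two variables the README exports."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = gpu_ids
    env["PYTHONPATH"] = "./"
    return env

train_cmd = build_run_command("egs/tts/lj/synta.yaml", "lj_synta")
infer_cmd = build_run_command("egs/tts/lj/synta.yaml", "lj_synta", infer=True)

# Print the commands; to launch one from the repo root, pass it to
# subprocess.run(train_cmd, env=build_env(), check=True).
print(" ".join(shlex.quote(arg) for arg in train_cmd))
print(" ".join(shlex.quote(arg) for arg in infer_cmd))
```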

3. Tensorboard

tensorboard --logdir=checkpoints/lj_synta
tensorboard --logdir=checkpoints/biaobei_synta
tensorboard --logdir=checkpoints/libritts_synta

4. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset --infer # inference in LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset --infer # inference in Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/libritts/synta.yaml --exp_name libritts_synta --reset --infer # inference in LibriTTS

Audio Demos

Audio samples in the paper can be found in our demo page.

We also provide a HuggingFace demo page for LJSpeech. Try your own sentences there!

Citation

@article{ye2022syntaspeech,
  title={SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech},
  author={Ye, Zhenhui and Zhao, Zhou and Ren, Yi and Wu, Fei},
  journal={arXiv preprint arXiv:2204.11792},
  year={2022}
}

Acknowledgements

Our code is based on the following repos:

Comments

pinyin preprocess problem

005804 你当#1我傻啊#3？脑子#1那么大#2怎么#1塞进去#4？ ni3 dang1 wo2 sha3 a5 nao3 zi5 na4 me5 da4 zen3 me5 sai1 jin4 qu4

txt_struct=[['', ['']], ['你', ['n', 'i3']], ['当', ['d', 'ang1']], ['我', ['uo3']], ['傻', ['sh', 'a3']], ['啊', ['a', '?', 'n', 'ao3']], ['?', ['z', 'i']], ['脑', ['n', 'a4']], ['子', ['m', 'e']], ['那', ['d', 'a4']], ['么', ['z', 'en3']], ['大', ['m', 'e']], ['怎', ['s', 'ai1']], ['么', ['j', 'in4']], ['塞', ['q', 'v4', '?']], ['进', []], ['去', []], ['?', []], ['', ['']]]

ph_gb_word=['', 'n_i3', 'd_ang1', 'uo3', 'sh_a3', 'a_?n_ao3', 'z_i', 'n_a4', 'm_e', 'd_a4', 'z_en3', 'm_e', 's_ai1', 'j_in4', 'q_v4?', '', '', '', '']

What is 'a_?_n_ao3'?

In the mfa_dict, entries such as ch_a1_d_ou1 and a_?_n_ao3 appear.
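
Reading the txt_struct dump above, the "？" after 啊 seems to have been merged into the phoneme stream, so every following character is paired with the syllable of its neighbour. One way to check a line independently of the repo's preprocessing is to strip the prosody marks (#1 to #4) and the punctuation, then pair characters with syllables directly. This is a diagnostic sketch only: `pair_chars_with_pinyin` is a hypothetical helper and is not the repo's text processor.

```python
import re

def pair_chars_with_pinyin(annotated_text, pinyin_line):
    """Pair each Chinese character with its pinyin syllable.

    Prosody marks such as #1..#4 and all non-Chinese characters
    (including full-width punctuation) are dropped first.
    """
    text = re.sub(r"#\d", "", annotated_text)
    chars = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
    syllables = pinyin_line.split()
    if len(chars) != len(syllables):
        raise ValueError(
            "%d characters but %d syllables" % (len(chars), len(syllables))
        )
    return list(zip(chars, syllables))

pairs = pair_chars_with_pinyin(
    "你当#1我傻啊#3？脑子#1那么大#2怎么#1塞进去#4？",
    "ni3 dang1 wo2 sha3 a5 nao3 zi5 na4 me5 da4 zen3 me5 sai1 jin4 qu4",
)
for char, syllable in pairs:
    print(char, syllable)
```

With the punctuation removed, 脑 lines up with nao3 and 去 with qu4, which is the pairing the dump above fails to produce.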

opened by windowxiaoming 2

discriminator output['y_c'] never used

The discriminator's output['y_c'] is never used, and it is never computed in the discriminator's forward function. What does this variable mean? https://github.com/yerfor/SyntaSpeech/blob/5b07439633a3e714d2a6759ea4097eb36d6cd99a/tasks/tts/synta.py#L81

opened by mayfool 2

A question of KL divergence calculation

In modules/tts/portaspeech/fvae.py, SyntaFVAE computes loss_kl (line 121) as loss_kl = ((logqx - logpx) * nonpadding_sqz).sum() / nonpadding_sqz.sum() / logqx.shape[1]. Can someone explain why? I think loss_kl should be computed as loss_kl = logqx.exp() * (logqx - logpx). I would be very grateful for a reply!

opened by JiaYK 2
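
A general note on the question above, independent of the repo's code: when the latent z is drawn from q, the plain average of log q(z) - log p(z) over the samples is already a Monte Carlo estimate of KL(q || p). The weighting by q(z) is supplied by the sampling itself, so multiplying by logqx.exp() again would apply that weight twice. The sketch below checks this for two 1-D Gaussians against the closed-form KL; all function names are illustrative. In the quoted formula, the masked sum divided by nonpadding_sqz.sum() appears to be that average over non-padded positions, and the division by logqx.shape[1] looks like a per-channel normalization, but that reading is an assumption.

```python
import math
import random

def log_normal(x, mean, std):
    """Log density of a 1-D Gaussian N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2

def kl_closed_form(mq, sq, mp, sp):
    """Exact KL(N(mq, sq^2) || N(mp, sp^2))."""
    return math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5

def kl_monte_carlo(mq, sq, mp, sp, n_samples=100_000, seed=0):
    """Estimate KL(q || p) as the mean of log q(z) - log p(z) with z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mq, sq)  # sampling from q provides the q(z) weighting
        total += log_normal(z, mq, sq) - log_normal(z, mp, sp)
    return total / n_samples

exact = kl_closed_form(0.5, 0.8, 0.0, 1.0)
estimate = kl_monte_carlo(0.5, 0.8, 0.0, 1.0)
print("closed form:", round(exact, 4))
print("monte carlo:", round(estimate, 4))
```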

mfa for multi speaker.

The code groups MFA inputs for better parallelism. With multiple speakers, this may go wrong. For the input g_uang3 zh_ou1 n_v3 d_a4 x_ve2 sh_eng1 d_eng1 sh_an1 sh_i1 l_ian2 s_i4 t_ian1 j_ing3 f_ang1 zh_ao3 d_ao4 i2 s_i4 n_v3 sh_i1, the TextGrid is:

	item [1]:
		class = "IntervalTier"
		name = "words"
		xmin = 0.0
		xmax = 9.4444
		intervals: size = 56
			intervals [1]:
				xmin = 0
				xmax = 0.5700000000000001
				text = ""
			intervals [2]:
				xmin = 0.5700000000000001
				xmax = 0.61
				text = "eng"
			intervals [3]:
				xmin = 0.61
				xmax = 0.79
				text = "s_an1"
			intervals [4]:
				xmin = 0.79
				xmax = 0.89
				text = "eng"
			intervals [5]:
				xmin = 0.89
				xmax = 1.06
				text = "i1"
			intervals [6]:
				xmin = 1.06
				xmax = 1.24
				text = "eng"
			intervals [7]:
				xmin = 1.24
				xmax = 1.3
				text = ""
			intervals [8]:
				xmin = 1.3
				xmax = 1.36
				text = "s_an1"
			intervals [9]:
				xmin = 1.36
				xmax = 1.42
				text = ""
			intervals [10]:
				xmin = 1.42
				xmax = 1.49
				text = "eng"
			intervals [11]:
				xmin = 1.49
				xmax = 1.67
				text = "s_i4"
			intervals [12]:
				xmin = 1.67
				xmax = 1.78
				text = "eng"
			intervals [13]:
				xmin = 1.78
				xmax = 1.91
				text = ""
			intervals [14]:
				xmin = 1.91
				xmax = 1.96
				text = "er4"
			intervals [15]:
				xmin = 1.96
				xmax = 2.06
				text = "eng"
			intervals [16]:
				xmin = 2.06
				xmax = 2.19
				text = ""
			intervals [17]:
				xmin = 2.19
				xmax = 2.35
				text = "i1"
			intervals [18]:
				xmin = 2.35
				xmax = 2.53
				text = "eng"
			intervals [19]:
				xmin = 2.53
				xmax = 3.03
				text = "i1"
			intervals [20]:
				xmin = 3.03
				xmax = 3.42
				text = "eng"
			intervals [21]:
				xmin = 3.42
				xmax = 3.48
				text = "i1"
			intervals [22]:
				xmin = 3.48
				xmax = 3.6
				text = ""
			intervals [23]:
				xmin = 3.6
				xmax = 3.64
				text = "eng"
			intervals [24]:
				xmin = 3.64
				xmax = 3.86
				text = "i1"
			intervals [25]:
				xmin = 3.86
				xmax = 3.99
				text = "eng"
			intervals [26]:
				xmin = 3.99
				xmax = 4.59
				text = ""
			intervals [27]:
				xmin = 4.59
				xmax = 4.869999999999999
				text = "er4"
			intervals [28]:
				xmin = 4.869999999999999
				xmax = 4.9799999999999995
				text = "eng"
			intervals [29]:
				xmin = 4.9799999999999995
				xmax = 5.1899999999999995
				text = "s_i4"
			intervals [30]:
				xmin = 5.1899999999999995
				xmax = 5.34
				text = ""
			intervals [31]:
				xmin = 5.34
				xmax = 5.43
				text = "eng"
			intervals [32]:
				xmin = 5.43
				xmax = 5.6
				text = ""
			intervals [33]:
				xmin = 5.6
				xmax = 5.76
				text = "i1"
			intervals [34]:
				xmin = 5.76
				xmax = 6.279999999999999
				text = "eng"
			intervals [35]:
				xmin = 6.279999999999999
				xmax = 6.359999999999999
				text = "s_an1"
			intervals [36]:
				xmin = 6.359999999999999
				xmax = 6.47
				text = ""
			intervals [37]:
				xmin = 6.47
				xmax = 6.6
				text = "eng"
			intervals [38]:
				xmin = 6.6
				xmax = 6.9399999999999995
				text = "i1"
			intervals [39]:
				xmin = 6.9399999999999995
				xmax = 7.039999999999999
				text = "eng"
			intervals [40]:
				xmin = 7.039999999999999
				xmax = 7.289999999999999
				text = "s_an1"
			intervals [41]:
				xmin = 7.289999999999999
				xmax = 7.369999999999999
				text = "eng"
			intervals [42]:
				xmin = 7.369999999999999
				xmax = 7.6
				text = "s_i4"
			intervals [43]:
				xmin = 7.6
				xmax = 7.699999999999999
				text = "eng"
			intervals [44]:
				xmin = 7.699999999999999
				xmax = 7.869999999999999
				text = ""
			intervals [45]:
				xmin = 7.869999999999999
				xmax = 8.049999999999999
				text = "er4"
			intervals [46]:
				xmin = 8.049999999999999
				xmax = 8.26
				text = ""
			intervals [47]:
				xmin = 8.26
				xmax = 8.299999999999999
				text = "eng"
			intervals [48]:
				xmin = 8.299999999999999
				xmax = 8.36
				text = "s_i4"
			intervals [49]:
				xmin = 8.36
				xmax = 8.389999999999999
				text = ""
			intervals [50]:
				xmin = 8.389999999999999
				xmax = 8.42
				text = "eng"
			intervals [51]:
				xmin = 8.42
				xmax = 8.45
				text = ""
			intervals [52]:
				xmin = 8.45
				xmax = 8.59
				text = "s_an1"
			intervals [53]:
				xmin = 8.59
				xmax = 8.83
				text = ""
			intervals [54]:
				xmin = 8.83
				xmax = 9.1
				text = "eng"
			intervals [55]:
				xmin = 9.1
				xmax = 9.44
				text = "i1"
			intervals [56]:
				xmin = 9.44
				xmax = 9.4444
				text = ""
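
A mismatch like the one above can be detected automatically by comparing the interval labels in a TextGrid with the words that were sent to MFA. The sketch below is an assumption-laden illustration, not the repo's alignment code: it uses a simple regex over the long TextGrid text format, scans every tier in the text it is given, and the helper names are hypothetical.

```python
import re

# Matches one interval: xmin, xmax, and text on consecutive lines.
INTERVAL_RE = re.compile(
    r'xmin\s*=\s*([\d.]+)\s*xmax\s*=\s*([\d.]+)\s*text\s*=\s*"([^"]*)"'
)

def read_interval_labels(textgrid_text):
    """Return (xmin, xmax, label) for every interval with a non-empty label."""
    intervals = []
    for xmin, xmax, label in INTERVAL_RE.findall(textgrid_text):
        if label:
            intervals.append((float(xmin), float(xmax), label))
    return intervals

def unexpected_labels(textgrid_text, expected_words):
    """Return the sorted labels that never occur in the expected input."""
    expected = set(expected_words)
    found = {label for _, _, label in read_interval_labels(textgrid_text)}
    return sorted(found - expected)

# First three intervals of the TextGrid shown above.
sample = '''
intervals [1]:
    xmin = 0
    xmax = 0.5700000000000001
    text = ""
intervals [2]:
    xmin = 0.5700000000000001
    xmax = 0.61
    text = "eng"
intervals [3]:
    xmin = 0.61
    xmax = 0.79
    text = "s_an1"
'''
expected = (
    "g_uang3 zh_ou1 n_v3 d_a4 x_ve2 sh_eng1 d_eng1 sh_an1 sh_i1 l_ian2 "
    "s_i4 t_ian1 j_ing3 f_ang1 zh_ao3 d_ao4 i2 s_i4 n_v3 sh_i1"
).split()

print(unexpected_labels(sample, expected))
```

Any label reported here never appears in the input transcript, which suggests the TextGrid was produced for a different utterance or with a different dictionary.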

opened by leon2milan 2

Problem with DDP

Hello, I have been experimenting with this excellent repo, but DDP does not seem to take effect. Am I launching it the wrong way?

CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node 3 tasks/run.py --config //fs.yaml --exp_name fs_test_demo --reset
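
One way to narrow this down is to check what each process actually sees. torch.distributed.launch starts one process per --nproc_per_node and exports RANK and WORLD_SIZE to each of them; whether tasks/run.py reads those variables is not stated here, so the sketch below only reports the environment. `describe_launch_env` is a hypothetical helper that could be called at the top of the entry script.

```python
import os

def describe_launch_env(environ=None):
    """Summarize the distributed-launch variables visible to this process."""
    environ = os.environ if environ is None else environ
    rank = environ.get("RANK")
    world_size = environ.get("WORLD_SIZE")
    gpus = [g for g in environ.get("CUDA_VISIBLE_DEVICES", "").split(",") if g]
    return {
        "launched_by_distributed": rank is not None and world_size is not None,
        "rank": int(rank) if rank is not None else None,
        "world_size": int(world_size) if world_size is not None else None,
        "visible_gpus": gpus,
    }

# Print the summary for the current process.
print(describe_launch_env())
```

If every process reports the same three visible GPUs and a world_size of 3, the launcher itself is working, and the remaining question is how the training script assigns devices to ranks.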

opened by zhazl 0

Releases(v1.0.0)

v1.0.0(May 21, 2022)

We release the pretrained models of SyntaSpeech on LJSpeech, Biaobei, and LibriTTS. For the pretrained vocoders and datasets, please refer to the links provided in README.md.
Source code(tar.gz)
Source code(zip)
biaobei_synta.zip(295.58 MB)
libritts_synta.zip(310.03 MB)
lj_synta.zip(304.98 MB)

Owner

Zhenhui YE

I am currently a second-year computer science Ph.D. student at Zhejiang University, working on deep learning and reinforcement learning.
