German Text-To-Speech Engine using Tacotron and Griffin-Lim

Last update: Aug 28, 2022

Overview

jotts

JoTTS is a German text-to-speech engine using Tacotron and Griffin-Lim. The synthesizer model has been trained on my voice using Tacotron1. Because the engine is meant for real-time use, I decided not to include a vocoder and to use Griffin-Lim instead, which results in a more robotic voice but is much faster.

API

First, create an instance of JoTTS. The initializer takes an optional force_model_download parameter, which forces a fresh download in case the last download of the synthesizer model failed and the model cannot be loaded.
Call speak with the text to be spoken aloud. Set the second parameter to True to wait until speaking is done.
Use text2wav to create a WAV file instead of speaking the text.

Example usage

from jotts import JoTTS
jotts = JoTTS()
jotts.speak("Das Wetter heute ist fantastisch.", True)
jotts.text2wav("Es war aber auch schon mal besser!")

Todo

Add an option to change the default audio device used to speak the text
Add a parameter to select a model other than the default model
Add threading or multiprocessing to allow speaking without blocking
Add a vocoder instead of Griffin-Lim to improve audio output
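
Until non-blocking speech is built in, the threading item above could be approximated from the calling side. The following is only a minimal sketch, not part of JoTTS: speak_in_background is a hypothetical helper name, and it assumes nothing more than an object with a speak(text, wait) method as described in the API section.

```python
import threading


def speak_in_background(tts, text):
    """Run tts.speak(text, True) on a worker thread and return the thread.

    `tts` is any object with a speak(text, wait) method, e.g. a JoTTS
    instance. This helper is hypothetical and not part of the jotts package.
    """
    worker = threading.Thread(target=tts.speak, args=(text, True), daemon=True)
    worker.start()
    return worker
```

The caller keeps the returned thread and calls join() on it before exiting. Because the thread is a daemon, speech would otherwise be cut off when the main program ends.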

Training a model for your own voice

Training a synthesizer model is easy if you know how to do it. I created a course on Udemy to show you how it is done. Don't buy the tutorial at full price; there is a discount every month :-)

https://www.udemy.com/course/voice-cloning/

If you have neither the background nor the resources, or if you are just lazy or too rich, contact me for contract work. Cloning a voice normally needs ~15 minutes of clean audio from the voice you want to clone.

Disclaimer

I hope that my (and any other person's) voice will be used only for legal and ethical purposes. Please do not get into mischief with it.

Comments

SSL: CERTIFICATE_VERIFY_FAILED

my code is

from jotts import JoTTS
jotts = JoTTS()
jotts.speak("Das Wetter heute ist fantastisch.", True)
jotts.textToWav("Es war aber auch schon mal besser!")

and I receive this :

2022-11-01 09:39:57.536 | DEBUG    | jotts.jotts:__init__:66 - Initializing JoTTS...
2022-11-01 09:39:57.537 | DEBUG    | jotts.jotts:__prepare_model__:50 - There is no tts model yet, downloading...
2022-11-01 09:39:57.537 | DEBUG    | jotts.jotts:__prepare_model__:60 - Download file: https://github.com/padmalcom/jotts/releases/download/v0.1/v0.1.pt
v0.1.pt: 0.00B [00:00, ?B/s]

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1016, in _send_output
    self.send(msg)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 956, in send
    self.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1392, in connect
    server_hostname=server_hostname)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 412, in wrap_socket
    session=session
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 853, in _create
    self.do_handshake()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 1117, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 2, in <module>
    jotts = JoTTS()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/jotts/jotts.py", line 68, in __init__
    MODEL_FILE = self.__prepare_model__(force_model_download);
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/jotts/jotts.py", line 62, in __prepare_model__
    urllib.request.urlretrieve(DOWNLOAD_URL, filename=MODEL_FILE, reporthook=t.update_to)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1360, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1319, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)>

What am I doing wrong? Thanks!

opened by deladriere 3
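
The traceback above shows the model download going through urllib.request.urlretrieve and failing because Python cannot find a local issuer certificate. On python.org installations for macOS this is commonly resolved by running the "Install Certificates.command" script that ships with the installer. As an alternative, the sketch below installs a global urllib opener with an explicit CA bundle before JoTTS is created. It is a workaround under assumptions, not something the jotts package provides: the helper names are hypothetical, and the third-party certifi package is used only if it happens to be installed.

```python
import ssl
import urllib.request


def make_verified_context():
    """Return an SSL context that still verifies server certificates.

    Uses the CA bundle from the optional third-party `certifi` package
    when available, otherwise the system default trust store.
    """
    try:
        import certifi  # assumption: installed separately via pip
        cafile = certifi.where()
    except ImportError:
        cafile = None
    return ssl.create_default_context(cafile=cafile)


def install_verified_opener():
    """Make urllib.request (including urlretrieve) use that context."""
    context = make_verified_context()
    handler = urllib.request.HTTPSHandler(context=context)
    urllib.request.install_opener(urllib.request.build_opener(handler))
    return context
```

Calling install_verified_opener() once before JoTTS() should let the download use the bundled certificates. Certificate verification stays enabled, which is preferable to switching it off.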

Samples of jotts in combination with a modern vocoder like (MB)Melgan, HifiGAN

I tried to dump a spectrogram sample as npy and feed it to HifiGAN, but it gave me a lot of noise. I am wondering how good your results are. Do you have samples with vocoders like the ones above?

opened by eqikkwkp25-cyber 2

jotts.text2wav not existing / needs jotts.textToWav

Running this example on macOS 11.6

from jotts import JoTTS

jotts = JoTTS()
jotts.speak("Das Wetter heute ist fantastisch.", True)
jotts.speak("Wir sind Die Roboter.", True)
jotts.text2wav("Es war aber auch schon mal besser!")

gives an error when trying to generate the WAV file (the speak function works really well!)

2021-12-14 17:41:22.415 | DEBUG    | jotts.jotts:__init__:66 - Initializing JoTTS...
2021-12-14 17:41:22.415 | DEBUG    | jotts.jotts:__init__:83 - Using CPU for inference.
2021-12-14 17:41:22.415 | DEBUG    | jotts.jotts:__init__:85 - Loading the synthesizer...
Synthesizer using device: cpu
Trainable Parameters: 30.874M
Loaded synthesizer "v0.1.pt" trained to step 79000

| Generating 1/1
[W NNPACK.cpp:79] Could not initialize NNPACK! Reason: Unsupported hardware.


Done.

| Generating 1/1


Done.

Traceback (most recent call last):
  File "test_jotts.py", line 6, in <module>
    jotts.text2wav("Es war aber auch schon mal besser!")
AttributeError: 'JoTTS' object has no attribute 'text2wav'

Using jotts.textToWav works well, but there is still this [W NNPACK.cpp:79] message. Here is the output:

2021-12-14 17:45:31.699 | DEBUG    | jotts.jotts:__init__:66 - Initializing JoTTS...
2021-12-14 17:45:31.700 | DEBUG    | jotts.jotts:__init__:83 - Using CPU for inference.
2021-12-14 17:45:31.700 | DEBUG    | jotts.jotts:__init__:85 - Loading the synthesizer...
Synthesizer using device: cpu
Trainable Parameters: 30.874M
Loaded synthesizer "v0.1.pt" trained to step 79000

| Generating 1/1
[W NNPACK.cpp:79] Could not initialize NNPACK! Reason: Unsupported hardware.


Done.


| Generating 1/1


Done.


| Generating 1/1


Done.

opened by deladriere 2
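
Since the README names the method text2wav while the installed version in this report only offered textToWav, calling code that has to work with either could look the method up at runtime. This is a small sketch with a hypothetical helper, text_to_wav, which is not part of jotts. It assumes only the two method names mentioned in this report.

```python
def text_to_wav(tts, text):
    """Call whichever WAV export method the given TTS object provides.

    Tries the README name (text2wav) first, then the name reported to
    work in this issue (textToWav). Hypothetical helper, not part of jotts.
    """
    for name in ("text2wav", "textToWav"):
        method = getattr(tts, name, None)
        if callable(method):
            return method(text)
    raise AttributeError(
        "%s has neither text2wav nor textToWav" % type(tts).__name__
    )
```

With a JoTTS instance, text_to_wav(jotts, "Es war aber auch schon mal besser!") would then work regardless of which of the two names the installed version exposes.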

Can this run on a Raspberry Pi Zero?

Sorry, this is not an issue, but I would like to have a Raspberry Pi Zero speak German without the need for an Internet connection. (Amazon Polly and IBM Watson have great German voices, but they are paid services that are quite complex to install, not to mention the need for a connection and its delays.) I just subscribed to your course (I understand only a bit of German) ;-) Maybe some of the heavy work can be done on a fast computer, but I need the text-to-speech to be done on the Raspberry Pi.

opened by deladriere 2

Missing additional information in README

Typo somewhere: The readme says "The synthesizer model has been trained on my voice using Tacotron1." while the releases say "v0.1 Latest Pre-trained German synthesizer model based on tacotron2."

Can you add more hints on how you trained your model(s), i.e. which base repository and data structure you used, and how many hours of your voice are needed for the current results?

opened by eqikkwkp25-cyber 1

Releases(generic_v0.4)

generic_v0.4 (Dec 30, 2022)

Trained for 98k steps on the German Common Voice dataset.
Source code (tar.gz)
Source code (zip)
generic_v0.4.pt (353.51 MB)

vocoder_v0.1 (Nov 8, 2022)

WaveRNN vocoder trained for 142,000 steps. Can be used instead of the Griffin-Lim algorithm; it might deliver better results but requires more resources to run.
Source code (tar.gz)
Source code (zip)
vocoder_v0.1.pt (51.40 MB)

jonas_v0.1 (Nov 22, 2021)

Pre-trained German synthesizer model based on Tacotron.
Source code (tar.gz)
Source code (zip)
jonas_v0.1.pt (353.49 MB)

generic_v0.3 (Oct 27, 2022)

Trained for 75k steps on a high-quality voice.
Source code (tar.gz)
Source code (zip)
generic_v0.3.pt (353.49 MB)

Owner

padmalcom

PhD in Computer Science, interested in machine learning, game programming and robotics. Hope my projects help somewhere.
