Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

Last update: Jan 03, 2023

Overview

Espresso

Espresso is an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq. Espresso supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which a fast, parallelized decoder is implemented.

We provide state-of-the-art training recipes for the following speech datasets:

What's New:

April 2021: On-the-fly feature extraction from raw waveforms with torchaudio is supported. A LibriSpeech recipe has been released that has no dependency on Kaldi and uses YAML files (via Hydra) to configure experiments.
June 2020: Transformer recipes released.
April 2020: Both E2E LF-MMI (using PyChain) and Cross-Entropy training for hybrid ASR are now supported. WSJ recipes are provided as examples of each.
March 2020: SpecAugment is supported and relevant recipes are released.
September 2019: We are working to isolate Espresso from fairseq, which will result in a standalone package that can be installed directly with pip.

Requirements and Installation

PyTorch version >= 1.5.0
Python version >= 3.6
For training new models, you'll also need an NVIDIA GPU and NCCL
To install Espresso from source and develop locally:

git clone https://github.com/freewym/espresso
cd espresso
pip install --editable .

# on MacOS:
# CFLAGS="-stdlib=libc++" pip install --editable ./
pip install kaldi_io sentencepiece soundfile
cd espresso/tools; make KALDI=<path/to/a/compiled/kaldi/directory>

Add your Python path to the PATH variable in examples/asr_<dataset>/path.sh. The current default is ~/anaconda3/bin.

kaldi_io is required for reading Kaldi scp files. sentencepiece is required for training subword models and encoding text into subword pieces. soundfile is required for reading raw waveform files. Kaldi is required for data preparation, feature extraction, scoring for some datasets (e.g., Switchboard), and decoding for all hybrid systems.
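To make the role of scp files more concrete, here is a minimal sketch of how a single scp line is laid out: an utterance key, followed by a file location that may carry a byte offset into an archive. The function name and the example paths are hypothetical and are not part of Espresso or kaldi_io. In practice kaldi_io does this parsing and also loads the data, so the sketch is for illustration only.

```python
def parse_scp_line(line):
    """Split one Kaldi scp line into (key, path, offset).

    An scp line has the form "<key> <location>", where the location is
    either a plain file name or "<archive>:<byte offset>".
    Hypothetical helper for illustration; not Espresso's own code.
    """
    key, location = line.strip().split(None, 1)
    path, sep, offset = location.rpartition(":")
    if sep and offset.isdigit():
        # Location points into an archive at a given byte offset.
        return key, path, int(offset)
    # Plain file name (or a command pipe) with no offset.
    return key, location, None


# Example lines with made-up utterance ids and paths.
print(parse_scp_line("utt1 /data/feats.ark:42"))
print(parse_scp_line("utt2 /data/utt2.wav"))
```

Keeping the offset separate from the path is what allows many utterances to share one archive file while still being read individually.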

If you want to use PyChain for LF-MMI training, you also need to install PyChain (and OpenFst):

Edit the PYTHON_DIR variable in espresso/tools/Makefile (default: ~/anaconda3/bin), and then run:

cd espresso/tools; make openfst pychain

For faster training, install NVIDIA's apex library:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
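
Before launching a recipe, it can help to confirm that the environment meets the requirements listed in this section (Python >= 3.6, PyTorch >= 1.5.0, and a visible GPU for training). The following is a minimal sketch of such a check. The helper names `version_tuple`, `meets_minimum`, and `check_environment` are hypothetical and are not shipped with Espresso.

```python
import re
import sys


def version_tuple(version):
    """Turn a string such as '1.5.0+cu101' into (1, 5, 0)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that '1.5' compares equal to '1.5.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment():
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python >= 3.6 is required")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
    else:
        if not meets_minimum(torch.__version__, "1.5.0"):
            problems.append("PyTorch >= 1.5.0 is required, found " + torch.__version__)
        if not torch.cuda.is_available():
            problems.append("no CUDA GPU is visible (needed for training new models)")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

The check only reports problems and never aborts, since decoding with an existing model may still work on a machine without a GPU.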

License

Espresso is MIT-licensed.

Citation

Please cite Espresso as:

@inproceedings{wang2019espresso,
  title = {Espresso: A Fast End-to-end Neural Speech Recognition Toolkit},
  author = {Yiming Wang and Tongfei Chen and Hainan Xu 
            and Shuoyang Ding and Hang Lv and Yiwen Shao 
            and Nanyun Peng and Lei Xie and Shinji Watanabe 
            and Sanjeev Khudanpur},
  booktitle = {2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year = {2019},
}

Owner

Yiming Wang
