A collection of scripts to preprocess ASR datasets and finetune language-specific Wav2Vec2 XLSR models

Last update: Oct 23, 2022

wav2vec-toolkit

This repository accompanies the 🤗 HuggingFace Community Paper on finetuning Wav2Vec2 XLSR for low-resource languages [link]

How to contribute

(Mostly identical to the huggingface/datasets contributing guide)

Fork the repository by clicking on the 'Fork' button on the repository's page. This creates a copy of the code under your GitHub user account.

Clone your fork to your local disk, and add the base repository as a remote:

git clone git@github.com:<your Github handle>/wav2vec-toolkit.git
cd wav2vec-toolkit
git remote add upstream https://github.com/anton-l/wav2vec-toolkit.git

Create a new branch to hold your development changes:
```
git checkout -b a-descriptive-name-for-my-changes
```
Do not work on the main branch.
Set up a development environment by running the following command in a virtual environment:
```
pip install -e ".[dev]"
```
(If wav2vec-toolkit was already installed in the virtual environment, remove it with `pip uninstall wav2vec_toolkit` before reinstalling it in editable mode with the `-e` flag.)
Develop the features on your branch.
Format your code. Run black and isort so that your newly added files look nice with the following command:
```
black --line-length 119 --target-version py36 src scripts
isort src scripts
```
Once you're happy with your implementation, add your changes and make a commit to record your changes locally:
```
git add .
git commit
```
It is a good idea to sync your copy of the code with the original repository regularly. This way you can quickly account for changes:
```
git fetch upstream
git rebase upstream/main
```
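To see what the fetch-and-rebase step does without touching a real fork, here is a minimal sketch that simulates it with two throwaway local repositories. The directory names `upstream` and `fork`, the file names, and the commit identity are hypothetical stand-ins chosen for illustration; only the branch name `main` and the `git fetch upstream` / `git rebase upstream/main` commands come from the steps above.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so no real repository is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-in for the base repository, with 'main' as its branch.
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/main
git config user.email "dev@example.com"
git config user.name "Example Dev"
echo "first" > README.md
git add README.md
git commit -q -m "Initial commit"
cd ..

# Stand-in for your local clone, with the base repository added as 'upstream'.
git clone -q upstream fork
cd fork
git config user.email "dev@example.com"
git config user.name "Example Dev"
git remote add upstream ../upstream
git checkout -q -b a-descriptive-name-for-my-changes
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Add feature"

# Meanwhile, the base repository moves ahead by one commit.
cd ../upstream
echo "second" >> README.md
git commit -q -am "Update README"

# Sync: fetch the new commit and replay the feature commit on top of it.
cd ../fork
git fetch -q upstream
git rebase -q upstream/main

# The feature commit now sits directly on top of upstream/main.
git log --oneline
```

After the rebase, the branch contains everything on `upstream/main` plus your own commit on top. If the branch was already pushed before rebasing, a plain `git push` will be rejected because the history was rewritten; whether force-pushing is acceptable depends on the project's conventions.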
Push the changes to your account using:
```
git push -u origin a-descriptive-name-for-my-changes
```
Once you are satisfied, go to the webpage of your fork on GitHub. Click on "Pull request" to send your changes to the project maintainers for review.

Owner

Anton Lozhkov
