Simple, hackable offline speech to text - using the VOSK-API.

Overview

Nerd Dictation

Offline Speech to Text for Desktop Linux. - See demo video.

This is a utility that provides simple access to speech to text on Linux without being tied to a desktop environment.

Simple
This is a single file Python script with minimal dependencies.
Hackable
User configuration lets you manipulate text using Python string operations.
Zero Overhead
As this relies on manual activation there are no background processes.

Dictation is accessed manually with begin/end commands.

This uses the excellent vosk-api.

Usage

It is suggested to bind begin/end/cancel to shortcut keys.

nerd-dictation begin
nerd-dictation end

For details on how this can be used, see: nerd-dictation --help and nerd-dictation begin --help.

Features

Specific features include:

Numbers as Digits

Optional conversion from numbers to digits.

So Three million five hundred and sixty second becomes 3,000,562nd.

A series of numbers (such as reciting a phone number) is also supported.

So Two four six eight becomes 2,468.
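
To illustrate the digit-series case, here is a minimal Python sketch. It is not nerd-dictation's own number parser; the names DIGIT_WORDS and digit_series_to_number are invented for this example, and it only handles single spoken digits.

```python
# Illustrative only: not the parser used by nerd-dictation.
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def digit_series_to_number(text, use_separator=False):
    """Join a run of spoken single digits into one number string."""
    digits = "".join(DIGIT_WORDS[word] for word in text.lower().split())
    if use_separator:
        # int() drops leading zeros, which is acceptable for this sketch.
        return f"{int(digits):,}"
    return digits


print(digit_series_to_number("Two four six eight", use_separator=True))  # 2,468
```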

Time Out
Optionally end speech to text early when no speech is detected for a given number of seconds (without the explicit call to end which is otherwise required).
Output Type
Output can simulate keystroke events (default) or simply print to the standard output.
User Configuration Script
User configuration is just a Python script which can be used to manipulate text using Python's full feature set.

See nerd-dictation begin --help for details on how to access these options.

Dependencies

  • Python 3.
  • The VOSK-API.
  • parec command (for recording from pulse-audio).
  • xdotool command to simulate keyboard input.

Install

pip3 install vosk
git clone https://github.com/ideasman42/nerd-dictation.git
cd nerd-dictation
wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip
mv vosk-model-small-en-us-0.15 model

To test dictation:

./nerd-dictation begin --vosk-model-dir=./model &
# Start speaking.
./nerd-dictation end
  • Reminder that it's up to you to bind begin/end/cancel to actions you can easily access (typically key shortcuts).

  • To avoid having to pass the --vosk-model-dir argument, copy the model to the default path:

    mkdir -p ~/.config/nerd-dictation
    mv ./model ~/.config/nerd-dictation

Hint

Once this is working properly you may wish to download one of the larger language models for more accurate dictation. They are available at https://alphacephei.com/vosk/models.

Configuration

This is an example of a trivial configuration file which simply makes the input text uppercase.

# ~/.config/nerd-dictation/nerd-dictation.py
def nerd_dictation_process(text):
    return text.upper()

A more comprehensive configuration is included in the examples/ directory.
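
As a slightly larger sketch than the uppercase example, the configuration below restores the case of a few common words. Only the nerd_dictation_process(text) entry point comes from this document; the WORD_REPLACEMENTS table and its entries are hypothetical and should be adapted to your own vocabulary.

```python
# ~/.config/nerd-dictation/nerd-dictation.py

# Hypothetical whole-word replacements (VOSK output is all lower-case).
WORD_REPLACEMENTS = {
    "i": "I",
    "i'm": "I'm",
    "linux": "Linux",
    "python": "Python",
}


def nerd_dictation_process(text):
    # Replace whole words only, leaving every other word untouched.
    words = [WORD_REPLACEMENTS.get(word, word) for word in text.split(" ")]
    return " ".join(words)
```

With this in place, dictating "i think python runs on linux" would be typed as "I think Python runs on Linux".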

Hints

  • The processing function can be used to implement your own actions using keywords of your choice. Simply return a blank string if you have implemented your own text handling.
  • Context sensitive actions can be implemented using command line utilities to access the active window.
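
Both hints can be combined in one configuration. The sketch below is one possible approach, not part of nerd-dictation: KEYWORD_ACTIONS, active_window_title and the get_title parameter are names made up for this example, the "lock screen" phrase is an arbitrary choice, and it assumes xdotool and xdg-screensaver are installed.

```python
# ~/.config/nerd-dictation/nerd-dictation.py
import subprocess

# Hypothetical keyword actions: spoken phrase -> callable to run.
KEYWORD_ACTIONS = {
    "lock screen": lambda: subprocess.run(["xdg-screensaver", "lock"], check=False),
}


def active_window_title():
    """Return the focused window's title using xdotool, or "" on failure."""
    try:
        result = subprocess.run(
            ["xdotool", "getactivewindow", "getwindowname"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return ""
    return result.stdout.strip()


def nerd_dictation_process(text, get_title=active_window_title):
    action = KEYWORD_ACTIONS.get(text.strip())
    if action is not None:
        action()
        # The phrase was handled here, so return a blank string (nothing is typed).
        return ""
    if "terminal" in get_title().lower():
        # Context sensitive: keep text lower-case when a terminal is focused.
        return text
    # Elsewhere, capitalize the first character.
    return text[:1].upper() + text[1:]
```

The get_title parameter has a default, so the function can still be called with the text alone; it is only there to make the window lookup easy to replace when testing.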

Paths

Local Configuration
~/.config/nerd-dictation/nerd-dictation.py
Language Model

~/.config/nerd-dictation/model

Note that --vosk-model-dir=PATH can be used to override the default.

Command Line Arguments

Output of nerd-dictation --help

usage:

nerd-dictation [-h]  ...

This is a utility that activates speech to text in Linux. While it could use any system currently it uses the VOSK-API.

positional arguments:

begin: Begin dictation.
end: End dictation.
cancel: Cancel dictation.
optional arguments:
-h, --help show this help message and exit

Subcommand: begin

usage:

nerd-dictation begin [-h] [--cookie FILE_PATH] [--vosk-model-dir DIR]
                     [--pulse-device-name IDENTIFIER]
                     [--sample-rate HZ] [--defer-output] [--continuous]
                     [--timeout SECONDS] [--idle-time SECONDS]
                     [--delay-exit SECONDS]
                     [--punctuate-from-previous-timeout SECONDS]
                     [--full-sentence] [--numbers-as-digits]
                     [--numbers-use-separator] [--output OUTPUT_METHOD]
                     [- ...]

This creates the directory used to store internal data, so other commands such as sync can be performed.

optional arguments:
-h, --help show this help message and exit
--cookie FILE_PATH
  Location for writing a temporary cookie (this file is monitored to begin/end dictation).
--vosk-model-dir DIR
  Path to the VOSK model, see: https://alphacephei.com/vosk/models
--pulse-device-name IDENTIFIER
  The name of the pulse-audio device to use for recording. See the output of "pactl list sources" to find device names (using the identifier following "Name:").
--sample-rate HZ
  The sample rate to use for recording (in Hz). Defaults to 44100.
--defer-output
  When enabled, output is deferred until exiting. This prevents text being typed during speech (implied with --output=STDOUT).
--continuous
  Enable this option when you intend to keep the dictation process enabled for extended periods of time. Without this enabled, the entirety of this dictation session will be processed on every update. Only used when --defer-output is disabled.
--timeout SECONDS
  Time out recording when no speech is processed for the time in seconds. This can be used to avoid having to explicitly exit (zero disables).
--idle-time SECONDS
  Time to idle between processing audio from the recording. Setting to zero is the most responsive at the cost of high CPU usage. The default value is 0.1 (processing 10 times a second), which is quite responsive in practice (the maximum value is clamped to 0.5)
--delay-exit SECONDS
  The time to continue running after an exit request. This can be useful so "push to talk" setups can be released while you finish speaking (zero disables).
--punctuate-from-previous-timeout SECONDS
  The time-out in seconds for detecting the state of dictation from the previous recording. This can be useful so punctuation is added before entering the dictation (zero disables).
--full-sentence
  Capitalize the first character. This is also used to add either a comma or a full stop when dictation is performed under the --punctuate-from-previous-timeout value.
--numbers-as-digits
  Convert numbers into digits instead of using whole words.
--numbers-use-separator
  Use comma separators for numbers.
--output OUTPUT_METHOD
  Method used to output the result of speech to text.

  • SIMULATE_INPUT simulate keystrokes (default).
  • STDOUT print the result to the standard output. Be sure only to handle text from the standard output as the standard error may be used for reporting any problems that occur.
- ...
  End argument parsing. This can be used for user defined arguments which configuration scripts may read from the sys.argv.
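
A configuration script could read such user defined arguments roughly as follows. This is a sketch under assumptions: it takes the lone "-" shown in the usage line as the separator, and the --upper flag and helper names are hypothetical, chosen only for illustration.

```python
# ~/.config/nerd-dictation/nerd-dictation.py
import argparse
import sys


def user_arguments(argv):
    """Return the arguments that follow the lone "-" separator, if any."""
    args = argv[1:]  # skip the program name
    if "-" not in args:
        return []
    return args[args.index("-") + 1:]


def parse_user_options(argv):
    # "--upper" is a hypothetical user defined flag.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--upper", action="store_true")
    options, _unknown = parser.parse_known_args(user_arguments(argv))
    return options


OPTIONS = parse_user_options(sys.argv)


def nerd_dictation_process(text):
    return text.upper() if OPTIONS.upper else text
```

Under these assumptions, running nerd-dictation begin - --upper would make the script type its output in upper case.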

Subcommand: end

usage:

nerd-dictation end [-h] [--cookie FILE_PATH]

This ends dictation, causing the text to be typed in.

optional arguments:
-h, --help show this help message and exit
--cookie FILE_PATH
  Location for writing a temporary cookie (this file is monitored to begin/end dictation).

Subcommand: cancel

usage:

nerd-dictation cancel [-h] [--cookie FILE_PATH]

This cancels dictation.

optional arguments:
-h, --help show this help message and exit
--cookie FILE_PATH
  Location for writing a temporary cookie (this file is monitored to begin/end dictation).

Details

  • Typing in results will never press enter/return.
  • Pulse audio is used for recording.
  • Recording and speech to text are performed in parallel.

Examples

Store the result of speech to text as a variable in the shell:

SPEECH="$(nerd-dictation begin --timeout=1.0 --output=STDOUT)"
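
The same can be done from Python. This is a minimal sketch that assumes nerd-dictation is on the PATH; build_command and dictate are helper names invented here. Only the standard output is captured, since the standard error may carry problem reports.

```python
import subprocess


def build_command(timeout=1.0, executable="nerd-dictation"):
    """Build the command line used in the shell example above."""
    return [executable, "begin", f"--timeout={timeout}", "--output=STDOUT"]


def dictate(timeout=1.0, run=subprocess.run):
    """Run one dictation session and return the recognized text."""
    result = run(
        build_command(timeout),
        stdout=subprocess.PIPE,  # capture stdout only; stderr is left alone
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Usage (requires a working nerd-dictation install and a microphone):
#     speech = dictate(timeout=1.0)
```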

Example Configurations

These are example configurations you may use as a reference.

Other Software

  • Elograf - nerd-dictation GUI front-end that runs as a tray icon.

Limitations

  • Text from VOSK is all lower-case. While the user configuration can be used to set the case of common words like "I", this isn't very convenient (see the example configuration for details).

  • For some users the delay in start up may be noticeable on systems with slower hard disks, especially when running for the first time (a cold start).

    This is a limitation of the choice not to use a service that runs in the background. Recording begins before any of the speech-to-text components are loaded to mitigate this problem.

Further Work

  • Add a general solution to capitalize words (proper nouns for example).
  • Wayland support (this should be quite simple to support and mainly relies on a replacement for xdotool).
  • Add a setup.py for easy installation on users' systems.
  • Possibly other speech to text engines (only if they provide some significant benefits).
  • Possibly support Windows & macOS.
Owner
Campbell Barton