A collection of pre-commit hooks for handling text files.

Overview

texthooks

A collection of pre-commit hooks for handling text files.

In particular, hooks for handling unicode characters which may be undesirable in a repository.

Usage with pre-commit

To use with pre-commit, include this repo and the desired hooks in .pre-commit-config.yaml:

- repo: https://github.com/sirosen/texthooks
  rev: 0.1.0
  hooks:
    - id: fix-smartquotes
    - id: fix-ligatures

Standalone Usage

Each hook is usable as a CLI script. Simply

pip install texthooks

and then invoke, e.g.

fix-smartquotes FILENAME

Supported Hooks

fix-smartquotes

This fixes copy-paste from some applications which replace double-quotes with curly quotes. It does not convert corner brackets, braile quotation marks, or angle quotation marks. Those characters are not typically the result of copy-paste errors, so they are allowed.

Low quotation marks vary in usage and meaning by language, and some languages use quotation marks which are facing "outwards" (opposite facing from english). For the most part, these and exotic characters (double-prime quotes) are ignored.

In files with the offending marks, they are replaced and the run is marked as failed.

Overriding Quotation Characters

Two options are available for specifying exactly which characters will be replaced. For ease of use, they are specified as hex-encoded unicode codepoints.

Suppose you wanted to avoid replacing the "Heavy single comma quotation mark ornament" (275C) and the "Heavy single turned comma quotation mark ornament" (275B) characters. You could override the single quote codepoints as follows:

- repo: https://github.com/sirosen/texthooks
  rev: 0.1.0
  hooks:
    - id: fix-smartquotes
      # replace default single quote chars with this set:
      # apostrophe, fullwidth apostrophe, left single quote, single high
      # reversed-9 quote, right single quote
      args: ["--single-quote-codepoints", "0027,FF07,2018,201B,2019"]

fix-ligatures

Automatically find and replace ligature characters with their ascii equivalents.

This replaces liguatures which may be created by programs like LaTeX for presentation with their strictly-equivalent ASCII counterparts. For example, fi and ff may be ligature-ized.

This hook converts these back into ASCII so that tools like grep will behave as expected.

forbid-bidi-controls

This is checker which forbids the use of unicode bidirectional text control characters.

These are directional formatting characters which can be used to construct text with unexpected or unclear semantics. For example, in programming languages which allow bidirectional text in statements, "X" = ייִדיש can be written with right-to-left reversal to mean that the variable ייִדיש is assigned a value of "X".

CHANGELOG

0.2.2

  • Fix a bug in CLI argument handling for all hooks

0.2.1

  • Fix a typo in forbid-bidi-controls entrypoint

0.2.0

  • Add the forbid-bidi-controls hook
  • Adjust the handling of file encodings. Files will be read with UTF-8 encoding by default in most cases.

0.1.0

  • Initial release with fix-ligatures and fix-smartquotes hooks
Owner
Stephen Rosen
Stephen Rosen
Add your new words to a text file and get them randomly.

Memorize-New-Words In this very very very little project, I've wrote a code to memorize new english words. Therefore you can add the words and their m

Mostafa 2 Jul 04, 2022
Repositori untuk belajar pemrograman Python dalam bahasa Indonesia

Python Repositori ini berisi kumpulan dari berbagai macam contoh struktur data, algoritma dan komputasi matematika yang diimplementasikan dengan mengg

Bellshade 111 Dec 19, 2022
Build a translation program similar to Google Translate with Python programming language and QT library

google-translate Build a translation program similar to Google Translate with Python programming language and QT library Different parts of the progra

Amir Hussein Sharifnezhad 3 Oct 09, 2021
Python Lex-Yacc

PLY (Python Lex-Yacc) Copyright (C) 2001-2020 David M. Beazley (Dabeaz LLC) All rights reserved. Redistribution and use in source and binary forms, wi

David Beazley 2.4k Dec 31, 2022
Um simulador de caixa registradora com database usando arquivos .txt

🛒 Caixa Registradora V2 ❓ - Como usar? Execute o caixa-registradora.py, nele vai ter um menu interativo, você pode cadastrar diversos produtos em um

Gabriel 0 Sep 25, 2022
A pipeline for making highlighted text stand-alone.

title emoji colorFrom colorTo sdk app_file pinned decontextualizer 📤 green gray streamlit main.py false Decontextualizer As a second step in improvin

Paul Bricman 26 Dec 17, 2022
Amazing GitHub Template - Sane defaults for your next project!

🚀 Useful README.md, LICENSE, CONTRIBUTING.md, CODE_OF_CONDUCT.md, SECURITY.md, GitHub Issues and Pull Requests and Actions templates to jumpstart your projects.

276 Jan 01, 2023
This repos is auto action which generating a wordcloud made by Twitter.

auto_tweet_wordcloud This repos is auto action which generating a wordcloud made by Twitter. Preconditions Install Python dependencies pip install -r

tubone(Yu Otsubo) 0 Apr 29, 2022
split Word file by chapter

split Word file by chapter we use the mircosoft word api to code this tool api url:https://docs.microsoft.com/zh-cn/dotnet/api/ if this tool is good f

wisdom under lemon trees 5 Nov 06, 2021
Simple python program to auto credit your code, text, book, whatever!

Credit Simple python program to auto credit your code, text, book, whatever! Setup First change credit_text to whatever text you would like to credit

Hashm 1 Jan 29, 2022
An anthology of a variety of tools for the Persian language in Python

An anthology of a variety of tools for the Persian language in Python

Persian Tools 106 Nov 08, 2022
Wikipedia Reader for the GNOME Desktop

Wike Wike is a Wikipedia reader for the GNOME Desktop. Provides access to all the content of this online encyclopedia in a native application, with a

Hugo Olabera 126 Dec 24, 2022
Extract price amount and currency symbol from a raw text string

price-parser is a small library for extracting price and currency from raw text strings.

Scrapinghub 252 Dec 31, 2022
REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.

MUSE stands for Multilingual Universal Sentence Encoder - multilingual extension (supports 16 languages) of Universal Sentence Encoder (USE).

Dani El-Ayyass 47 Sep 05, 2022
Split large XML files into smaller ones for easy upload

Split large XML files into smaller ones for easy upload. Works for WordPress Posts Import and other XML files.

Joseph Adediji 1 Jan 30, 2022
utoken is a multilingual tokenizer that divides text into words, punctuation and special tokens such as numbers, URLs, XML tags, email-addresses and hashtags.

utoken utoken is a multilingual tokenizer that divides text into words, punctuation and special tokens such as numbers, URLs, XML tags, email-addresse

Ulf Hermjakob 11 Jan 05, 2023
Hspell, the free Hebrew spellchecker and morphology engine.

Hspell, the free Hebrew spellchecker and morphology engine.

16 Sep 15, 2022
Tools to extract questionaire of finalexam.eu and provide interactive questionaire with summary

AskMe This script is completely terminal based. No user interface is added. You can get the command line options by using the --help argument. Make su

David Loewe 1 Nov 09, 2021
PyNews 📰 Simple newsletter made with python 🐍🗞️

PyNews 📰 Simple newsletter made with python Install dependencies This project has some dependencies (see requirements.txt) that are not included in t

Luciano Felix 4 Aug 21, 2022
Paranoid text spacing in Python

pangu.py Paranoid text spacing for good readability, to automatically insert whitespace between CJK (Chinese, Japanese, Korean) and half-width charact

Vinta Chen 194 Nov 19, 2022