Repository containing the code for An-Gocair text normaliser

Overview

Scottish Gaelic Text Normaliser

The following project contains the code and resources for the Scottish Gaelic text normalisation project. The repo can be cloned and top level functions will allow you to normalise phrases or whole documents.

Installation

To use the program you will have to clone the repo and install dependencies in a python virtualenvironment using python 3 and above.

instructions

from GaelicTextNormaliser import TextNormaliser

normaliser = TextNormaliser(from_config="config.yaml")

normaliser.normalise_doc(doc="Bha rìgh òg Easaidh Ruagh an dèigh dha'n oighreachd fhaotainn da fèin ri mòran àbhachd, ag amharc a mach dè a chordadh ris,'s dè thigeadh r'a nadur.")

"Bha rìgh òg Easaidh Ruadh an dèidh dhan oighreachd fhaotainn da fhèin ri mòran àbhachd, ag amharc a-mach dè a chòrdadh ris,'s dè thigeadh ra nàdar."

Alternatively there is a webapp that can be found at https://www.garg.ed.ac.uk/an_gocair.

Acknowledgements

Scottish Gaelic Lexicon

The lexicon file is provided by Michael Bauer, Scottish Gaelic linguist, author and lead collaborator on the Am Faclaer Baeg SG dictionary. The lexicon is a reformatted version of the dictionary that makes use of Michael's extensive labelling of traditional Gaelic spellings and common misspellings. The resource is extremely vital for the success of the memory based approach.

Rules for Normalisation

The lexical and grammatical rules for normalisation were the result of collaboration between the project leader Dr Will Lamb and Baeur. Both Lamb and Bauer, as fluent Gaelic speakers and experienced proof readers, were able to provide the linguistic rules to be translated into python code.

Scottish Gaelic Part of Speech Tagger

For further conditioning in the rule based approach, part of speech tags were necessary. The code and models for POS tagging is very kindly provided by Loïc Boizou. The scripts were altered slightly to work within the python object.

Further Acknowledgements

This program was funded by the Data-Driven Innovation initiative (DDI), delivered by the University of Edinburgh and Heriot-Watt University for the Edinburgh and South East Scotland City Region Deal. DDI is an innovation network helping organisations tackle challenges for industry and society by doing data right to support Edinburgh in its ambition to become the data capital of Europe. The project was delivered by the Edinburgh Futures Institute (EFI), one of five DDI innovation hubs which collaborates with industry, government and communities to build a challenge-led and data-rich portfolio of activity that has an enduring impact.

BaseCrack is a tool written in Python that can decode all alphanumeric base encoding schemes.

BaseCrack Decoder For Base Encoding Schemes BaseCrack is a tool written in Python that can decode all alphanumeric base encoding schemes. This tool ca

Mufeed VH 383 Dec 27, 2022
AnnIE - Annotation Platform, tool for open information extraction annotations using text files.

AnnIE - Annotation Platform, tool for open information extraction annotations using text files.

Niklas 29 Dec 20, 2022
PyNews 📰 Simple newsletter made with python 🐍🗞️

PyNews 📰 Simple newsletter made with python Install dependencies This project has some dependencies (see requirements.txt) that are not included in t

Luciano Felix 4 Aug 21, 2022
Build a translation program similar to Google Translate with Python programming language and QT library

google-translate Build a translation program similar to Google Translate with Python programming language and QT library Different parts of the progra

Amir Hussein Sharifnezhad 3 Oct 09, 2021
This repository contains scripts to control a RGB text fan attached to a Raspberry Pi.

RGB Text Fan Controller This repository contains scripts to control a RGB text fan attached to a Raspberry Pi. Setup The Raspberry Pi and RGB text fan

Luke Prior 1 Oct 01, 2021
a python package that lets you add custom colors and text formatting to your scripts in a very easy way!

colormate Python script text formatting package What is colormate? colormate is a python library that lets you add text formatting to your scripts, it

Rodrigo 2 Dec 14, 2022
py-trans is a Free Python library for translate text into different languages.

Free Python library to translate text into different languages.

I'm Not A Bot #Left_TG 13 Aug 27, 2022
Deasciify-highlighted - A Python script for deasciifying text to Turkish and copying clipboard

deasciify-highlighted is a Python script for deasciifying text to Turkish and copying clipboard.

Ümit Altıntaş 3 Mar 18, 2022
Shows twitch pay for any streamer from Twitch leaked CSV files.

twitch_leak_csv_reader Shows twitch pay for any streamer from Twitch leaked CSV files. Requirements: You need python3 (you can install python 3 from o

5 Nov 11, 2022
🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️

🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️

Brandon 5.6k Jan 03, 2023
一个可以可以统计群组用户发言,并且能将聊天内容生成词云的机器人

当前版本 v2.2 更新维护日志 更新维护日志 有问题请加群组反馈 Telegram 交流反馈群组 点击加入 演示 配置要求 内存:1G以上 安装方法 使用 Docker 安装 Docker官方安装

机器人总动员 117 Dec 29, 2022
A Python library that provides an easy way to identify devices like mobile phones, tablets and their capabilities by parsing (browser) user agent strings.

Python User Agents user_agents is a Python library that provides an easy way to identify/detect devices like mobile phones, tablets and their capabili

Selwin Ong 1.3k Dec 22, 2022
This script has been created in order to find what are the most common demanded technologies in Data Engineering field.

This is a Python script that given a whole corpus of job descriptions and a file with keywords it extracts the number of number of ocurrences of these keywords and write it to a file. This script it

Antonio Bri Pérez 0 Jul 17, 2022
Paranoid text spacing in Python

pangu.py Paranoid text spacing for good readability, to automatically insert whitespace between CJK (Chinese, Japanese, Korean) and half-width charact

Vinta Chen 194 Nov 19, 2022
RSS Reader application for the Emacs Application Framework.

EAF RSS Reader RSS Reader application for the Emacs Application Framework. Load application (add-to-list 'load-path "~/.emacs.d/site-lisp/eaf-rss-read

EAF 15 Dec 07, 2022
Extract knowledge from raw text

Extract knowledge from raw text This repository is a nearly copy-paste of "From Text to Knowledge: The Information Extraction Pipeline" with some cosm

Raphael Sourty 10 Dec 03, 2022
Python port of Google's libphonenumber

phonenumbers Python Library This is a Python port of Google's libphonenumber library It supports Python 2.5-2.7 and Python 3.x (in the same codebase,

David Drysdale 3.1k Dec 29, 2022
Goblin-sim - Procedural fantasy world generator

goblin-sim This project is an attempt to create a procedural goblin fantasy worl

3 May 18, 2022
基于Pytex的数学建模工具,实现将md文件转换成pdf/tex文档的前后端

Pytex-for-MCM 基于Pytex的数学建模工具,实现将md文件转换成pdf/tex文档的前后端。

3 May 17, 2021
🚩 A simple and clean python banner generator - Banners

🚩 A simple and clean python banner generator - Banners

Kumar Vicku 12 Oct 09, 2022