Snowball compiler and stemming algorithms

Last update: Jan 07, 2023

Overview

Snowball is a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algorithms implemented using it.

Snowball was originally designed and built by Martin Porter. Martin retired from development in 2014 and Snowball is now maintained as a community project. Martin originally chose the name Snowball as a tribute to SNOBOL, the excellent string handling language from the 1960s. It now also serves as a metaphor for how the project grows by gathering contributions over time.

The Snowball compiler translates a Snowball program into source code in another language. The currently supported targets are ISO C, C#, Go, Java, JavaScript, Object Pascal, Python and Rust.

This repository contains the source code for the Snowball compiler and the stemming algorithms. The Snowball compiler is written in ISO C, so you'll need a C compiler which supports C99 to build it (but the C code it generates should work with any ISO C compiler).

See https://snowballstem.org/ for more information about Snowball.

What is Stemming?

Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.
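To make the search behaviour concrete, here is a minimal Python sketch of how stemming lets one query term match several word forms. The suffix list and the functions toy_stem, build_index and search are hypothetical, written only for this illustration. They are not the Snowball English stemmer, which uses far more careful rules, and they are not part of any Snowball API.

```python
from collections import defaultdict

# Hypothetical suffix list for illustration only; the real Snowball English
# stemmer applies a much more careful set of rules.
TOY_SUFFIXES = ("ions", "ion", "ive", "ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids containing a form of it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents matching the stemmed query term."""
    return sorted(index.get(toy_stem(query), set()))


docs = {
    1: "the connection was lost",
    2: "all connections are connective",
    3: "still connecting",
    4: "nothing relevant here",
}
index = build_index(docs)

forms = ["connection", "connections", "connective", "connected", "connecting"]
stems = {toy_stem(w) for w in forms}  # all five forms share one stem
hits = search(index, "connected")     # [1, 2, 3]
```

None of the documents contains the literal word connected, yet the query finds documents 1, 2 and 3 because both the indexed tokens and the query are reduced to the same stem before comparison.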

The stem is often a word itself, but not always, because text search systems (the intended field of use) do not require it. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem). Over-stemming is more problematic than under-stemming, so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word, then Snowball's stemming algorithms likely aren't the right answer.
