Python Stop Words

Overview

Get list of common stop words in various languages in Python.

Available languages

  • Arabic
  • Bulgarian
  • Catalan
  • Czech
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Indonesian
  • Italian
  • Norwegian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Turkish
  • Ukrainian

Installation

stop-words is available on PyPI:

http://pypi.python.org/pypi/stop-words

You can install it with pip:

$ pip install stop-words

Alternatively, clone the stop-words Git repository (the --recursive flag also fetches the stop word lists, which live in a submodule):

$ git clone --recursive git://github.com/Alir3z4/python-stop-words.git

Then install it by running:

$ python setup.py install

Basic usage

from stop_words import get_stop_words

stop_words = get_stop_words('en')
stop_words = get_stop_words('english')

from stop_words import safe_get_stop_words

stop_words = safe_get_stop_words('unsupported language')
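
A common next step is removing stop words from a piece of text. The following is a minimal sketch of one way to do it, not part of the package: `remove_stop_words` is a hypothetical helper, the tokenization (lowercased runs of letters, with inner apostrophes kept) is an assumption, and the small fallback set exists only so the sketch runs when the package is not installed.

```python
import re

try:
    from stop_words import get_stop_words
    STOP_WORDS = set(get_stop_words('en'))
except ImportError:
    # Tiny illustrative fallback so the sketch runs without the package.
    STOP_WORDS = {'the', 'of', 'is', 'in', 'a', 'it', 'to'}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the lowercased words of `text` that are not stop words.

    Hypothetical helper: tokenizes on runs of letters and keeps
    apostrophes inside words (e.g. "don't").
    """
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    return [token for token in tokens if token not in stop_words]


text = ("The University of Waterloo Stratford Campus is located "
        "in Stratford Ontario Canada.")
print(remove_stop_words(text))
```

Converting the list to a set keeps each membership check fast on long texts. Join the result with `' '.join(...)` if a string is needed rather than a list.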

Python compatibility

Python Stop Words is compatible with:

  • Python 2.7
  • Python 3.4
  • Python 3.5
  • Python 3.6
  • Python 3.7
Comments
  • Enforces packaging of eggs into folders.

    We had an error in our CI pipeline where a package build would fail since the .egg of stop-words is downloaded as a zip.

    This leads to the following error where the initializer tries to open a directory when it is actually a zip archive.

    Not a directory: '/opt/project/.eggs/stop_words-2015.2.23.1-py3.6.egg/stop_words/stop-words/languages.json'

    opened by hfjn 10
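
The "Not a directory" error in the report above can be reproduced without the package. This is a minimal sketch under stated assumptions: `example.egg` is a hypothetical archive name, and the inner path simply mirrors the one in the error message. It shows that a plain `open()` cannot reach into a zipped egg, while `zipfile` can.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a zip archive that stands in for a zipped .egg (hypothetical name).
    egg_path = os.path.join(tmp, "example.egg")
    with zipfile.ZipFile(egg_path, "w") as egg:
        egg.writestr("stop_words/stop-words/languages.json", "{}")

    # Treating the archive as a directory fails: it is a regular file.
    inner = os.path.join(egg_path, "stop_words", "stop-words", "languages.json")
    try:
        open(inner, "rb").close()
        error = None
    except OSError as exc:  # NotADirectoryError on Linux and macOS
        error = exc

    # Reading through zipfile works, because it understands the archive format.
    with zipfile.ZipFile(egg_path) as egg:
        data = egg.read("stop_words/stop-words/languages.json")

print(type(error).__name__, data)
```

Installing the egg unpacked, as a folder rather than a zip, lets ordinary `open()` calls work, which appears to be what this change enforces.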
  • add indonesian stop word list

    Adds a stop word list for the Indonesian language and the corresponding mapping in the JSON file. Source: https://www.illc.uva.nl/Research/Publications/Reports/MoL-2003-02.text.pdf

    opened by frankdevans 4
  • can you handle a text?

    Hello, there is no description of how to use this. I have the following text:

    The University of Waterloo Stratford Campus is located in Stratford Ontario Canada. It is one of the three satellite campuses of the University of Waterloo a member of the U15 Group of Canadian Research Universities. Established in June 2009 the University of Waterloo Stratford Campus is part of the Faculty of Arts at the University of Waterloo.

    How can I use python-stop-words to filter out the stop words and get a text without them?

    thank you very much!!

    question 
    opened by PapaMadeleine2022 2
  • Python 3 support

    List of improvements:

    • Tests
    • Python 3 support
    • Dev installation via zc.buildout
    • Continuous integration via Travis

    Can you make a new release once the branch is merged?

    Regards

    enhancement 
    opened by Fantomas42 2
  • languages.json is missing, if you don't git clone with `--recursive`

    languages.json is still missing, if you don't clone with --recursive

    $ git clone git://github.com/Alir3z4/python-stop-words.git
    $ cd python-stop-words
    $ python3 setup.py install
    Traceback (most recent call last):
      File "setup.py", line 5, in <module>
        version=__import__("stop_words").get_version(),
      File "./stop_words/__init__.py", line 9, in <module>
        with open(os.path.join(STOP_WORDS_DIR, 'languages.json'), 'rb') as map_file:
    FileNotFoundError: [Errno 2] No such file or directory: './stop_words/stop-words/languages.json'

    opened by marcindulak 1
  • Update submodule to the latest

    Include the stops for newly added languages

    https://github.com/Alir3z4/stop-words/pull/4 https://github.com/Alir3z4/stop-words/pull/5 https://github.com/Alir3z4/stop-words/pull/6 https://github.com/Alir3z4/stop-words/pull/7

    enhancement 
    opened by norkans7 1
  • Decode error AND Add catalan language to LANGUAGE_MAPPING

    1. Add the Catalan language to LANGUAGE_MAPPING. I previously added the file with its stop words to the "stop-words" project.

    2. Decode error

    stop_words = [line.strip().decode('utf-8')
                 for line in language_file.readlines()]
    

    strip() returns a copy of the string with leading and trailing whitespace characters removed. If the string contains non-ASCII characters, calling strip() before decoding can raise a UnicodeDecodeError (e.g. UnicodeDecodeError: 'utf8' codec can not decode byte 0xc3 in position 34: unexpected end of data).

    The workaround is to reorder the call:

    stop_words = [line.decode('utf-8').strip()
                 for line in language_file.readlines()]
    
    opened by dmiro 1
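
The reordering described in the report above can be sketched in Python 3 as follows. The sample Catalan words and the `read_stop_words` helper are illustrative assumptions. The second half simulates the failure by stripping the byte 0xa0 explicitly; that is only one plausible cause (for example, a locale-dependent `strip()` on byte strings in Python 2), not something the report confirms.

```python
import io


def read_stop_words(language_file):
    """Decode each raw line as UTF-8 first, then strip whitespace."""
    return [line.decode("utf-8").strip() for line in language_file.readlines()]


# Illustrative input: two UTF-8 encoded words, one per line.
raw = io.BytesIO("això\nperò\n".encode("utf-8"))
words = read_stop_words(raw)
print(words)

# Simulated failure: "à" is encoded as the two bytes 0xc3 0xa0. Stripping
# 0xa0 as if it were whitespace leaves a dangling 0xc3 that cannot be decoded.
broken = "à\n".encode("utf-8").strip(b"\n\xa0")
try:
    broken.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(broken, decode_failed)
```

Decoding first means `strip()` operates on text rather than raw bytes, so it can never cut a multi-byte character in half.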
  • Defining custom stop words in NLTK

    Hi, I want to know how to define my own custom stop words. I'm currently developing a sentiment analysis for my local language, using a Naive Bayes classifier to classify the text. I'm quite new to this type of NLP project, so sorry if there's a method that I missed.

    Hope you can help me thanks.

    opened by AllikDaniel 0
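
One way to approach the question above is to treat the stop word list as ordinary Python data and combine it with your own words. The sketch below is not an API of this package or of NLTK: `build_stop_words` is a hypothetical helper, and the short `base` list is a stand-in for whatever list you start from (for instance the result of `get_stop_words('en')`).

```python
def build_stop_words(base_words, extra_words=(), keep_words=()):
    """Combine a base list with custom additions and exceptions.

    Hypothetical helper: `extra_words` are added to the base list,
    `keep_words` are removed from it. Everything is lowercased.
    """
    combined = {word.lower() for word in base_words}
    combined |= {word.lower() for word in extra_words}
    return combined - {word.lower() for word in keep_words}


base = ['the', 'a', 'not', 'is']  # stand-in for a real stop word list
stop_words = build_stop_words(base, extra_words=['rt', 'via'], keep_words=['not'])
print(sorted(stop_words))  # ['a', 'is', 'rt', 'the', 'via']
```

For sentiment analysis it can be worth keeping negations such as "not" out of the stop word set, since removing them may flip the meaning of a sentence; whether that helps depends on the data.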
  • Example not work on python 3.7.0

    It returns an empty list []

    from stop_words import get_stop_words
    
    stop_words = get_stop_words('en')
    stop_words = get_stop_words('english')
    
    from stop_words import safe_get_stop_words
    
    stop_words = safe_get_stop_words('unsupported language')
    print(stop_words)
    
    opened by nadavvin 2
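
The empty list in the report above is consistent with the last assignment winning: the snippet prints only the result for 'unsupported language'. The sketch below illustrates that behavior with stand-ins. `lookup` and `safe_lookup` are hypothetical functions, and the assumption that the "safe" variant returns an empty list for an unknown language is inferred from the report rather than taken from the package's documentation.

```python
class UnknownLanguageError(Exception):
    """Raised by the stand-in lookup for a language it does not know."""


# Illustrative data only; the real package ships much longer lists.
_WORDS = {'en': ['a', 'the'], 'english': ['a', 'the']}


def lookup(language):
    """Stand-in for a strict lookup that raises on unknown languages."""
    try:
        return list(_WORDS[language])
    except KeyError:
        raise UnknownLanguageError(language) from None


def safe_lookup(language):
    """Stand-in for a lenient lookup that returns [] instead of raising."""
    try:
        return lookup(language)
    except UnknownLanguageError:
        return []


stop_words = lookup('en')
print(stop_words)                            # words for a supported language
stop_words = safe_lookup('unsupported language')
print(stop_words)                            # [] because the name is unknown
```

Printing after each assignment, rather than once at the end, makes it clear which call produced which result.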
Releases(2018.7.23)
  • 2018.7.23(Jul 23, 2018)

    2018.7.23

    • Fixed #14: languages.json is missing, if you don't git clone with --recursive.
    • Feature: Support latest version of Python (3.7+).
    • Feature #22: Enforces packaging of eggs into folders.
    • Update the stop-words repository to get the latest languages.
    • Fixed Travis and test failures caused by bootstrap.

    PyPI: https://pypi.org/project/stop-words/2018.7.23/

    To install:

    $ pip install stop-words==2018.7.23
    
  • 2015.2.23.1(Feb 23, 2015)

  • 2015.2.23(Feb 23, 2015)

    2015.2.23


    • Feature: Using the cache is optional
    • Feature: Filtering stopwords

    Special thanks to Taras Labiak @kissarat

    PyPI: https://pypi.python.org/pypi/stop-words/2015.2.21

  • 2015.2.21(Feb 21, 2015)

    2015.2.21


    • Feature: LANGUAGE_MAPPING is loaded from stop-words/languages.json
    • Fix: Made paths OS-independent

    PyPI: https://pypi.python.org/pypi/stop-words/2015.2.21

    Special thanks to Taras Labiak @kissarat

  • 2015.1.31(Feb 1, 2015)

  • 2015.1.22(Jan 22, 2015)

    2015.1.22


    • Feature: Tests
    • Feature: Python 3 support
    • Feature: Dev installation via zc.buildout
    • Feature: Continuous integration via Travis

    PyPI: https://pypi.python.org/pypi/stop-words/2015.1.22

  • 2015.1.19(Jan 19, 2015)

Owner
Alireza Savand
I am Alireza Savand, a Software Architect.