Text vectorization tool to outperform TFIDF for classification tasks

Last update: Dec 29, 2022

Overview

WHAT: Supervised text vectorization tool

Textvec is a text vectorization tool that aims to implement all the "classic" text vectorization NLP methods in Python. The main idea of this project is to show alternatives to the excellent TFIDF method, which is heavily overused for supervised tasks. All interfaces are similar to scikit-learn, so you should be able to test the performance of these supervised methods with just a few changes.

Textvec is compatible with: Python 2.7-3.7.

WHY: Comparison with TFIDF

As the articles [1, 2] show, supervised methods outperform unsupervised ones on almost every dataset, but most text classification examples on the internet ignore that fact.

Method	IMDB_bin	RT_bin	Airlines Sentiment_bin	Airlines Sentiment_multiclass	20news_multiclass
TF	0.8984	0.7571	0.9194	0.8084	0.8206
TFIDF	0.9052	0.7717	0.9259	0.8118	0.8575
TFPF	0.8813	0.7403	0.9212	NA	NA
TFRF	0.8797	0.7412	0.9194	NA	NA
TFICF	0.8984	0.7642	0.9199	0.8125	0.8292
TFBINICF	0.8984	0.7571	0.9194	NA	NA
TFCHI2	0.8898	0.7398	0.9108	NA	NA
TFGR	0.8850	0.7065	0.8956	NA	NA
TFRRF	0.8879	0.7506	0.9194	NA	NA
TFOR	0.9092	0.7806	0.9207	NA	NA

Here is a comparison for binary classification on the IMDB sentiment dataset. Labels are sorted by accuracy score, and the heatmap shows the correlation between the different approaches. As you can see, some methods are good candidates for ensembling models or for performing feature selection.

For more dataset benchmarks (Rotten Tomatoes, airline sentiment), see Binary classification quality comparison.

Install:

From PyPI:

pip install textvec

From source:

git clone https://github.com/textvec/textvec
cd textvec
pip install .

HOW: Examples

The usage is similar to scikit-learn:

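from sklearn.feature_extraction.text import CountVectorizer
from textvec.vectorizers import TfBinIcfVectorizer

# train_data.text holds the raw training documents; y holds their class labels
cvec = CountVectorizer().fit(train_data.text)
counts = cvec.transform(train_data.text)

# Supervised vectorizers are fitted on the count matrix together with the labels
tficf_vec = TfBinIcfVectorizer(sublinear_tf=True)
tficf_vec.fit(counts, y)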

For more detailed examples see Basic example and other notebooks in Examples
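
To make the idea of a supervised weight concrete, here is a minimal pure-Python sketch of inverse category frequency (ICF) weighting in the spirit of [1]. It is not textvec's implementation: the function names icf_weights and tf_icf, the toy corpus, and the exact formula log(1 + n_classes / cf) are assumptions made for illustration.

```python
import math
from collections import defaultdict


def icf_weights(docs, labels):
    """Return {term: log(1 + n_classes / cf)}, where cf is the number of
    classes in which the term occurs at least once."""
    classes_per_term = defaultdict(set)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            classes_per_term[term].add(label)
    n_classes = len(set(labels))
    return {
        term: math.log(1 + n_classes / len(seen))
        for term, seen in classes_per_term.items()
    }


def tf_icf(tokens, weights):
    """Weight the raw term counts of one document by the fitted ICF values."""
    vector = defaultdict(float)
    for term in tokens:
        if term in weights:  # terms unseen during fitting are skipped
            vector[term] += weights[term]
    return dict(vector)


# Hypothetical toy corpus: two classes, already tokenized.
docs = [["great", "movie"], ["boring", "movie"], ["great", "acting"]]
labels = ["pos", "neg", "pos"]

weights = icf_weights(docs, labels)
print(weights["movie"] < weights["great"])
print(tf_icf(["great", "great", "movie"], weights))
```

A term that appears in every class, such as "movie" here, gets the smallest weight, while a class-specific term such as "great" is boosted. IDF alone cannot make that distinction because it never sees the labels.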

Currently implemented methods:

TfIcfVectorizer
TforVectorizer
TfgrVectorizer
TfigVectorizer
Tfchi2Vectorizer
TfrfVectorizer
TfrrfVectorizer
TfBinIcfVectorizer
TfpfVectorizer
SifVectorizer
TfbnsVectorizer

Most of the vectorization techniques can be found in the articles [1, 2, 3]. If you see a method with a wrong name or reference, please submit a fix!

TODO

Docs

REFERENCE

[1] Deqing Wang and Hui Zhang. Inverse-Category-Frequency based Supervised Term Weighting Schemes for Text Categorization.
[2] M. Lan, C. L. Tan, J. Su, and Y. Lu. Supervised and traditional term weighting methods for automatic text categorization.
[3] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A Simple But Tough-To-Beat Baseline For Sentence Embeddings.
[4] Thanks to aysent for the inspiration.

Comments

A bug exists in vectorizer.py

Line 35, sp.spdiag(y == val ......), might be wrong. I assume you meant to write sp.spdiag([i for i in y if i == val] ......).

Looking forward to your reply, thanks!

opened by dongrixinyu 3
AttributeError: 'NoneType' object has no attribute 'transform'
For some reason, I can't use your vectorizers in a pipeline. Here is my code:

pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words='en', ngram_range=(1,2), analyzer='word')),
    ('transform', TfIcfVectorizer(sublinear_tf=True)),
    ('clf', LinearSVC(class_weight='balanced')),
])
pipeline.fit(X, y)

And I got the following error:

D:\programs\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
    263             This estimator
    264         """
--> 265         Xt, fit_params = self._fit(X, y, **fit_params)
    266         if self._final_estimator is not None:
    267             self._final_estimator.fit(Xt, y, **fit_params)

D:\programs\Anaconda3\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params)
    228                 Xt, fitted_transformer = fit_transform_one_cached(
    229                     cloned_transformer, Xt, y, None,
--> 230                     **fit_params_steps[name])
    231                 # Replace the transformer of the step with the fitted
    232                 # transformer. This is necessary when loading the transformer

D:\programs\Anaconda3\lib\site-packages\sklearn\externals\joblib\memory.py in __call__(self, *args, **kwargs)
    340
    341     def __call__(self, *args, **kwargs):
--> 342         return self.func(*args, **kwargs)
    343
    344     def call_and_shelve(self, *args, **kwargs):

D:\programs\Anaconda3\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, **fit_params)
    614         res = transformer.fit_transform(X, y, **fit_params)
    615     else:
--> 616         res = transformer.fit(X, y, **fit_params).transform(X)
    617     # if we have a weight for this transformer, multiply output
    618     if weight is None:

AttributeError: 'NoneType' object has no attribute 'transform'

My code works fine with TfidfTransformer.

Python 3.7.3 sklearn 0.20.3
opened by Tamplier 2
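
The last frame of the traceback above calls transformer.fit(X, y, **fit_params).transform(X), so the error is what one would expect if fit returned None rather than the fitted object. Whether that was the actual cause in textvec is an assumption here. The sketch below uses two hypothetical classes, BrokenScaler and ChainableScaler, and no scikit-learn code; it only illustrates the convention that fit returns self.

```python
class BrokenScaler:
    """Hypothetical transformer whose fit() forgets to return self."""

    def fit(self, X, y=None):
        self.max_ = max(X)
        # No return statement, so fit() evaluates to None.

    def transform(self, X):
        return [x / self.max_ for x in X]


class ChainableScaler(BrokenScaler):
    """Same logic, but fit() returns self so calls can be chained."""

    def fit(self, X, y=None):
        super().fit(X, y)
        return self


data = [1.0, 2.0, 4.0]

try:
    # Chaining on the result of fit() fails because that result is None.
    BrokenScaler().fit(data).transform(data)
except AttributeError as err:
    message = str(err)
    print(message)

scaled = ChainableScaler().fit(data).transform(data)
print(scaled)
```

Any code that chains fit(...).transform(...) relies on this convention, which is why a transformer can work when called step by step and still fail inside a pipeline.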
Vectorizers now working with GridSearchCV and Pipelines with parameters
Fixes #20 Fixes #21

By adding sklearn.base.BaseEstimator to all vectorizers they are now capable of handling parameters from GridSearchCV.

Also, while testing I verified that the vectorizers now get a "toString" (a readable representation) as well, so both problems are solved.

>>> from textvec.vectorizers import TfIcfVectorizer
>>> TfIcfVectorizer()
TfIcfVectorizer(norm=None, sublinear_tf=False)
>>> from textvec.vectorizers import TfBinIcfVectorizer
>>> TfBinIcfVectorizer()
TfBinIcfVectorizer(norm='l2', smooth_df=True, sublinear_tf=False)
opened by bernardoduarte 1
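
GridSearchCV and Pipeline configure each step through get_params and set_params, which is why inheriting from sklearn.base.BaseEstimator addresses both issues. The following is a simplified stand-in, written without scikit-learn, for the kind of introspection that base class performs. ParamsMixin and ToyVectorizer are hypothetical names, and the real implementation handles more cases (nested estimators, for example).

```python
import inspect


class ParamsMixin:
    """Simplified stand-in for the parameter handling of BaseEstimator."""

    def get_params(self):
        # Read parameter names from the __init__ signature, skipping self.
        names = list(inspect.signature(type(self).__init__).parameters)[1:]
        return {name: getattr(self, name) for name in names}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self

    def __repr__(self):
        items = sorted(self.get_params().items())
        args = ", ".join(f"{key}={value!r}" for key, value in items)
        return f"{type(self).__name__}({args})"


class ToyVectorizer(ParamsMixin):
    """Hypothetical vectorizer; it only stores its constructor arguments."""

    def __init__(self, norm=None, sublinear_tf=False):
        self.norm = norm
        self.sublinear_tf = sublinear_tf


vec = ToyVectorizer()
print(vec)
vec.set_params(sublinear_tf=True)
print(vec.get_params())
```

The approach only works if __init__ stores every argument under an attribute of the same name, which is the same convention scikit-learn asks of its estimators.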
Using tficf without target Y

I want to vectorize text with TfIcfVectorizer() without having to pass the Y label vector. Is that possible?

tficf = vectorizers.TfIcfVectorizer()
tficf_train = tficf.fit_transform(cv_train, newsgroups_train.target)  # here I would like to give the vectorizer one argument instead of two
tficf_test = tficf.transform(cv_test)

opened by banyous 1
Bump numpy from 1.14.2 to 1.22.0
Bumps numpy from 1.14.2 to 1.22.0.

Release notes

Sourced from numpy's releases.

v1.22.0

NumPy 1.22.0 Release Notes

NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.

A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.

NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.

New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.

A new configurable allocator for use by downstream projects.

These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

Expired deprecations

Deprecated numeric style dtype strings have been removed

Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

(gh-19539)

Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

(gh-19615)

... (truncated)

dependencies
opened by dependabot[bot] 0

Textvec vectorizers don't work with GridSearchCV and Pipelines with set_params

The code below throws this error:

AttributeError: 'TfBinIcfVectorizer' object has no attribute 'set_params'

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from textvec.vectorizers import TfBinIcfVectorizer

pipeline = Pipeline([
    ('count', CountVectorizer()),
    ('transformer', TfBinIcfVectorizer()),
    ('model', SVC()),
])

param_grid = {
    'transformer__sublinear_tf': (True, False),
    'count__analyzer': ('word', 'char'),
    'count__max_df': (1.0,),
}

grid_search = GridSearchCV(pipeline, param_grid, n_jobs=-1, verbose=1, refit=True)
grid_search.fit(train_x, train_y)

If I replace TfBinIcfVectorizer with TfidfTransformer, it works, so it seems that something is missing in textvec.

opened by bernardoduarte 0

Missing "toString" from Vectorizers
As the code snippet below shows, scikit-learn's TfidfTransformer, for example, has a pretty representation when printed, while textvec's TfIcfVectorizer doesn't. This seems to be the case for all of them.

I'll look for a way to do it the way scikit-learn does.

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> TfidfTransformer()
TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
>>> from textvec.vectorizers import TfIcfVectorizer
>>> TfIcfVectorizer()
<textvec.vectorizers.TfIcfVectorizer object at 0x7f88de881490>
opened by bernardoduarte 0
Mark TfBNS as solved in README and add it to the Currently implemented methods

Judging by this pull request, the TfBNS item on the TODO list in README.md is now done.

That means that it should be added to the Currently implemented methods and removed from the TODO.

opened by bernardoduarte 0

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Code:

...
train['title'].isnull().sum() 
# Out: 0
title_countvec = CountVectorizer(ngram_range=(1,3), max_features=300000, lowercase=True)
title_countvec.fit(train['title'], y_train)
train_title_countvec = title_countvec.transform(train['title'])
title_vectorizer = TfIcfVectorizer(norm='l2', sublinear_tf=True)
title_vectorizer.fit(train_title_countvec, y_train)
train_title_countvec = title_countvec.transform(train['title'])
np.isfinite(train_title_countvec.data).all(), np.isinf(train_title_countvec.data).any()
# Out: (True, False)
train_transformed['title'] = title_vectorizer.transform(train_title_countvec) 
# Error

Traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-72-5fdf9718ecba> in <module>()
----> 1 train_transformed['title'] = title_vectorizer.transform(train_title_countvec)

/usr/local/lib/python3.5/dist-packages/textvec/vectorizers.py in transform(self, X, min_freq)
     45         X = X * sp.spdiags(self.k, 0, f, f)
     46         if self.norm:
---> 47             X = normalize(X, self.norm)
     48         return X
     49 

~/.local/lib/python3.5/site-packages/sklearn/preprocessing/data.py in normalize(X, norm, axis, copy, return_norm)
   1410 
   1411     X = check_array(X, sparse_format, copy=copy,
-> 1412                     estimator='the normalize function', dtype=FLOAT_DTYPES)
   1413     if axis == 0:
   1414         X = X.T

~/.local/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    429     if sp.issparse(array):
    430         array = _ensure_sparse_format(array, accept_sparse, dtype, copy,
--> 431                                       force_all_finite)
    432     else:
    433         array = np.array(array, dtype=dtype, order=order, copy=copy)

~/.local/lib/python3.5/site-packages/sklearn/utils/validation.py in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite)
    304                           % spmatrix.format)
    305         else:
--> 306             _assert_all_finite(spmatrix.data)
    307     return spmatrix
    308 

~/.local/lib/python3.5/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     42             and not np.isfinite(X).all()):
     43         raise ValueError("Input contains NaN, infinity"
---> 44                          " or a value too large for %r." % X.dtype)
     45 
     46 

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

bug

opened by sharthZ23 4

Releases(v2.0)

v2.0(Sep 12, 2019)
Textvec 1.0.1 -> 2.0

Features

SifVectorizer #5

Scikit-learn compatibility #7

Improvements

Better examples #9

Unit tests #6

Bug Fixes

Sparse data processing fix #8
