Ecommerce product title recognition package

Last update: Mar 03, 2022

Overview

revizor

This package solves the task of splitting a product title string into components such as type, brand, model and article (also known as SKU or product code).
Think of classic named entity recognition, but applied to product titles.

Install

revizor requires Python 3.8+ on Linux or macOS. Windows isn't supported yet, but contributions are welcome.

$ pip install revizor

Usage

from revizor.tagger import ProductTagger

tagger = ProductTagger()
product = tagger.predict("Смартфон Apple iPhone 12 Pro 128 gb Gold (CY.563781.P273)")

assert product.type == "Смартфон"
assert product.brand == "Apple"
assert product.model == "iPhone 12 Pro"
assert product.article == "CY.563781.P273"

Boring numbers

Actually, this is just the output from the flair training log:

Corpus: "Corpus: 138959 train + 15440 dev + 51467 test sentences"
Results:
- F1-score (micro) 0.8843
- F1-score (macro) 0.8766

By class:
ARTICLE    tp: 9893 - fp: 1899 - fn: 3268 - precision: 0.8390 - recall: 0.7517 - f1-score: 0.7929
BRAND      tp: 47977 - fp: 2335 - fn: 514 - precision: 0.9536 - recall: 0.9894 - f1-score: 0.9712
MODEL      tp: 35187 - fp: 11824 - fn: 9995 - precision: 0.7485 - recall: 0.7788 - f1-score: 0.7633
TYPE       tp: 25044 - fp: 637 - fn: 443 - precision: 0.9752 - recall: 0.9826 - f1-score: 0.9789
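
For readers who want to check how the per-class rows relate to the summary scores, here is a minimal sketch that recomputes precision, recall and F1 from the tp/fp/fn counts in the log above. The `Scores` and `scores` names are made up for this illustration and are not part of revizor or flair; the macro score is taken as the plain mean of per-class F1, and the micro score is computed from the summed counts.

```python
from typing import NamedTuple


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def scores(tp: int, fp: int, fn: int) -> Scores:
    """Compute precision, recall and F1 from raw counts (hypothetical helper)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return Scores(precision, recall, f1)


# (tp, fp, fn) per class, copied from the training log above
counts = {
    "ARTICLE": (9893, 1899, 3268),
    "BRAND": (47977, 2335, 514),
    "MODEL": (35187, 11824, 9995),
    "TYPE": (25044, 637, 443),
}

by_class = {label: scores(*c) for label, c in counts.items()}

# Macro F1: unweighted mean of the per-class F1 scores
macro_f1 = sum(s.f1 for s in by_class.values()) / len(by_class)

# Micro F1: sum the counts over all classes first, then score once
total_tp, total_fp, total_fn = (sum(col) for col in zip(*counts.values()))
micro_f1 = scores(total_tp, total_fp, total_fn).f1

for label, s in by_class.items():
    print(f"{label:<8} precision: {s.precision:.4f} recall: {s.recall:.4f} f1: {s.f1:.4f}")
print(f"micro: {micro_f1:.4f} macro: {macro_f1:.4f}")
```

Rounded to four decimals, these values agree with the figures reported in the log, which also shows that MODEL and ARTICLE are the classes pulling the averages down.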

Dataset

The model was trained on an automatically annotated corpus. Since it may be affected by DMCA, we won't publish it.
But we can give a hint on how to obtain it, can't we?
The dataset can be created by scraping any large marketplace, like goods, yandex.market or ozon.
We extract the product title and the table with product info, then parse the brand and model strings from that table.
Now we have the product title, brand and model. Then we can split the product title by the brand string, e.g.:

product_title = "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray"
brand = "Apple"
model = "iPhone 12 Pro"

# Split once on the brand and strip the whitespace left around it
product_type, product_model_plus_some_random_info = (
    part.strip() for part in product_title.split(brand, 1)
)

product_type # => 'Смартфон'
product_model_plus_some_random_info # => 'iPhone 12 Pro 128 Gb Space Gray'
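
The split above can be taken one step further to produce token-level labels. The sketch below is only an illustration of that idea: the BIO tagging scheme, the whitespace tokenization and the `find_span`, `tag_span` and `annotate` helpers are all assumptions made here, not a description of how the actual corpus was built. The label names TYPE, BRAND and MODEL are taken from the results table.

```python
from typing import List, Optional, Tuple


def find_span(tokens: List[str], phrase: List[str]) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first exact occurrence of phrase in tokens."""
    if not phrase:
        return None
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i, i + n
    return None


def tag_span(tags: List[str], start: int, end: int, label: str) -> None:
    """Write B-/I- tags for label over tags[start:end] in place."""
    for i in range(start, end):
        tags[i] = ("B-" if i == start else "I-") + label


def annotate(title: str, brand: str, model: str) -> Optional[List[Tuple[str, str]]]:
    """Label a title with BIO tags, or return None if the brand is not found."""
    tokens = title.split()  # assumption: plain whitespace tokenization
    tags = ["O"] * len(tokens)

    brand_span = find_span(tokens, brand.split())
    if brand_span is None:
        return None  # cannot anchor the split, so skip this title

    brand_start, brand_end = brand_span
    tag_span(tags, 0, brand_start, "TYPE")  # everything before the brand
    tag_span(tags, brand_start, brand_end, "BRAND")

    # Only accept a model match that starts after the brand
    model_span = find_span(tokens, model.split())
    if model_span is not None and model_span[0] >= brand_end:
        tag_span(tags, model_span[0], model_span[1], "MODEL")

    return list(zip(tokens, tags))


example = annotate(
    "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray", "Apple", "iPhone 12 Pro"
)
for token, tag in example:
    print(token, tag)
```

Tokens that match neither the brand nor the model (here the storage size and color) stay tagged as `O`. Titles where the brand string does not appear verbatim are skipped rather than guessed at, which keeps the automatic labels cleaner at the cost of a smaller corpus.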

License

This package is licensed under the MIT license.

Owner

Bureaucratic Labs