Fast, DB-backed pretrained word embeddings for natural language processing.

Last update: Nov 21, 2022

Overview

Embeddings

Embeddings is a Python package that provides pretrained word embeddings for natural language processing and machine learning.

Instead of loading a large file into memory to look up embeddings, the package is backed by a database, so it is fast to load and query:

>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300)
100 loops, best of 3: 12.7 ms per loop

>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300).emb('canada')
100 loops, best of 3: 12.9 ms per loop

>>> g = GloveEmbedding('common_crawl_840', d_emb=300)

>>> %timeit -n1 g.emb('canada')
1 loop, best of 3: 38.2 µs per loop
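
The timings above reflect the general idea of a database-backed lookup: opening the database is cheap, and each query reads only the row for the requested word. The following is a minimal sketch of that idea using the standard library's sqlite3 module. The table name, column names, and the helpers build_db and lookup are hypothetical and chosen for illustration; they are not the package's actual schema or API.

```python
import sqlite3
from array import array


def build_db(vectors):
    """Create an in-memory SQLite table mapping word -> packed float32 vector."""
    db = sqlite3.connect(":memory:")
    # Hypothetical schema: the primary key on `word` gives indexed lookups.
    db.execute("CREATE TABLE embeddings (word TEXT PRIMARY KEY, emb BLOB)")
    rows = [(w, array("f", v).tobytes()) for w, v in vectors.items()]
    db.executemany("INSERT INTO embeddings VALUES (?, ?)", rows)
    db.commit()
    return db


def lookup(db, word):
    """Return the vector for `word` as a list of floats, or None if absent."""
    row = db.execute(
        "SELECT emb FROM embeddings WHERE word = ?", (word,)
    ).fetchone()
    if row is None:
        return None
    vec = array("f")
    vec.frombytes(row[0])  # unpack the stored bytes back into floats
    return vec.tolist()


db = build_db({"canada": [0.5, -1.0, 2.0], "toronto": [1.5, 0.25, -0.75]})
print(lookup(db, "canada"))
print(lookup(db, "atlantis"))
```

Because only one row is read per query, the cost of a lookup does not depend on loading the whole vocabulary first.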

Installation

pip install embeddings  # from PyPI
pip install git+https://github.com/vzhong/embeddings.git  # from GitHub

Usage

On first use, the embeddings are downloaded to disk as a SQLite database. This may take a long time for large embeddings such as GloVe. Subsequent lookups query the database directly. Embedding databases are stored in the $EMBEDDINGS_ROOT directory (defaults to ~/.embeddings). This location is probably undesirable if your home directory is on NFS, as it would slow down database queries significantly.
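
As a rough illustration of the storage rule described above, the sketch below resolves the directory from $EMBEDDINGS_ROOT and falls back to ~/.embeddings. The helper resolve_embeddings_root is hypothetical and is not part of the package; the path /scratch/embeddings is only an example of a local-disk location.

```python
import os


def resolve_embeddings_root(environ=None):
    """Return the directory where embedding databases would be stored.

    Hypothetical helper: mirrors the documented rule that $EMBEDDINGS_ROOT
    is used when set, with ~/.embeddings as the default.
    """
    environ = os.environ if environ is None else environ
    default = os.path.join(os.path.expanduser("~"), ".embeddings")
    root = environ.get("EMBEDDINGS_ROOT") or default
    return os.path.abspath(os.path.expanduser(root))


# With no variable set, the default under the home directory is used.
print(resolve_embeddings_root({}))
# With the variable set, the given directory is used instead.
print(resolve_embeddings_root({"EMBEDDINGS_ROOT": "/scratch/embeddings"}))
```

If your home directory is on NFS, exporting EMBEDDINGS_ROOT to a local-disk path in the shell before starting Python is the safest way to apply this, since this document does not state at which point the package reads the variable.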

from embeddings import GloveEmbedding, FastTextEmbedding, KazumaCharEmbedding, ConcatEmbedding

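# Each embedding is downloaded on first use, then queried from its database.
g = GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)
f = FastTextEmbedding()
k = KazumaCharEmbedding()
# Combine the three embeddings behind a single emb() call.
c = ConcatEmbedding([g, f, k])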
for w in ['canada', 'vancouver', 'toronto']:
    print('embedding {}'.format(w))
    print(g.emb(w))
    print(f.emb(w))
    print(k.emb(w))
    print(c.emb(w))
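
To make the role of ConcatEmbedding in the example above more concrete, here is a small sketch of what concatenating embeddings means: the vectors returned by each component are joined end to end, so the result has the sum of their dimensions. StubEmbedding and StubConcat are hypothetical stand-ins written for this illustration, not the package's classes, and the zero-vector fallback for unknown words is an assumption of the sketch rather than documented behaviour.

```python
class StubEmbedding:
    """Hypothetical stand-in exposing the same emb(word) call used above."""

    def __init__(self, table, d_emb):
        self.table = table
        self.d_emb = d_emb

    def emb(self, word):
        # Assumption for this sketch: unknown words map to a zero vector.
        return list(self.table.get(word, [0.0] * self.d_emb))


class StubConcat:
    """Hypothetical stand-in that joins component vectors end to end."""

    def __init__(self, embeddings):
        self.embeddings = embeddings

    def emb(self, word):
        out = []
        for e in self.embeddings:
            out.extend(e.emb(word))
        return out


a = StubEmbedding({"canada": [1.0, 2.0]}, d_emb=2)
b = StubEmbedding({"canada": [3.0]}, d_emb=1)
c = StubConcat([a, b])
print(c.emb("canada"))       # a 2-dimensional vector followed by a 1-dimensional one
print(len(c.emb("canada")))  # total dimension is 2 + 1
```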

Docker

If you use Docker, an image prepopulated with the Common Crawl 840 GloVe embeddings and Kazuma Hashimoto's character n-gram embeddings is available at vzhong/embeddings. When mounting volumes from this container, set $EMBEDDINGS_ROOT in your own container to /opt/embeddings.

For example:

docker run --volumes-from vzhong/embeddings -e EMBEDDINGS_ROOT='/opt/embeddings' myimage python train.py

Contribution

Pull requests welcome!

Owner

Victor Zhong
