InferSent sentence embeddings

Last update: Dec 27, 2022

Related tags

Overview

InferSent

InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language inference data and generalizes well to many different tasks.

We provide our pre-trained English sentence encoder from our paper and our SentEval evaluation toolkit.

Recent changes: Removed train_nli.py and only kept pretrained models for simplicity. Reason is I do not have time anymore to maintain the repo beyond simple scripts to get sentence embeddings.

Dependencies

This code is written in python. Dependencies include:

Python 2/3
Pytorch (recent version)
NLTK >= 3

Download word vectors

Download GloVe (V1) or fastText (V2) vectors:

mkdir GloVe
curl -Lo GloVe/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip GloVe/glove.840B.300d.zip -d GloVe/
mkdir fastText
curl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
unzip fastText/crawl-300d-2M.vec.zip -d fastText/

Use our sentence encoder

We provide a simple interface to encode English sentences. See demo.ipynb for a practical example. Get started with the following steps:

0.0) Download our InferSent models (V1 trained with GloVe, V2 trained with fastText)[147MB]:

mkdir encoder
curl -Lo encoder/infersent1.pkl https://dl.fbaipublicfiles.com/infersent/infersent1.pkl
curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

Note that infersent1 is trained with GloVe (which have been trained on text preprocessed with the PTB tokenizer) and infersent2 is trained with fastText (which have been trained on text preprocessed with the MOSES tokenizer). The latter also removes the padding of zeros with max-pooling which was inconvenient when embedding sentences outside of their batches.

0.1) Make sure you have the NLTK tokenizer by running the following once:

import nltk
nltk.download('punkt')

1) Load our pre-trained model (in encoder/):

from models import InferSent
V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

2) Set word vector path for the model:

W2V_PATH = 'fastText/crawl-300d-2M.vec'
infersent.set_w2v_path(W2V_PATH)

3) Build the vocabulary of word vectors (i.e keep only those needed):

infersent.build_vocab(sentences, tokenize=True)

where sentences is your list of n sentences. You can update your vocabulary using infersent.update_vocab(sentences), or directly load the K most common English words with infersent.build_vocab_k_words(K=100000). If tokenize is True (by default), sentences will be tokenized using NTLK.

4) Encode your sentences (list of n sentences):

embeddings = infersent.encode(sentences, tokenize=True)

This outputs a numpy array with n vectors of dimension 4096. Speed is around 1000 sentences per second with batch size 128 on a single GPU.

5) Visualize the importance that our model attributes to each word:

We provide a function to visualize the importance of each word in the encoding of a sentence:

infersent.visualize('A man plays an instrument.', tokenize=True)

Evaluate the encoder on transfer tasks

To evaluate the model on transfer tasks, see SentEval. Be mindful to choose the same tokenization used for training the encoder. You should obtain the following test results for the baselines and the InferSent models:

Model	MR	CR	SUBJ	MPQA	STS14	STS Benchmark	SICK Relatedness	SICK Entailment	SST	TREC	MRPC
`InferSent1`	81.1	86.3	92.4	90.2	.68/.65	75.8/75.5	0.884	86.1	84.6	88.2	76.2/83.1
`InferSent2`	79.7	84.2	92.7	89.4	.68/.66	78.4/78.4	0.888	86.3	84.3	90.8	76.0/83.8
`SkipThought`	79.4	83.1	93.7	89.3	.44/.45	72.1/70.2	0.858	79.5	82.9	88.4	-
`fastText-BoV`	78.2	80.2	91.8	88.0	.65/.63	70.2/68.3	0.823	78.9	82.3	83.4	74.4/82.4

Reference

Please consider citing [1] if you found this code useful.

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data (EMNLP 2017)

[1] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

@InProceedings{conneau-EtAl:2017:EMNLP2017,
  author    = {Conneau, Alexis  and  Kiela, Douwe  and  Schwenk, Holger  and  Barrault, Lo\"{i}c  and  Bordes, Antoine},
  title     = {Supervised Learning of Universal Sentence Representations from Natural Language Inference Data},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {670--680},
  url       = {https://www.aclweb.org/anthology/D17-1070}
}

InferSent sentence embeddings

Related tags

Overview

InferSent

Dependencies

Download word vectors

Use our sentence encoder

Evaluate the encoder on transfer tasks

Reference

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data (EMNLP 2017)

Related work

Owner

Facebook Research

State-of-the-art NLP through transformer models in a modular design and consistent APIs.

本插件是pcrjjc插件的重置版，可以独立于后端api运行

Text to speech is a process to convert any text into voice. Text to speech project takes words on digital devices and convert them into audio. Here I have used Google-text-to-speech library popularly known as gTTS library to convert text file to .mp3 file. Hope you like my project!

A framework for implementing federated learning

ACL'2021: Learning Dense Representations of Phrases at Scale

STT for TorchScript is a port of Coqui STT based on DeepSpeech to PyTorch.

Contract Understanding Atticus Dataset

基于pytorch_rnn的古诗词生成

A highly sophisticated sequence-to-sequence model for code generation

Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7 B) on a single 16 GB VRAM V100 Google Cloud instance with Huggingface Transformers using DeepSpeed

⛵️The official PyTorch implementation for "BERT-of-Theseus: Compressing BERT by Progressive Module Replacing" (EMNLP 2020).

⚡ Automatically decrypt encryptions without knowing the key or cipher, decode encodings, and crack hashes ⚡

Text-to-Speech for Belarusian language

auto_code_complete is a auto word-completetion program which allows you to customize it on your need

Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

PortaSpeech - PyTorch Implementation

APEACH: Attacking Pejorative Expressions with Analysis on Crowd-generated Hate Speech Evaluation Datasets

ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab

LightSeq: A High-Performance Inference Library for Sequence Processing and Generation

Easy to use, state-of-the-art Neural Machine Translation for 100+ languages