SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples

Last update: Jan 02, 2023


This is the repository for SNCSE.

SNCSE aims to alleviate feature suppression in contrastive learning for unsupervised sentence embedding. In this field, feature suppression means that models fail to distinguish and decouple textual similarity from semantic similarity. As a result, they may overestimate the semantic similarity of any pair of sentences with similar text, regardless of the actual semantic difference between them, and they may underestimate the semantic similarity of pairs with fewer words in common. (Please refer to Section 5 of our paper for several instances and detailed analysis.) To this end, we propose to take the negation of the original sentences as soft negative samples, and introduce them into the traditional contrastive learning framework through a bidirectional margin loss (BML).

[Figure: structure of SNCSE]
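As a rough illustration of the bidirectional margin idea described above, the sketch below keeps the soft negative less similar to the anchor than the positive, but only within a bounded band. It is not the repository's training code: the function names, the plain-Python vectors, and the margin values `alpha` and `beta` are assumptions chosen for illustration, and the exact formulation should be taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bidirectional_margin_loss(anchor, positive, soft_negative,
                              alpha=0.1, beta=0.3):
    """Two-sided hinge on the similarity gap (illustrative sketch).

    alpha and beta are hypothetical margins with alpha <= beta.
    The loss is zero when -beta <= delta <= -alpha.
    """
    # delta < 0 means the soft negative is less similar than the positive.
    delta = cosine(anchor, soft_negative) - cosine(anchor, positive)
    # Penalty when the soft negative is not at least alpha below the positive.
    too_close = max(0.0, delta + alpha)
    # Penalty when the soft negative is pushed more than beta below the positive.
    too_far = max(0.0, -delta - beta)
    return too_close + too_far


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    positive = [1.0, 0.0]
    print(bidirectional_margin_loss(anchor, positive, [0.8, 0.6]))
    print(bidirectional_margin_loss(anchor, positive, [1.0, 0.0]))
    print(bidirectional_margin_loss(anchor, positive, [0.0, 1.0]))
```

The two hinge terms are what make the margin "bidirectional" in this sketch: a negated sentence is treated neither as a positive nor as an ordinary negative that should be pushed away without limit.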

[Table: performance of SNCSE on the STS task with different encoders]

To reproduce the above results, please download the files and unzip them to replace the original file folder. Then download the models, modify the file path variables, and run:

python bert_prediction.py
python roberta_prediction.py

To train SNCSE, please download the training file and put it in /SNCSE/data. You can either run:

python generate_soft_negative_samples.py

to generate soft negative samples, or use our file at /Files/soft_negative_samples.txt. Then modify train_SNCSE.sh as needed and run it.
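To make the notion of a soft negative sample concrete, here is a toy sketch that negates a sentence by inserting or removing "not" after the first auxiliary verb. It is not the logic of generate_soft_negative_samples.py; the `AUXILIARIES` set and the `negate` helper are hypothetical and cover only simple English sentences.

```python
# Hypothetical, deliberately small list of auxiliaries for this toy example.
AUXILIARIES = {
    "is", "are", "was", "were", "can", "could", "will", "would",
    "should", "must", "do", "does", "did", "has", "have", "had",
}


def negate(sentence):
    """Return a negated copy of the sentence, or None if no rule applies."""
    words = sentence.split()
    for i, word in enumerate(words):
        if word.lower() in AUXILIARIES:
            # Already negated: drop the "not" to flip the meaning back.
            if i + 1 < len(words) and words[i + 1].lower() == "not":
                return " ".join(words[:i + 1] + words[i + 2:])
            # Otherwise insert "not" right after the auxiliary.
            return " ".join(words[:i + 1] + ["not"] + words[i + 1:])
    # No auxiliary found; a real generator would need richer rules here.
    return None


if __name__ == "__main__":
    print(negate("The cat is sleeping on the sofa."))
    print(negate("He can not swim."))
    print(negate("Birds fly."))
```

The original and negated sentences share almost all of their words but differ in meaning, which is the property that makes such pairs useful against feature suppression.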

To evaluate the checkpoints saved during training on the development set of the STSB task, please run:

python bert_evaluation.py
python roberta_evaluation.py
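For readers unfamiliar with STS-style evaluation, the sketch below shows how a Spearman rank correlation between predicted similarities and gold scores can be computed. It assumes the usual STS protocol of reporting Spearman correlation and is not taken from bert_evaluation.py or roberta_evaluation.py; all function names are illustrative.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(predicted, gold):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(gold))


if __name__ == "__main__":
    predicted = [0.15, 0.80, 0.55, 0.95]  # e.g. cosine similarities
    gold = [1.0, 3.5, 4.0, 5.0]           # e.g. human similarity scores
    print(spearman(predicted, gold))
```

Because only ranks matter, the metric is insensitive to the scale of the cosine similarities, so checkpoints can be compared directly on the development set.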

Feel free to contact the authors at [email protected] for any questions.

Please cite SNCSE as:

Hao Wang, Yangguang Li, Zhen Huang, Yong Dou, Lingpeng Kong, Jing Shao. SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples. CoRR, abs/2201.05979, 2022.


Owner: Sense-GVT
