SentiBERT

Code for SentiBERT: A Transferable Transformer-Based Architecture for Compositional Sentiment Semantics (ACL'2020). https://arxiv.org/abs/2005.04114

Model Architecture

Requirements

Environment

* Python == 3.6.10
* Pytorch == 1.1.0
* CUDA == 9.0.176
* NVIDIA GeForce GTX 1080 Ti
* HuggingFaces Pytorch (also known as pytorch-pretrained-bert & transformers)
* Stanford CoreNLP (stanford-corenlp-full-2018-10-05)
* Numpy, Pickle, Tqdm, Scipy, etc. (See requirements.txt)

Datasets

Datasets include:

* SST-phrase
* SST-5 (almost the same with SST-phrase)
* SST-3 (almost the same with SST-phrase)
* SST-2
* Twitter Sentiment Analysis (SemEval 2017 Task 4)
* EmoContext (SemEval 2019 Task 3)
* EmoInt (Joy, Fear, Sad, Anger) (SemEval 2018 Task 1c)

Note that there are no individual datasets for SST-5. When evaluating SST-phrase, the results for SST-5 should also appear.

File Architecture (Selected important files)

-- /examples/run_classifier_new.py                                  ---> start to train
-- /examples/run_classifier_dataset_utils_new.py                    ---> input preprocessed files to SentiBERT
-- /pytorch-pretrained-bert/modeling_new.py                         ---> detailed model architecture
-- /examples/lm_finetuning/pregenerate_training_data_sstphrase.py   ---> generate pretrained epochs
-- /examples/lm_finetuning/finetune_on_pregenerated_sstphrase.py    ---> pretrain on generated epochs
-- /preprocessing/xxx_st.py                                         ---> preprocess raw text and constituency tree
-- /datasets                                                        ---> datasets
-- /transformers (under construction)                               ---> RoBERTa part

Get Started

Preparing Environment

conda create -n sentibert python=3.6.10
conda activate sentibert

conda install pytorch==1.1.0 torchvision==0.3.0 cudatoolkit=9.0 -c pytorch

cd SentiBERT/

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip

export PYTHONPATH=$PYTHONPATH:XX/SentiBERT/pytorch_pretrained_bert
export PYTHONPATH=$PYTHONPATH:XX/SentiBERT/
export PYTHONPATH=$PYTHONPATH:XX/

Preprocessing

Split the raw text and golden labels of sentiment/emotion datasets into xxx_train\dev\test.txt and xxx_train\dev\test_label.npy, assuming that xxx represents task name.
Obtain tree information. There are totally three situtations.

For tasks except SST-phrase, SST-2,3,5, put the files into xxx_train\test.txt files into /stanford-corenlp-full-2018-10-05/. To get binary sentiment constituency trees, please run

cd /stanford-corenlp-full-2018-10-05
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,parse,sentiment -file xxx_train\test.txt -outputFormat json -ssplit.eolonly true -tokenize.whitespace true

The tree information will be stored in /stanford-corenlp-full-2018-10-05/xxx_train\test.txt.json.

For SST-2, please use

cd /stanford-corenlp-full-2018-10-05
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,parse,sentiment -file sst2_train\dev_text.txt -outputFormat json -ssplit.eolonly true

The tree information will be stored in /stanford-corenlp-full-2018-10-05/sst2_train\dev_text.txt.json.

For SST-phrase and SST-3,5, the tree information was already stored in sstphrase_train\test.txt.

Run /datasets/xxx/xxx_st.py to clean, and store the text and label information in xxx_train\dev\test_text_new.txt and xxx_label_train\dev\test.npy. It also transforms the tree structure into matrices /datasets/xxx/xxx_train\dev\test_span.npy and /datasets/xxx/xxx_train\dev\test_span_3.npy. The first matrix is used as the range of constituencies in the first layer of our attention mechanism. The second matrix is used as the indices of each constituency's children nodes or subwords and itself in the second layer. Specifically, for tasks other than EmoInt, SST-phrase, SST-5 and SST-3, the command is like below:

cd /preprocessing

python xxx_st.py \
        --data_dir /datasets/xxx/ \                         ---> the location where you want to store preprocessed text, label and tree information 
        --tree_dir /stanford-corenlp-full-2018-10-05/ \     ---> the location of unpreprocessed tree information (usually in Stanford CoreNLP repo)
        --stage train \                                     ---> "train", "test" or "dev"

For EmoInt, the command is shown below:

cd /preprocessing

python xxx_st.py \
        --data_dir /datasets/xxx/ \                         ---> the location where you want to store preprocessed text, label and tree information 
        --tree_dir /stanford-corenlp-full-2018-10-05/ \     ---> the location of unpreprocessed tree information (usually in Stanford CoreNLP repo)
        --stage train \                                     ---> "train" or "test"
        --domain joy                                        ---> "joy", "sad", "fear" or "anger". Used in EmoInt task

For SST-phrase, SST-5 and SST-3, since they already have tree information in sstphrase_train\test.txt. In this case, tree_dir should be /datasets/sstphrase/ or /datasets/sst-3/. The command is shown below:

cd /preprocessing

python xxx_st.py \
        --data_dir /datasets/xxx/ \                         ---> the location where you want to store preprocessed text, label and tree information 
        --tree_dir /datasets/xxx/ \                         ---> the location of unpreprocessed tree information    
        --stage train \                                     ---> "train" or "test"

Pretraining

Generate epochs for preparation

cd /examples/lm_finetuning

python3 pregenerate_training_data_sstphrase.py \
        --train_corpus /datasets/sstphrase/sstphrase_train_text_new.txt \
        --data_dir /datasets/sstphrase/ \
        --bert_model bert-base-uncased \
        --do_lower_case \
        --output_dir /training_sstphrase \
        --epochs_to_generate 3 \
        --max_seq_len 128 \

Pretrain the generated epochs

CUDA_VISIBLE_DEVICES=7 python3 finetune_on_pregenerated_sstphrase.py \
        --pregenerated_data /training_sstphrase \
        --bert_model bert-base-uncased \
        --do_lower_case \
        --output_dir /results/sstphrase_pretrain \
        --epochs 3

The pre-trained parameters were released here. [Google Drive]

Fine-tuning

Run run_classifier_new.py directly as follows:

cd /examples

CUDA_VISIBLE_DEVICES=7 python run_classifier_new.py \
  --task_name xxx \                              ---> task name
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir /datasets/xxx \                     ---> the same name as task_name
  --pretrain_dir /results/sstphrase_pretrain \   ---> the location of pre-trained parameters
  --bert_model bert-base-uncased \
  --max_seq_length 128 \
  --train_batch_size xxx \
  --learning_rate xxx \
  --num_train_epochs xxx \                                                          
  --domain xxx \                                 ---> "joy", "sad", "fear" or "anger". Used in EmoInt task
  --output_dir /results/xxx \                    ---> the same name as task_name
  --seed xxx \
  --para xxx                                     ---> "sentibert" or "bert": pretrained SentiBERT or BERT

Checkpoints

For reproducity and usability, we provide checkpoints and the original training settings to help you reproduce: Link of overall result folder: [Google Drive]

SST-phrase [Google Drive]
SST-5 [Google Drive]
SST-2 [Google Drive]
SST-3 [Google Drive]
EmoContext [Google Drive]
EmoInt:
- Joy [Google Drive]
- Fear [Google Drive]
- Sad [Google Drive]
- Anger [Google Drive]
Twitter Sentiment Analysis [Google Drive]

The implementation details and results are shown below:

Note: 1) BERT denotes BERT w/ Mean pooling. 2) The results of subtasks in EmoInt is (Joy: 68.90, 65.18, 4 epochs), (Anger: 68.17, 66.73, 4 epochs), (Sad: 66.25, 63.08, 5 epochs), (Fear: 65.49, 64.79, 5 epochs), respectively.

Models	Batch Size	Learning Rate	Epochs	Seed	Results
SST-phrase
SentiBERT	32	2e-5	5	30	68.98
BERT*	32	2e-5	5	30	65.22
SST-5
SentiBERT	32	2e-5	5	30	56.04
BERT*	32	2e-5	5	30	50.23
SST-2
SentiBERT	32	2e-5	1	30	93.25
BERT	32	2e-5	1	30	92.08
SST-3
SentiBERT	32	2e-5	5	77	77.34
BERT*	32	2e-5	5	77	73.35
EmoContext
SentiBERT	32	2e-5	1	0	74.47
BERT	32	2e-5	1	0	73.64
EmoInt
SentiBERT	16	2e-5	4 or 5	77	67.20
BERT	16	2e-5	4 or 5	77	64.95
Twitter
SentiBERT	32	6e-5	1	45	70.2
BERT	32	6e-5	1	45	69.7

Analysis

Here we provide analysis implementation in our paper. We will focus on the evaluation of

local difficulty
global difficulty
negation
contrastive relation

In preprocessing part, we provide implementation to extract related information in the test set of SST-phrase and store them in

-- /datasets/sstphrase/swap_test_new.npy                   ---> global difficulty
-- /datasets/sstphrase/edge_swap_test_new.npy              ---> local difficulty
-- /datasets/sstphrase/neg_new.npy                         ---> negation
-- /datasets/sstphrase/but_new.npy                         ---> contrastive relation

In simple_accuracy_phrase(), we will provide statistical details and evaluate for each metric.

Some of the analysis results based on our provided checkpoints are selected and shown below:

Models	Results
Local Difficulty
SentiBERT	[85.39, 60.80, 49.40]
BERT*	[83.00, 55.54, 31.97]
Negation
SentiBERT	[78.45, 76.25, 70.56]
BERT*	[75.04, 71.40, 68.77]
Contrastive Relation
SentiBERT	39.87
BERT*	28.48

Acknowledgement

Here we would like to thank for BERT/RoBERTa implementation of HuggingFace and sentiment tree parser of Stanford CoreNLP. Also, thanks for the dataset release of SemEval. To confirm the privacy rule of SemEval task organizer, we only choose the publicable datasets of each task.

Citation

Please cite our ACL paper if this repository inspired your work.

@inproceedings{yin2020sentibert,
  author    = {Yin, Da and Meng, Tao and Chang, Kai-Wei},
  title     = {{SentiBERT}: A Transferable Transformer-Based Architecture for Compositional Sentiment Semantics},
  booktitle = {Proceedings of the 58th Conference of the Association for Computational Linguistics, {ACL} 2020, Seattle, USA},
  year      = {2020},
}

Contact

Due to the difference of environment, the results will be a bit different. If you have any questions regarding the code, please create an issue or contact the owner of this repository.

Code for SentiBERT: A Transferable Transformer-Based Architecture for Compositional Sentiment Semantics (ACL'2020).

Related tags

Overview

SentiBERT

Model Architecture

Requirements

Environment

Datasets

File Architecture (Selected important files)

Get Started

Preparing Environment

Preprocessing

Pretraining

Fine-tuning

Checkpoints

Analysis

Acknowledgement

Citation

Contact

Owner

Da Yin

Proximal Backpropagation - a neural network training algorithm that takes implicit instead of explicit gradient steps

Pythonic particle-based (super-droplet) warm-rain/aqueous-chemistry cloud microphysics package with box, parcel & 1D/2D prescribed-flow examples in Python, Julia and Matlab

A robust camera and Lidar fusion based velocity estimator to undistort the pointcloud.

G-NIA model from "Single Node Injection Attack against Graph Neural Networks" (CIKM 2021)

This repository is for our EMNLP 2021 paper "Automated Generation of Accurate & Fluent Medical X-ray Reports"

Unbiased Learning To Rank Algorithms (ULTRA)

OpenPose: Real-time multi-person keypoint detection library for body, face, hands, and foot estimation

BisQue is a web-based platform designed to provide researchers with organizational and quantitative analysis tools for 5D image data. Users can extend BisQue by implementing containerized ML workflows.

A python library for time-series smoothing and outlier detection in a vectorized way.

A smaller subset of 10 easily classified classes from Imagenet, and a little more French

Genetic Algorithm, Particle Swarm Optimization, Simulated Annealing, Ant Colony Optimization Algorithm,Immune Algorithm, Artificial Fish Swarm Algorithm, Differential Evolution and TSP(Traveling salesman)

A free, multiplatform SDK for real-time facial motion capture using blendshapes, and rigid head pose in 3D space from any RGB camera, photo, or video.

Spatiotemporal resampling methods for mlr3

Pytorch implementation of the paper "Class-Balanced Loss Based on Effective Number of Samples"

using yolox+deepsort for object-tracker

Malware Env for OpenAI Gym

Neural Turing Machines (NTM) - PyTorch Implementation

Deep learning image registration library for PyTorch

Vehicle speed detection with python

Build tensorflow keras model pipelines in a single line of code. Created by Ram Seshadri. Collaborators welcome. Permission granted upon request.