Abstractive opinion summarization system (SelSum) and the largest dataset of Amazon product summaries (AmaSum). EMNLP 2021 conference paper.

Overview

Learning Opinion Summarizers by Selecting Informative Reviews

This repository contains the codebase and the dataset for the corresponding EMNLP 2021 paper. Please star the repository and cite the paper if you find it useful.

SelSum is a probabilistic (latent) model that selects informative reviews from large collections and subsequently summarizes them as shown in the diagram below.

AmaSum is the largest abstractive opinion summarization dataset, consisting of more than 33,000 human-written summaries for Amazon products. Each summary is paired, on average, with more than 320 customer reviews. Summaries consist of verdicts, pros, and cons, see the example below.

Verdict: The Olympus Evolt E-500 is a compact, easy-to-use digital SLR camera with a broad feature set for its class and very nice photo quality overall.

Pros:

  • Compact design
  • Strong autofocus performance even in low-light situations
  • Intuitive and easy-to-navigate menu system
  • Wide range of automated and manual features to appeal to both serious hobbyists and curious SLR newcomers

Cons:

  • Unreliable automatic white balance in some conditions
  • Slow start-up time when dust reduction is enabled
  • Compatible Zuiko lenses don't indicate focal distance

1. Setting up

1.1. Environment

The easiest way to proceed is to create a separate conda environment with Python 3.7.0.

conda create -n selsum python=3.7.0

Further, install PyTorch as shown below.

conda install -c pytorch pytorch=1.7.0

In addition, install the essential python modules:

pip install -r requirements.txt

The codebase relies on FairSeq. To avoid version conflicts, please download our version and store it to ../fairseq_lib. Please follow the installation instructions in the unzipped directory.

1.2. Environmental variables

Before running scripts, please add the environmental variables below.

export PYTHONPATH=../fairseq_lib/.:$PYTHONPATH
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MKL_THREADING_LAYER=GNU

1.3. Data

The dataset in various formats is available in the dataset folder. To run the model, please binarize the fairseq specific version.

1.4. Checkpoints

We also provide the checkpoints of the trained models. These should be allocated to artifacts/checkpoints.

2. Training

2.1. Posterior and Summarizer training

First, the posterior and summarizer need to be trained. The summarizer is initialized using the BART base model, please download the checkpoint and store it to artifacts/bart. Note: please adjust hyper-parameters and paths in the script if needed.

bash selsum/scripts/training/train_selsum.sh

Please note that REINFORCE-based loss for the posterior training can be negative as the forward pass does not correspond to the actual loss function. Instead, the loss is re-formulated to compute gradients in the backward pass (Eq. 5 in the paper).

2.2. Selecting reviews with the Posterior

Once the posterior is trained (jointly with the summarizer), informative reviews need to be selected. The script below produces binary tags indicating selected reviews.

python selsum/scripts/inference/posterior_select_revs.py --data-path=../data/form  \
--checkpoint-path=artifacts/checkpoints/selsum.pt \
--bart-dir=artifacts/bart \
--output-folder-path=artifacts/output/q_sel \
--split=test \
--ndocs=10 \
--batch-size=30

The output can be downloaded and stored to artifacts/output/q_sel.

2.3. Fitting the Prior

Once tags are produced by the posterior, we can fit the prior to approximate it.

bash selsum/scripts/training/train_prior.sh

2.4. Selecting Reviews with the Prior

After the prior is trained, we select informative reviews for downstream summarization.

python selsum/scripts/inference/prior_select_revs.py --data-path=../data/form \
--checkpoint-path=artifacts/checkpoints/prior.pt \
--bart-dir=artifacts/bart \
--output-folder-path=artifacts/output/p_sel \
--split=test \
--ndocs=10 \
--batch-size=10

The output can be downloaded and stored to artifacts/output/p_sel.

3. Inference

3.1. Summary generation

To generate summaries, run the command below:

python selsum/scripts/inference/gen_summs.py --data-path=artifacts/output/p_sel/ \
--bart-dir=artifacts/bart \
--checkpoint-path=artifacts/checkpoints/selsum.pt \
--output-folder-path=artifacts/output/p_summs \
--split=test \
--batch-size=20

The model outputs are also available at artifacts/summs.

3.2. Evaluation

For evaluation, we used a wrapper over ROUGE and the CoreNLP tokenizer.

The tokenizer requires the CoreNLP library to be downloaded. Please unzip it to the artifacts/misc folder. Further, make it visible in the classpath as shown below.

export CLASSPATH=artifacts/misc/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar

After the installations, please adjust the paths and use the commands below.

GEN_FILE_PATH=artifacts/summs/test.verd
GOLD_FILE_PATH=../data/form/eval/test.verd

# tokenization
cat "${GEN_FILE_PATH}" | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > "${GEN_FILE_PATH}.tokenized"
cat "${GOLD_FILE_PATH}" | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > "${GOLD_FILE_PATH}.tokenized"

# rouge evaluation
files2rouge "${GOLD_FILE_PATH}.tokenized" "${GEN_FILE_PATH}.tokenized"

Citation

@inproceedings{bražinskas2021learning,
      title={Learning Opinion Summarizers by Selecting Informative Reviews}, 
      author={Arthur Bražinskas and Mirella Lapata and Ivan Titov},
      booktitle={Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
      year={2021},
}

License

Codebase: MIT

Dataset: non-commercial

Notes

  • Occasionally logging stops being printed while the model is training. In this case, the log can be displayed either with a gap or only at the end of the epoch.
  • SelSum is trained with a single data worker process because otherwise cross-parallel errors are encountered.
Owner
Arthur Bražinskas
PhD in NLP at the University of Edinburgh, UK. I work on abstractive opinion summarization.
Arthur Bražinskas
Unsupervised Real-World Super-Resolution: A Domain Adaptation Perspective

Unofficial pytorch implementation of the paper "Unsupervised Real-World Super-Resolution: A Domain Adaptation Perspective"

16 Nov 21, 2022
Mercer Gaussian Process (MGP) and Fourier Gaussian Process (FGP) Regression

Mercer Gaussian Process (MGP) and Fourier Gaussian Process (FGP) Regression We provide the code used in our paper "How Good are Low-Rank Approximation

Aristeidis (Ares) Panos 0 Dec 13, 2021
level1-image-classification-level1-recsys-09 created by GitHub Classroom

level1-image-classification-level1-recsys-09 ❗ 주제 설명 COVID-19 Pandemic 상황 속 마스크 착용 유무 판단 시스템 구축 마스크 착용 여부, 성별, 나이 총 세가지 기준에 따라 총 18개의 class로 구분하는 모델 ?

6 Mar 17, 2022
SCAN: Learning to Classify Images without Labels, incl. SimCLR. [ECCV 2020]

Learning to Classify Images without Labels This repo contains the Pytorch implementation of our paper: SCAN: Learning to Classify Images without Label

Wouter Van Gansbeke 1.1k Dec 30, 2022
Pytorch Implementation of Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Pytorch Implementation of Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [Paper] [Colab is coming soon] Approach Example Usage To r

170 Jan 03, 2023
Attention-guided gan for synthesizing IR images

SI-AGAN Attention-guided gan for synthesizing IR images This repository contains the Tensorflow code for "Pedestrian Gender Recognition by Style Trans

1 Oct 25, 2021
LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021

LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021 We propose a cross encoder model (LTR_CrossEncoder) for information retrieval, re-retrie

Xuan Hieu Duong 7 Jan 12, 2022
A New Approach to Overgenerating and Scoring Abstractive Summaries

We provide the source code for the paper "A New Approach to Overgenerating and Scoring Abstractive Summaries" accepted at NAACL'21. If you find the code useful, please cite the following paper.

Kaiqiang Song 4 Apr 03, 2022
This is code of book "Learn Deep Learning with PyTorch"

深度学习入门之PyTorch Learn Deep Learning with PyTorch 非常感谢您能够购买此书,这个github repository包含有深度学习入门之PyTorch的实例代码。由于本人水平有限,在写此书的时候参考了一些网上的资料,在这里对他们表示敬意。由于深度学习的技术在

Xingyu Liao 2.5k Jan 04, 2023
Self-driving car env with PPO algorithm from stable baseline3

Self-driving car with RL stable baseline3 Most of the project develop from https://github.com/GerardMaggiolino/Gym-Medium-Post Please check it out! Th

Sornsiri.P 7 Dec 22, 2022
Sync2Gen Code for ICCV 2021 paper: Scene Synthesis via Uncertainty-Driven Attribute Synchronization

Sync2Gen Code for ICCV 2021 paper: Scene Synthesis via Uncertainty-Driven Attribute Synchronization 0. Environment Environment: python 3.6 and cuda 10

Haitao Yang 62 Dec 30, 2022
A benchmark dataset for mesh multi-label-classification based on cube engravings introduced in MeshCNN

Double Cube Engravings This script creates a dataset for multi-label mesh clasification, with an intentionally difficult setup for point cloud classif

Yotam Erel 1 Nov 30, 2021
Catch-all collection of generative art made using processing

Generative art with Processing.py Some art I have created for fun. Dependencies Processing for Python, see how to download/use here Packages contained

2 Mar 12, 2022
PyTorch GPU implementation of the ES-RNN model for time series forecasting

Fast ES-RNN: A GPU Implementation of the ES-RNN Algorithm A GPU-enabled version of the hybrid ES-RNN model by Slawek et al that won the M4 time-series

Kaung 305 Jan 03, 2023
Simple reimplemetation experiments about FcaNet

FcaNet-CIFAR An implementation of the paper FcaNet: Frequency Channel Attention Networks on CIFAR10/CIFAR100 dataset. how to run Code: python Cifar.py

76 Feb 04, 2021
Very deep VAEs in JAX/Flax

Very Deep VAEs in JAX/Flax Implementation of the experiments in the paper Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on I

Jamie Townsend 42 Dec 12, 2022
Attention-based CNN-LSTM and XGBoost hybrid model for stock prediction

Attention-based CNN-LSTM and XGBoost hybrid model for stock prediction Requirements The code has been tested running under Python 3.7.4, with the foll

zshicode 84 Jan 01, 2023
Human Dynamics from Monocular Video with Dynamic Camera Movements

Human Dynamics from Monocular Video with Dynamic Camera Movements Ri Yu, Hwangpil Park and Jehee Lee Seoul National University ACM Transactions on Gra

215 Jan 01, 2023
This repo includes the supplementary of our paper "CEMENT: Incomplete Multi-View Weak-Label Learning with Long-Tailed Labels"

Supplementary Materials for CEMENT: Incomplete Multi-View Weak-Label Learning with Long-Tailed Labels This repository includes all supplementary mater

Zhiwei Li 0 Jan 05, 2022
ADB-IP-ROTATION - Use your mobile phone to gain a temporary IP address using ADB and data tethering

ADB IP ROTATE This an Python script based on Android Debug Bridge (adb) shell sc

Dor Bismuth 2 Jul 12, 2022