Improving Factual Consistency of Abstractive Text Summarization

Overview

We provide the code for the papers:

  1. "Entity-level Factual Consistency of Abstractive Text Summarization", EACL 2021.
    • We provide a set of new metrics to quantify the entity-level factual consistency of generated summaries. We also provide code for the two methods in our paper:
      • JAENS: joint entity and summary generation, and
      • Summary-worthy entity classification with summarization (multi-task learning)
  2. "Improving Factual Consistency of Abstractive Summarization via Question Answering", ACL-IJCNLP 2021
    • QUALS, a new automatic metric for factual consistency.
    • CONSEQ, a new contrastive learning algorithm for Seq2seq models to optimize sequence level objectives such as QUALS.

Our code is based on the fairseq library, with added support for model training on SageMaker.

Requirements and setup

  • python==3.6: conda create -n entity_fact python=3.6
  • pytorch==1.4.0: pip install torch==1.4.0 torchvision==0.5.0
  • run pip install --editable ./
  • install file2rouge following instructions here
  • download en_core_web_lg: python -m spacy download en_core_web_lg

Entity-level Factual Consistency of Abstractive Text Summarization

Data preprocessing:

We provide three options to preprocess summarization data through the filter_level option.

  • filter_level=0: no special processing
  • filter_level=1: remove corruption text from source articles and summaries, i.e. undesirable text introduced by imperfect data collection (e.g. "Share this with Email, Facebook, Messenger") and undesirable summaries (e.g. "Collection of all USATODAY.com coverage of People, including articles, videos, photos, and quotes.").
  • filter_level=2: entity hallucination filtering in addition to corruption text removal. A summary sentence is removed if it contains a named entity not in the source document.
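
The entity hallucination filter at filter_level=2 can be sketched as follows. This is a minimal illustration, not the repository's implementation: the real pipeline relies on a spaCy NER model (en_core_web_lg), whereas toy_entities below is a hypothetical stand-in that treats capitalized words as entities so that the example runs without a model download. The function names and the case-insensitive substring matching rule are also assumptions.

```python
import re
from typing import Callable, List


def toy_entities(sentence: str) -> List[str]:
    """Hypothetical stand-in for NER: capitalized words, skipping the first word."""
    words = re.findall(r"[A-Za-z]+", sentence)
    return [w for w in words[1:] if w[0].isupper()]


def filter_hallucinated_sentences(
    source: str,
    summary_sentences: List[str],
    extract_entities: Callable[[str], List[str]] = toy_entities,
) -> List[str]:
    """Keep only the summary sentences whose entities all appear in the source."""
    source_lower = source.lower()
    kept = []
    for sentence in summary_sentences:
        entities = extract_entities(sentence)
        # A sentence is dropped if any of its entities is missing from the source.
        if all(entity.lower() in source_lower for entity in entities):
            kept.append(sentence)
    return kept


source = "The mayor of Paris opened a new bridge on Monday."
summary = ["A bridge was opened in Paris.", "The event was attended by Obama."]
filtered = filter_hallucinated_sentences(source, summary)
print(filtered)
```

To use a real NER model, pass a callable that wraps it as extract_entities; the filtering logic itself does not change.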

XSUM:

  1. Follow the instructions here to download and extract text from HTML files and establish the xsum-extracts-from-downloads directory.
  2. Let <input-dir> be the directory that contains the xsum-extracts-from-downloads directory and XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json.
  3. Run python preprocess/data_prepro_clean.py --mode preprocess_xsum --input_dir <input-dir> --output_dir <input-dir>/processed-data --filter_level 0 with filter_level set to 0, 1, or 2.

CNNDM:

  1. Download and unzip the stories directories from here for both CNN and Daily Mail. Put all .story files in a directory <input-dir>/raw_stories.
  2. Download the url files mapping_train.txt, mapping_test.txt and mapping_valid.txt from here to <input-dir>.
  3. Run python preprocess/data_prepro_clean.py --mode preprocess_cnndm --input_dir <input-dir> --output_dir <input-dir>/processed-data --filter_level 0 with filter_level set to 0, 1, or 2.

NEWSROOM:

  1. Download the datasets following instructions from here.
  2. Run python preprocess/data_prepro_clean.py --mode preprocess_newsroom --input_dir <input-dir> --output_dir <input-dir>/processed-data --filter_level 0 with filter_level set to 0, 1, or 2.

Tokenize and binarize the data:

Download the BPE encoder.json, vocabulary and fairseq dictionary to a directory, say <bpe-dir>; then tokenize and binarize the data.

wget -O <bpe-dir>/encoder.json 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -O <bpe-dir>/vocab.bpe 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'
wget -O <bpe-dir>/dict.txt 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'

cd preprocess
python data_prepro_clean.py --mode bpe_binarize --input_dir <processed-data-dir> --tokenizer_dir <bpe-dir>

This generates the binary input files as well as the dictionaries under <processed-data-dir>/data_bin for fairseq.

JAENS: Joint Entity and Summary Generation

The idea is to train the seq2seq model to generate the summary-worthy named entities first, followed by the summary.
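
As a rough sketch of what an entity-augmented target looks like, the snippet below prefixes a summary with its named entities and shows how that prefix can be stripped again from a generated hypothesis. ENTITY_SEP, the comma-joined entity list and both function names are hypothetical; the separator and format produced by create_entity_classification_labels.py may differ.

```python
from typing import List

# Hypothetical separator; the token actually used by the repository may differ.
ENTITY_SEP = " <ent_sep> "


def build_augmented_target(entities: List[str], summary: str) -> str:
    """Prefix the summary with its named entities so the decoder emits them first."""
    return ", ".join(entities) + ENTITY_SEP + summary


def strip_entities(hypothesis: str) -> str:
    """Recover the plain summary from an entity-augmented hypothesis."""
    _, sep, summary = hypothesis.partition(ENTITY_SEP)
    # If the separator is missing, fall back to the unchanged hypothesis.
    return summary if sep else hypothesis


target = build_augmented_target(["Paris", "Monday"], "Paris opened a new bridge on Monday.")
print(target)
print(strip_entities(target))
```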

1. Prepare the train/val data by generating entity-augmented targets:

python preprocess/create_entity_classification_labels.py --base_dir <processed-data-dir> --type entity_augment --tokenizer_dir <bpe-dir>

Binarize the augmented targets:

cd preprocess
python data_prepro_clean.py --mode binarize --input_dir <processed-data-dir>/entity_augment --tokenizer_dir <bpe-dir>

Since we already binarized the source documents, we just need to create symbolic links to put all binary input files together for fairseq training:

ln -s <processed-data-dir>/data_bin/train.source-target.source.idx <processed-data-dir>/entity_augment/data_bin/train.source-target.source.idx
ln -s <processed-data-dir>/data_bin/train.source-target.source.bin <processed-data-dir>/entity_augment/data_bin/train.source-target.source.bin
ln -s <processed-data-dir>/data_bin/valid.source-target.source.bin <processed-data-dir>/entity_augment/data_bin/valid.source-target.source.bin
ln -s <processed-data-dir>/data_bin/valid.source-target.source.idx <processed-data-dir>/entity_augment/data_bin/valid.source-target.source.idx
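
The four ln -s commands above can also be scripted. The following is a sketch only: link_source_binaries is a hypothetical helper that is not part of the repository, and it assumes the directory layout shown in the commands.

```python
import os
from pathlib import Path
from typing import List


def link_source_binaries(processed_dir: str) -> List[str]:
    """Symlink the binarized source files into entity_augment/data_bin."""
    src_dir = Path(processed_dir) / "data_bin"
    dst_dir = Path(processed_dir) / "entity_augment" / "data_bin"
    dst_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for split in ("train", "valid"):
        for ext in ("idx", "bin"):
            name = f"{split}.source-target.source.{ext}"
            link = dst_dir / name
            # Skip existing links so the function can be re-run safely.
            if not link.exists() and not link.is_symlink():
                os.symlink((src_dir / name).resolve(), link)
                created.append(name)
    return created


# Usage (replace the argument with your processed data directory):
# link_source_binaries("<processed-data-dir>")
```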

2. Fine-tune the BART-large model on the generated data:

Run the launch scripts scripts/launch_xsum.py, scripts/launch_cnndm.py or scripts/launch_newsroom.py to fine-tune the BART-large model. Note you need to modify the following in the scripts:

  • hyperparameters.
  • train_path: location of the binary input files, e.g. <processed-data-dir>/entity_augment/data_bin.
  • init_path: location of the pre-trained BART-large model checkpoint. Please rename the checkpoint to pretrained_model.pt
  • output_path: location for the model outputs.

If training locally, you need to specify ngpus, the number of GPUs in the local machine. Example command:

python scripts/launch_xsum.py --datatype ner_filtered --epoch 8 --exp_type local

If training on SageMaker, you need to specify the Docker image name (image_name) as well as the execution role (role). To create the SageMaker Docker container and push it to ECR:

./build_and_push.sh <image_name>

To launch training job:

python scripts/launch_xsum.py --datatype ner_filtered --epoch 8 --exp_type sagemaker

3. Generate the summaries from the fine-tuned models:

preprocess/multi_gpu_generate.py is used to generate summaries.

Since the JAENS models generate the named entities before the summaries, we need to remove the named entities before evaluating the summaries. Example command:

python evaluate_hypo.py --mode remove_ent_from_hypo --base_dir <output-dir> --sub_dir <output-sub-dir> --split val --pattern .*.hypo

4. To evaluate the generated summaries for ROUGE as well as entity level factual scores:

We use the tokenizer from the Stanford CoreNLP package. Example command:

export CLASSPATH=path/to/stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2.jar
python evaluate_hypo.py --mode evaluate_summary --base_dir <output-dir> --sub_dir <output-sub-dir> --split val --pattern .*.hypo

See preprocess/run_*_eval.sh for examples.
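
To illustrate what an entity-level factual score measures, the sketch below computes the fraction of named entities in a generated summary that also occur in the source document, together with entity precision and recall against the reference summary. It is a simplified example under stated assumptions: entities are passed in as already-extracted strings, matching is exact after lowercasing, and the function and key names are hypothetical. The definitions actually used by evaluate_hypo.py are those in the EACL paper and the code.

```python
from typing import Dict, List, Optional


def overlap_ratio(candidates: List[str], pool: List[str]) -> Optional[float]:
    """Fraction of candidate entities found in pool; None if there are no candidates."""
    if not candidates:
        return None  # assumption: the score is undefined without entities
    pool_set = {p.lower() for p in pool}
    return sum(c.lower() in pool_set for c in candidates) / len(candidates)


def entity_scores(
    hyp_ents: List[str], src_ents: List[str], ref_ents: List[str]
) -> Dict[str, Optional[float]]:
    """Entity-level precision w.r.t. source and reference, and recall w.r.t. reference."""
    return {
        "precision_source": overlap_ratio(hyp_ents, src_ents),
        "precision_target": overlap_ratio(hyp_ents, ref_ents),
        "recall_target": overlap_ratio(ref_ents, hyp_ents),
    }


scores = entity_scores(
    hyp_ents=["Paris", "Obama"],
    src_ents=["Paris", "Monday"],
    ref_ents=["Paris"],
)
print(scores)
```

A low precision_source indicates that the summary mentions entities that never appear in the source, which is the kind of hallucination the filter_level=2 preprocessing targets.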

Summary-worthy entity classification with summarization (multi-task learning)

We perform summary-worthy entity classification with a classification head on the encoder while keeping the seq2seq objective at the decoder. For training, we need to preprocess the input documents by creating B-I-O labels that identify summary-worthy entities:

python create_entity_classification_labels.py --base_dir <processed-data-dir> --type cls_labels --tokenizer_dir <bpe-dir>
python data_prepro_clean.py --mode binarize_cls_labels --input_dir <processed-data-dir> --output_dir <processed-data-dir>/data_bin --tokenizer_dir <bpe-dir>

Launch training jobs using scripts scripts/launch_multitask_*.py.
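
The B-I-O labeling used for the classification head can be sketched at the word level as follows. This is an illustrative assumption rather than the repository's code: the actual labels are created over BPE tokens by create_entity_classification_labels.py, and bio_labels is a hypothetical helper.

```python
from typing import List, Sequence


def bio_labels(tokens: List[str], worthy_entities: Sequence[Sequence[str]]) -> List[str]:
    """Tag each token as B (entity start), I (inside entity) or O (outside)."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for entity in worthy_entities:
        ent = [w.lower() for w in entity]
        n = len(ent)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            span_free = all(label == "O" for label in labels[start:start + n])
            # Label the span only if it matches and is not already labeled.
            if lowered[start:start + n] == ent and span_free:
                labels[start] = "B"
                for i in range(start + 1, start + n):
                    labels[i] = "I"
    return labels


tokens = "Barack Obama visited Paris on Monday".split()
labels = bio_labels(tokens, [["Barack", "Obama"], ["Paris"]])
print(list(zip(tokens, labels)))
```

Here "Monday" stays O because it is not in the list of summary-worthy entities, even though an NER model would recognize it as an entity.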

Improving Factual Consistency of Abstractive Summarization via Question Answering

QUALS (QUestion Answering with Language model score for Summarization)

To evaluate the QUALS of summaries (e.g. test.target) given original input (e.g. test.source), we execute the following steps in the preprocess sub-directory.

0. Prepare summaries into jsonl format

python evaluate_hypo.py --mode convert_hypo_to_json --base_dir <base-dir> --sub_dir <sub-dir> --split test --pattern .target

1. Generating question and answer pairs from summaries

python sm_inference_asum.py --task gen_qa --base_dir <base-dir> --source_dir <source-dir> --output_dir <output-dir> --num_workers <num-workers> --bsz 5 --beam 60 --max_len 60 --min_len 8 --checkpoint_dir <checkpoint-dir> --ckp_file checkpoint2.pt --bin_dir <data-dir>/data_bin --diverse_beam_groups 60 --diverse_beam_strength 0.5 --batch_lines True --input_file test.target.hypo --return_token_scores True

Here, we use diverse beam search to generate 60 question-answer pairs for each summary. The batch_lines option is set to True to batch bsz input summaries together for efficient generation. The QAGen model is trained by fine-tuning BART on the SQuAD and NewsQA datasets by concatenating the question-answer pairs using a separator.
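
The concatenation of question-answer pairs with a separator can be pictured with the short sketch below. QA_SEP and both helper names are hypothetical; the separator token actually used to train the QAGen model is defined in the preprocessing code and may differ.

```python
from typing import Optional, Tuple

# Hypothetical separator between question and answer.
QA_SEP = "<qa_sep>"


def format_qa(question: str, answer: str) -> str:
    """Build one training target: question, separator, answer."""
    return f"{question.strip()} {QA_SEP} {answer.strip()}"


def parse_qa(generated: str) -> Optional[Tuple[str, str]]:
    """Split a generated sequence into (question, answer); None if malformed."""
    question, sep, answer = generated.partition(QA_SEP)
    question, answer = question.strip(), answer.strip()
    if not sep or not question or not answer:
        return None
    return question, answer


target = format_qa("Who opened the bridge?", "the mayor of Paris")
print(target)
print(parse_qa(target))
```

Returning None for malformed generations makes it easy to discard them before the filtering step.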

To train the QAGen model, place dev-v1.1.json and train-v1.1.json from SQuAD and combined-newsqa-data-v1.json from NewsQA under <input-dir>. The following command generates the binarized input for fine-tuning BART using fairseq.

python data_prepro_clean.py --mode newsqa_squad_prepro --input_dir <input-dir> --output_dir <output-dir>

You can also download our trained QAGen model from S3 by running:

aws s3 cp s3://fact-check-summarization/newsqa-squad-qagen-checkpoint/checkpoint2.pt <checkpoint-dir>/

Alternatively, you can download here if you don't have awscli.

2. Filter the generated question and answer for high quality pairs

python evaluate_hypo.py --mode filter_qas_dataset_lm_score --base_dir <base-dir> --sub_dir <sub-dir> --pattern test.target.hypo.beam60.qas

3. Evaluate the generated question and answer pairs using the source document as input

python sm_inference_asum.py --task qa_eval --base_dir <base-dir> --output_dir <output-dir> --num_workers <num-workers> --bsz 30 --checkpoint_dir <checkpoint-dir> --ckp_file checkpoint2.pt --bin_dir <data-dir>/data_bin --qas_dir <qas-dir> --source_file test.source --target_file test.target --input_file test.target.qas_filtered --prepend_target False

4. Compute QUALS scores for each summary

python evaluate_hypo.py --mode compute_hypos_lm_score --base_dir <base-dir> --sub_dir <sub-dir> --pattern test.*.source_eval_noprepend

CONSEQ (CONtrastive SEQ2seq learning)

To use QUALS to improve factual consistency of the summarization model using the CONSEQ algorithm, we follow the steps:

  1. Obtain the MLE summarization baseline by fine-tuning the BART model. Note that in the ACL paper, we used the corruption-filtered CNNDM and XSUM datasets (filter_level=1).
  2. Use the MLE summarization model to sample summaries on the training data.
  3. Evaluate the QUALS for the generated summaries as well as the ground truth summaries of the training data.
  4. Form the positive and negative sets for contrastive learning.
  5. Fine-tune the MLE summarization model using the positive and negative examples. Example scripts for launching training jobs locally or on Sagemaker are preprocess/run_generate_unlikelihood_train_cnndm.sh and preprocess/run_generate_unlikelihood_train_xsum.sh.
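
Step 4 can be pictured with the following sketch, which picks one positive and one negative summary per document from QUALS-scored candidates. The selection rule here (highest versus lowest score, kept only when the gap exceeds a margin) is an assumption made for illustration, and form_contrastive_sets is a hypothetical helper; the criteria used in the paper and in the example scripts may differ.

```python
from typing import Dict, List, Tuple

Scored = Dict[str, List[Tuple[str, float]]]  # doc id -> [(summary, QUALS score)]
Pairs = List[Tuple[str, str]]                # [(doc id, summary)]


def form_contrastive_sets(scored: Scored, margin: float = 0.0) -> Tuple[Pairs, Pairs]:
    """Select the best-scoring summary as positive and the worst as negative."""
    positives, negatives = [], []
    for doc_id, candidates in scored.items():
        if len(candidates) < 2:
            continue  # need at least two candidates to form a contrast
        ranked = sorted(candidates, key=lambda pair: pair[1])
        worst, best = ranked[0], ranked[-1]
        # Keep the pair only if the score gap is larger than the margin.
        if best[1] - worst[1] > margin:
            positives.append((doc_id, best[0]))
            negatives.append((doc_id, worst[0]))
    return positives, negatives


scored = {
    "doc1": [("ground truth", -0.2), ("sample a", -1.5), ("sample b", -0.4)],
    "doc2": [("only candidate", -0.3)],
}
positives, negatives = form_contrastive_sets(scored, margin=0.5)
print(positives, negatives)
```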

We provide an example script preprocess/run_generate_unlikelihood_train_xsum.sh to illustrate steps 2-4. Note that

  • To avoid running for a long time and encountering OOM errors and then restarting the whole process, we split the input files into smaller ones. We do this by splitting the source file by line (e.g. each sub-file has 10000 lines):
split -l 10000 train.source train.source.split   
  • The script has to call python sm_inference_asum.py --task gen_qa repeatedly to generate question-answer pairs, once for each sub-file produced by the line split. The Python function automatically checks which sub-files have been processed (based on the output files), so it always processes the next available sub-file. If all sub-files have been processed, it simply does nothing, so it is safe to call it more times than there are sub-files.
  • Similarly, sm_inference_asum.py --task qa_eval needs to be called repeatedly to cover all sub-files.
  • The speed of question-answer pair generation depends on the batch size setting. Batching is handled differently depending on the summary file and the batch_lines setting. If the summary file contains only a single summary per input document, batch_lines should be set to True, and bsz input lines are batched together as input to the QAGen model. If the summary file contains multiple summaries per input document, batch_lines should be set to False, and batching is done using bsz within each input example (line). For example, if there are 7 summaries per line in the summary file, we should set batch_lines to False; setting bsz to 7 will batch all 7 summaries in a line together, which gives the best speed. (Setting it higher won't improve speed, since with batch_lines set to False we do not batch across different input lines.) On CNN-DM, a bsz of 7 would sometimes result in OOM errors with 16 GB of GPU memory, so we use 3 or 4; on XSUM, it is safe to use 7.
  • num_workers should be the number of GPUs available on the machine. The lines in each input file are distributed across the GPUs.
  • Finally, run the following to concatenate the QUALS scores from the sub-files: cat *.quals > train.source.source_eval_noprepend.quals
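
Two of the behaviours described in the notes above, picking the next unprocessed sub-file and batching with or without batch_lines, can be sketched as follows. Both functions are hypothetical illustrations of the described logic, not the code in sm_inference_asum.py; in particular, the output-file naming via output_suffix is an assumption.

```python
from pathlib import Path
from typing import List, Optional


def next_unprocessed(sub_files: List[Path], output_suffix: str) -> Optional[Path]:
    """Return the first sub-file without an output file, or None if all are done."""
    for path in sorted(sub_files):
        if not Path(str(path) + output_suffix).exists():
            return path
    return None


def make_batches(lines: List[List[str]], bsz: int, batch_lines: bool) -> List[List[str]]:
    """Group summaries into batches; each inner list of `lines` is one input line."""
    if batch_lines:
        # Batch across input lines (intended for one summary per line).
        flat = [summary for summaries in lines for summary in summaries]
        return [flat[i:i + bsz] for i in range(0, len(flat), bsz)]
    batches = []
    for summaries in lines:
        # Batch only within a line; batches never span two input lines.
        batches.extend(summaries[i:i + bsz] for i in range(0, len(summaries), bsz))
    return batches


multi = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
within_line = make_batches(multi, bsz=2, batch_lines=False)
across_lines = make_batches([["a"], ["b"], ["c"]], bsz=2, batch_lines=True)
print(within_line)
print(across_lines)
```

With batch_lines set to False, a bsz larger than the number of summaries per line yields the same batches as a bsz equal to that number, which matches the note that raising bsz further does not improve speed.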

Citations

@inproceedings{nan-etal-2021-entity,
    title = "Entity-level Factual Consistency of Abstractive Text Summarization",
    author = "Nan, Feng  and
      Nallapati, Ramesh  and
      Wang, Zhiguo  and
      Nogueira dos Santos, Cicero  and
      Zhu, Henghui  and
      Zhang, Dejiao  and
      McKeown, Kathleen  and
      Xiang, Bing",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.eacl-main.235",
    pages = "2727--2733",
}

@inproceedings{nan-etal-2021-improving,
    title = {Improving Factual Consistency of Abstractive Summarization via Question Answering},
    author = "Nan, Feng  and
      Nogueira dos Santos, Cicero  and
      Zhu, Henghui  and
      Ng, Patrick  and
      McKeown, Kathleen  and
      Nallapati, Ramesh  and
      Zhang, Dejiao  and
      Wang, Zhiguo  and
      Arnold, Andrew  and
      Xiang, Bing",
    booktitle = {Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP)},
    address = {Virtual},
    month = {August},
    url = {https://arxiv.org/abs/2105.04623},
    year = {2021}
}