Understanding and Overcoming the Challenges of Efficient Transformer Quantization

Overview

Transformer Quantization

This repository contains the implementation and experiments for the paper presented in

Yelysei Bondarenko1, Markus Nagel1, Tijmen Blankevoort1, "Understanding and Overcoming the Challenges of Efficient Transformer Quantization", EMNLP 2021. [ACL Anthology] [ArXiv]

1 Qualcomm AI Research (Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.)

Reference

If you find our work useful, please cite

@inproceedings{bondarenko-etal-2021-understanding,
    title = "Understanding and Overcoming the Challenges of Efficient Transformer Quantization",
    author = "Bondarenko, Yelysei  and
      Nagel, Markus  and
      Blankevoort, Tijmen",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.627",
    pages = "7947--7969",
    abstract = "Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges {--} namely, high dynamic activation ranges that are difficult to represent with a low bit fixed-point format. We establish that these activations contain structured outliers in the residual connections that encourage specific attention patterns, such as attending to the special separator token. To combat these challenges, we present three solutions based on post-training quantization and quantization-aware training, each with a different set of compromises for accuracy, model size, and ease of use. In particular, we introduce a novel quantization scheme {--} per-embedding-group quantization. We demonstrate the effectiveness of our methods on the GLUE benchmark using BERT, establishing state-of-the-art results for post-training quantization. Finally, we show that transformer weights and embeddings can be quantized to ultra-low bit-widths, leading to significant memory savings with a minimum accuracy loss. Our source code is available at \url{https://github.com/qualcomm-ai-research/transformer-quantization}.",
}

How to install

First, ensure locale variables are set as follows:

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

Second, make sure to have Python ≥3.6 (tested with Python 3.6.8) and ensure the latest version of pip (tested with 21.2.4):

pip install --upgrade --no-deps pip

Next, install PyTorch 1.4.0 with the appropriate CUDA version (tested with CUDA 10.0, CuDNN 7.6.3):

pip install torch==1.4.0 torchvision==0.5.0 -f https://download.pytorch.org/whl/torch_stable.html

Finally, install the remaining dependencies using pip:

pip install -r requirements.txt

To run the code, the project root directory needs to be added to your pythonpath:

export PYTHONPATH="${PYTHONPATH}:/path/to/this/dir"

Running experiments

The main run file to reproduce all experiments is main.py. It contains 4 commands to train and validate FP32 and quantized model:

Usage: main.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  train-baseline
  train-quantized
  validate-baseline
  validate-quantized

You can see the full list of options for each command using python main.py [COMMAND] --help.

A. FP32 fine-tuning

To start with, you need to get the fune-tuned model(s) for the GLUE task of interest. Example run command for fine-tuning:

python main.py train-baseline --cuda --save-model --model-name bert_base_uncased --task rte \
    --learning-rate 3e-05 --batch-size 8 --eval-batch-size 8 --num-epochs 3 --max-seq-length 128 \
    --seed 1000 --output-dir /path/to/output/dir/

You can also do it directly using HuggingFace library [examples]. In all experiments we used seeds 1000 - 1004 and reported the median score. The sample output directory looks as follows:

/path/to/output/dir
├── config.out
├── eval_results_rte.txt
├── final_score.txt
├── out
│   ├── config.json  # Huggingface model config
│   ├── pytorch_model.bin  # PyTorch model checkpoint
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json  # Huggingface tokenizer config
│   ├── training_args.bin
│   └── vocab.txt  # Vocabulary
└── tb_logs  # TensorBoard logs
    ├── 1632747625.1250594
    │   └── events.out.tfevents.*
    └── events.out.tfevents.*

For validation (both full-precision and quantized), it is assumed that these output directories with the fine-tuned checkpoints are aranged as follows (you can also use a subset of GLUE tasks):

/path/to/saved_models/
├── rte/rte_model_dir
│   ├── out
│   │   ├── config.json  # Huggingface model config
│   │   ├── pytorch_model.bin  # PyTorch model checkpoint
│   │   ├── tokenizer_config.json  # Huggingface tokenizer config
│   │   ├── vocab.txt  # Vocabulary
│   │   ├── (...)
├── cola/cola_model_dir
│   ├── out
│   │   ├── (...)
├── mnli/mnli_model_dir
│   ├── out
│   │   ├── (...)
├── mrpc/mrpc_model_dir
│   ├── out
│   │   ├── (...)
├── qnli/qnli_model_dir
│   ├── out
│   │   ├── (...)
├── qqp/qqp_model_dir
│   ├── out
│   │   ├── (...)
├── sst2/sst2_model_dir
│   ├── out
│   │   ├── (...)
└── stsb/stsb_model_dir
    ├── out
    │   ├── (...)

Note, that you have to create this file structure manually.

The model can then be validated as follows:

python main.py validate-baseline --eval-batch-size 32 --seed 1000 --model-name bert_base_uncased \
    --model-path /path/to/saved_models/ --task rte

You can also validate multiple or all checkpoints by specifying --task --task [...] or --task all, respectively.

B. Post-training quantization (PTQ)

1) Standard (naïve) W8A8 per-tensor PTQ / base run command for all PTQ experiments

python main.py validate-quantized --act-quant --weight-quant --no-pad-to-max-length \
	--est-ranges-no-pad --eval-batch-size 16 --seed 1000 --model-path /path/to/saved_models/ \
	--task rte --n-bits 8 --n-bits-act 8 --qmethod symmetric_uniform \
	--qmethod-act asymmetric_uniform --weight-quant-method MSE --weight-opt-method golden_section \
	--act-quant-method current_minmax --est-ranges-batch-size 1 --num-est-batches 1 \
	--quant-setup all

Note that the range estimation settings are slightly different for each task.

2) Mixed precision W8A{8,16} PTQ

Specify --quant-dict "{'y': 16, 'h': 16, 'x': 16}":

  • 'x': 16 will set FFN's input to 16-bit
  • 'h': 16 will set FFN's output to 16-bit
  • 'y': 16 will set FFN's residual sum to 16-bit

For STS-B regression task, you will need to specify --quant-dict "{'y': 16, 'h': 16, 'x': 16, 'P': 16, 'C': 16}" and --quant-setup MSE_logits, which will also quantize pooler and the final classifier to 16-bit and use MSE estimator for the output.

3) Per-embedding and per-embedding-group (PEG) activation quantization

  • --per-embd -- Per-embedding quantization for all activations
  • --per-groups [N_GROUPS] -- PEG quantization for all activations, no permutation
  • --per-groups [N_GROUPS] --per-groups-permute -- PEG quantization for all activations, apply range-based permutation (separate for each quantizer)
  • --quant-dict "{'y': 'ng6', 'h': 'ng6', 'x': 'ng6'}" -- PEG quantization using 6 groups for FFN's input, output and residual sum, no permutation
  • --quant-dict "{'y': 'ngp6', 'h': 'ngp6', 'x': 'ngp6'}" --per-groups-permute-shared-h -- PEG quantization using 6 groups for FFN's input, output and residual sum, apply range-based permutation (shared between tensors in the same layer)

4) W4A32 PTQ with AdaRound

python main.py validate-quantized --weight-quant --no-act-quant --no-pad-to-max-length \
	--est-ranges-no-pad --eval-batch-size 16 --seed 1000 --model-path /path/to/saved_models/ \
	--task rte --qmethod symmetric_uniform --qmethod-act asymmetric_uniform --n-bits 4 \
	--weight-quant-method MSE --weight-opt-method grid --num-candidates 100 --quant-setup all \
	--adaround all --adaround-num-samples 1024 --adaround-init range_estimator \
	--adaround-mode learned_hard_sigmoid --adaround-asym --adaround-iters 10000 \
	--adaround-act-quant no_act_quant

C. Quantization-aware training (QAT)

Base run command for QAT experiments (using W4A8 for example):

python main.py train-quantized --cuda --do-eval --logging-first-step --weight-quant --act-quant \
	--pad-to-max-length --learn-ranges --tqdm --batch-size 8 --seed 1000 \
	--model-name bert_base_uncased --learning-rate 5e-05 --num-epochs 6 --warmup-steps 186 \
	--weight-decay 0.0 --attn-dropout 0.0 --hidden-dropout 0.0 --max-seq-length 128 --n-bits 4 \
	--n-bits-act 8 --qmethod symmetric_uniform --qmethod-act asymmetric_uniform \
	--weight-quant-method MSE --weight-opt-method golden_section --act-quant-method current_minmax \
	--est-ranges-batch-size 16 --num-est-batches 1 --quant-setup all \
	--model-path /path/to/saved_models/rte/out --task rte --output-dir /path/to/qat_output/dir

Note that the settings are slightly different for each task (see Appendix).

To run mixed-precision QAT with 2-bit embeddings and 4-bit weights, add --quant-dict "{'Et': 2}".

Owner
An initiative of Qualcomm Technologies, Inc.
Deep Image Search is an AI-based image search engine that includes deep transfor learning features Extraction and tree-based vectorized search.

Deep Image Search - AI-Based Image Search Engine Deep Image Search is an AI-based image search engine that includes deep transfer learning features Ex

139 Jan 01, 2023
🏆 The 1st Place Submission to AICity Challenge 2021 Natural Language-Based Vehicle Retrieval Track (Alibaba-UTS submission)

AI City 2021: Connecting Language and Vision for Natural Language-Based Vehicle Retrieval 🏆 The 1st Place Submission to AICity Challenge 2021 Natural

82 Dec 29, 2022
Toolbox of models, callbacks, and datasets for AI/ML researchers.

Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch Website • Installation • Main

Pytorch Lightning 1.4k Dec 30, 2022
Code for: Imagine by Reasoning: A Reasoning-Based Implicit Semantic Data Augmentation for Long-Tailed Classification

Imagine by Reasoning: A Reasoning-Based Implicit Semantic Data Augmentation for Long-Tailed Classification Prerequisite PyTorch = 1.2.0 Python3 torch

16 Dec 14, 2022
SCAN: Learning to Classify Images without Labels, incl. SimCLR. [ECCV 2020]

Learning to Classify Images without Labels This repo contains the Pytorch implementation of our paper: SCAN: Learning to Classify Images without Label

Wouter Van Gansbeke 1.1k Dec 30, 2022
Official Implementation of SWAD (NeurIPS 2021)

SWAD: Domain Generalization by Seeking Flat Minima (NeurIPS'21) Official PyTorch implementation of SWAD: Domain Generalization by Seeking Flat Minima.

Junbum Cha 97 Dec 20, 2022
MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space

Update (20 Jan 2020): MODALS on text data is avialable MODALS MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space Table of Conte

38 Dec 15, 2022
Little tool in python to watch anime from the terminal (the better way to watch anime)

ani-cli Script working again :), thanks to the fork by Dink4n for the alternative approach to by pass the captcha on gogoanime A cli to browse and wat

Harshith 4.5k Dec 31, 2022
Fuzzing JavaScript Engines with Aspect-preserving Mutation

DIE Repository for "Fuzzing JavaScript Engines with Aspect-preserving Mutation" (in S&P'20). You can check the paper for technical details. Environmen

gts3.org (<a href=[email protected])"> 190 Dec 11, 2022
Defending against Model Stealing via Verifying Embedded External Features

Defending against Model Stealing Attacks via Verifying Embedded External Features This is the official implementation of our paper Defending against M

20 Dec 30, 2022
Active and Sample-Efficient Model Evaluation

Active Testing: Sample-Efficient Model Evaluation Hi, good to see you here! 👋 This is code for "Active Testing: Sample-Efficient Model Evaluation". P

Jannik Kossen 19 Oct 30, 2022
Advances in Neural Information Processing Systems (NeurIPS), 2020.

What is being transferred in transfer learning? This repo contains the code for the following paper: Behnam Neyshabur*, Hanie Sedghi*, Chiyuan Zhang*.

Google Research 36 Aug 26, 2022
Deepface is a lightweight face recognition and facial attribute analysis (age, gender, emotion and race) framework for python

deepface Deepface is a lightweight face recognition and facial attribute analysis (age, gender, emotion and race) framework for python. It is a hybrid

Kushal Shingote 2 Feb 10, 2022
Semi-Supervised Semantic Segmentation with Cross-Consistency Training (CCT)

Semi-Supervised Semantic Segmentation with Cross-Consistency Training (CCT) Paper, Project Page This repo contains the official implementation of CVPR

Yassine 344 Dec 29, 2022
When are Iterative GPs Numerically Accurate?

When are Iterative GPs Numerically Accurate? This is a code repository for the paper "When are Iterative GPs Numerically Accurate?" by Wesley Maddox,

Wesley Maddox 1 Jan 06, 2022
Boosting Adversarial Attacks with Enhanced Momentum (BMVC 2021)

EMI-FGSM This repository contains code to reproduce results from the paper: Boosting Adversarial Attacks with Enhanced Momentum (BMVC 2021) Xiaosen Wa

John Hopcroft Lab at HUST 10 Sep 26, 2022
Breast Cancer Detection 🔬 ITI "AI_Pro" Graduation Project

BreastCancerDetection - This program is designed to predict two severity of abnormalities associated with breast cancer cells: benign and malignant. Mammograms from MIAS is preprocessed and features

6 Nov 29, 2022
New AidForBlind - Various Libraries used like OpenCV and other mentioned in Requirements.txt

AidForBlind Recommended PyCharm IDE Various Libraries used like OpenCV and other

Aalhad Chandewar 1 Jan 13, 2022
This is a Machine Learning Based Hand Detector Project, It Uses Machine Learning Models and Modules Like Mediapipe, Developed By Google!

Machine Learning Hand Detector This is a Machine Learning Based Hand Detector Project, It Uses Machine Learning Models and Modules Like Mediapipe, Dev

Popstar Idhant 3 Feb 25, 2022
[SIGGRAPH Asia 2021] DeepVecFont: Synthesizing High-quality Vector Fonts via Dual-modality Learning.

DeepVecFont This is the homepage for "DeepVecFont: Synthesizing High-quality Vector Fonts via Dual-modality Learning". Yizhi Wang and Zhouhui Lian. WI

Yizhi Wang 17 Dec 22, 2022