SAGE: Sensitivity-guided Adaptive Learning Rate for Transformers

Last update: Nov 07, 2022

Overview

This repo contains the code for the paper "No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models" (ICLR 2022).

Getting Started

Pull and run the Docker image:
pytorch/pytorch:1.5.1-cuda10.1-cudnn7-devel
Install the requirements:
pip install -r requirements.txt

Data and Model

Download the data and pre-trained models:
./download.sh
See the official GLUE benchmark documentation for details on the benchmark.
Preprocess the data:
./experiments/glue/prepro.sh
For the most up-to-date data processing details, please refer to the mt-dnn repo.

Fine-tuning Pre-trained Models using SAGE

We provide an example script for fine-tuning a pre-trained BERT-base model on MNLI using Adamax-SAGE:

./scripts/train_mnli_usadamax.sh GPUID

A few notes:

learning_rate and beta3 are two of the most important hyperparameters. A learning_rate that works well for Adamax/AdamW-SAGE is usually 2 to 5 times larger than one that works well for Adamax/AdamW, depending on the task. A beta3 that works well for Adamax/AdamW-SAGE is usually between 0.6 and 0.9, depending on the task.
To use AdamW-SAGE, set the argument --optim=usadamw. The current codebase only contains implementations of Adamax-SAGE and AdamW-SAGE. Please refer to module/bert_optim.py for details, and to our paper for integrating SAGE into other optimizers.
To fine-tune a pre-trained RoBERTa-base model, set --init_checkpoint to the model path and --encoder_type to 2. Other supported models are listed in pretrained_models.py.
To fine-tune on other tasks, set --train_datasets and --test_datasets to the corresponding task names.
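
The ranges above for learning_rate and beta3 can be turned into a small search grid. The following is a minimal sketch, not part of this repo: the helper name sage_search_grid, the baseline learning rate of 5e-5, and the exact flag spellings --learning_rate and --beta3 are assumptions for illustration. Check the training script for the actual argument names.

```python
from itertools import product


def sage_search_grid(baseline_lr,
                     lr_multipliers=(2, 3, 4, 5),
                     beta3_values=(0.6, 0.7, 0.8, 0.9)):
    """Return candidate (learning_rate, beta3) pairs for a SAGE run.

    baseline_lr is a learning rate that already works well for plain
    Adamax/AdamW on the task (hypothetical input; tune per task).
    """
    return [(baseline_lr * m, b)
            for m, b in product(lr_multipliers, beta3_values)]


# Assumed baseline of 5e-5; replace with the value tuned for your task.
grid = sage_search_grid(baseline_lr=5e-5)

for lr, beta3 in grid:
    # Flag names are assumptions; confirm them against the training script.
    print(f"--learning_rate {lr:g} --beta3 {beta3}")
```

Each printed line is one candidate configuration to try. The grid is deliberately coarse, since the ranges above are rules of thumb and not guarantees.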

Citation

@inproceedings{
liang2022no,
title={No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models},
author={Chen Liang and Haoming Jiang and Simiao Zuo and Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen and Tuo Zhao},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=cuvga_CiVND}
}

Contact Information

For help or issues related to this package, please submit a GitHub issue. For personal questions related to this paper, please contact Chen Liang ([email protected]).
