AdamW optimizer and cosine learning rate annealing with restarts

This repository contains an implementation of the AdamW optimization algorithm and the cosine learning rate scheduler described in "Decoupled Weight Decay Regularization". The AdamW implementation is straightforward and differs little from the existing Adam implementation for PyTorch, except that it decouples weight decay from the batch gradient calculation. The cosine annealing scheduler with restarts lets the model converge to a (possibly) different local minimum on every restart and normalizes the weight decay hyperparameter according to the length of the restart period. Unlike the epoch-based schedulers in the standard PyTorch scheduler suite, this scheduler adjusts the optimizer's learning rate on every batch update rather than on every epoch, as the paper prescribes.
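To make "decoupled" concrete, here is a minimal scalar sketch of a single update step. It is not the code from this repository: the function name `adam_step` and the `decoupled` flag are hypothetical, and scaling the decay term by `lr` is an assumption (it follows the common PyTorch convention).

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=True):
    """One scalar Adam/AdamW update (illustrative only); returns (theta, m, v)."""
    if not decoupled:
        # Classic L2 regularization: the decay term is folded into the
        # gradient, so it is also rescaled by the adaptive denominator.
        grad = grad + weight_decay * theta
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for step t (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    new_theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled decay: shrink the weight directly, outside the moments.
        # Assumption: the decay is scaled by lr.
        new_theta -= lr * weight_decay * theta
    return new_theta, m, v
```

With a zero gradient, the decoupled variant shrinks the weight by exactly `lr * weight_decay * theta`, while the L2 variant takes a step of roughly `lr` regardless of how small `weight_decay` is, because Adam normalizes the decay term along with the gradient.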

Cyclical Learning Rates

Besides the "cosine" and "arccosine" policies (arccosine has a steeper profile at the limiting points), there are the "triangular", "triangular2" and "exp_range" policies, which implement the schedules proposed in "Cyclical Learning Rates for Training Neural Networks". The ratio of the increasing and decreasing phases of the triangular policy can be adjusted with the triangular_step parameter. The minimum allowed lr is set by the min_lr parameter.
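For the "cosine" policy, the per-batch behaviour can be sketched as follows. This is an illustration of cosine annealing with a fixed restart period, not this repository's code; `cosine_lr` and `per_batch_lrs` are hypothetical names, and the arccosine variant is not shown.

```python
import math


def cosine_lr(t_cur, base_lr, min_lr=0.0):
    """Cosine-annealed lr; t_cur in [0, 1] is the elapsed fraction of the period."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t_cur))


def per_batch_lrs(base_lr, batches_per_epoch, restart_period, num_epochs):
    """Return one lr per batch update, restarting every `restart_period` epochs."""
    lrs = []
    for epoch in range(num_epochs):
        for batch in range(batches_per_epoch):
            # Fractional number of epochs elapsed since the last restart.
            epochs_done = (epoch % restart_period) + batch / batches_per_epoch
            lrs.append(cosine_lr(epochs_done / restart_period, base_lr))
    return lrs
```

Because the position inside the period is fractional, the learning rate changes on every batch rather than jumping once per epoch, and it returns to `base_lr` at the first batch after each restart.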

The triangular schedule is enabled by passing policy="triangular".
The triangular2 schedule halves the maximum lr on each restart cycle and is enabled by passing policy="triangular2", or by combining policy="triangular" with eta_on_restart_cb=ReduceMaxLROnRestart(ratio=0.5). The ratio parameter sets the factor by which lr is scaled on each restart.
The exp_range schedule is enabled by passing policy="exp_range". It scales the maximum lr exponentially with the iteration count; the base of the exponent is set by the gamma parameter.
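The three cyclical policies above can be sketched in a few lines. This follows the general shape described in the Cyclical Learning Rates paper rather than this repository's exact code; `cyclical_lr`, `cycle_len` and `step_ratio` are hypothetical names (`step_ratio` stands in for the role the text assigns to triangular_step).

```python
def cyclical_lr(iteration, cycle_len, min_lr, max_lr, policy="triangular",
                step_ratio=0.5, gamma=0.999):
    """Illustrative cyclical lr; step_ratio must lie strictly between 0 and 1."""
    cycle = iteration // cycle_len             # 0-based index of the current cycle
    pos = (iteration % cycle_len) / cycle_len  # position inside the cycle, in [0, 1)
    if pos < step_ratio:
        height = pos / step_ratio                  # increasing phase, 0 -> 1
    else:
        height = (1.0 - pos) / (1.0 - step_ratio)  # decreasing phase, 1 -> 0
    if policy == "triangular":
        scale = 1.0                 # same peak on every cycle
    elif policy == "triangular2":
        scale = 0.5 ** cycle        # peak amplitude halves on each new cycle
    elif policy == "exp_range":
        scale = gamma ** iteration  # peak amplitude decays with iteration count
    else:
        raise ValueError(f"unknown policy: {policy}")
    return min_lr + (max_lr - min_lr) * height * scale
```

In this sketch the scaling is applied to the amplitude above `min_lr`, so the learning rate never drops below the minimum regardless of policy.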

These schedules can be combined with shrinking or expanding restart periods and with weight decay normalization, and can be used with AdamW and other PyTorch optimizers.
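As a rough illustration of how restart periods and weight decay normalization interact, the sketch below grows the period by a constant factor and rescales the decay with the paper's rule, lambda = lambda_norm * sqrt(b / (B * T)), where b is the batch size, B the number of training samples per epoch and T the period length in epochs. The function names and the rounding of fractional periods up to whole epochs are assumptions, not this repository's behaviour.

```python
import math


def restart_periods(restart_period, t_mult, total_epochs):
    """Lengths (in epochs) of successive restart cycles covering total_epochs."""
    periods, elapsed, period = [], 0, float(restart_period)
    while elapsed < total_epochs:
        # Assumption: fractional periods are rounded up to whole epochs.
        length = max(1, math.ceil(period))
        periods.append(length)
        elapsed += length
        period *= t_mult  # t_mult > 1 expands the period, t_mult < 1 shrinks it
    return periods


def normalized_weight_decay(weight_decay, batch_size, epoch_size, period):
    """Scale the raw weight decay by sqrt(b / (B * T)) for a period of T epochs."""
    return weight_decay * math.sqrt(batch_size / (epoch_size * period))
```

A longer restart period means more updates per cycle, so the normalized decay applied at each update gets smaller.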

Example:

    batch_size = 32
    epoch_size = 1024
    model = resnet()
    optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = CyclicLRWithRestarts(optimizer, batch_size, epoch_size, restart_period=5, t_mult=1.2, policy="cosine")
    for epoch in range(100):
        scheduler.step()  # call once per epoch, before the batch loop
        for batch in train_for_every_batch(...):
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.batch_step()  # call once per batch update
        validate(...)


Owner

Maksym Pyrozhok
