meProp: Sparsified Back Propagation for Accelerated Deep Learning

Overview

meProp

This code was used for the paper meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting (ICML 2017) [pdf] by Xu Sun, Xuancheng Ren, Shuming Ma, and Houfeng Wang.

Building on meProp, we further simplify the model by eliminating the rows or columns that are seldom updated. This reduces the computational cost of both training and decoding, and can potentially accelerate decoding in real-world applications. We call this method meSimp (minimal effort simplification). For more details, see the paper Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method [pdf]. The code is available [here].

Introduction

We propose a simple yet effective technique to simplify the training of neural networks. The technique is based on the top-k selection of the gradients in back propagation.

In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-k elements (in terms of magnitude) are kept. As a result, only k rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction in the computational cost. We name this method meProp (minimal effort back propagation).
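The top-k step described above can be sketched in a few lines of plain Python. This is only an illustration, not the repository's implementation: the function names top_k_sparsify and weight_gradient, the y = Wx layout, and the toy numbers are all assumptions. It shows that once the output gradient is sparsified, only k rows of the weight gradient are non-zero.

```python
def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad and zero out the rest."""
    if k <= 0 or k >= len(grad):
        return list(grad)
    # Indices of the k entries with the largest absolute value.
    keep = set(sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def weight_gradient(grad_out, x):
    """For y = W x, dL/dW[i][j] = grad_out[i] * x[j] (an outer product)."""
    return [[g * xj for xj in x] for g in grad_out]


grad_y = [0.1, -2.0, 0.05, 1.5, -0.3]  # toy gradient w.r.t. the layer output
x = [1.0, 2.0]                         # toy layer input

sparse_grad_y = top_k_sparsify(grad_y, k=2)
grad_W = weight_gradient(sparse_grad_y, x)

# Rows of W that actually receive an update.
updated_rows = [i for i, row in enumerate(grad_W) if any(v != 0.0 for v in row)]
print(sparse_grad_y)  # [0.0, -2.0, 0.0, 1.5, 0.0]
print(updated_rows)   # [1, 3]
```

For clarity the sketch still builds the full outer product, including the zero rows. An efficient implementation would compute only the k selected rows, which is where the linear reduction in cost comes from.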

Surprisingly, experimental results demonstrate that most of the time we only need to update fewer than 5% of the weights at each back propagation pass. More interestingly, the proposed method improves the accuracy of the resulting models rather than degrading it; a detailed analysis is given in the paper.

The following figure illustrates the idea of meProp.

[Figure: An illustration of the idea of meProp.]

TL;DR: Training with meProp is significantly faster than the original back propagation and achieves better accuracy on all three tasks we used: Dependency Parsing, POS Tagging, and MNIST. The method works with different neural models (MLP and LSTM), with different optimizers (we tested AdaGrad and Adam), with DropOut, and with more hidden layers. Top-k selection works better than random k-selection, and better than a normally trained k-dimensional network.

Update: Results on test set (please refer to the paper for detailed results and experimental settings):

| Method (Adam, CPU)      | Backprop Time (s) | Test (%)      |
|-------------------------|-------------------|---------------|
| Parsing (MLP 500d)      | 9,078             | 89.80         |
| Parsing (meProp top-20) | 489 (18.6x)       | 88.94 (+0.04) |
| POS-Tag (LSTM 500d)     | 16,167            | 97.22         |
| POS-Tag (meProp top-10) | 436 (37.1x)       | 97.25 (+0.03) |
| MNIST (MLP 500d)        | 170               | 98.20         |
| MNIST (meProp top-80)   | 29 (5.9x)         | 98.27 (+0.07) |
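The speedup factors in parentheses are the baseline backprop time divided by the meProp backprop time. The short check below recomputes them from the times in the table above; the dictionary layout and variable names are only illustrative.

```python
# Backprop times in seconds, copied from the table above: (baseline, meProp).
backprop_times = {
    "Parsing": (9078, 489),
    "POS-Tag": (16167, 436),
    "MNIST": (170, 29),
}

# Speedup = baseline time / meProp time, rounded to one decimal place.
speedups = {
    task: round(baseline / meprop, 1)
    for task, (baseline, meprop) in backprop_times.items()
}
print(speedups)  # {'Parsing': 18.6, 'POS-Tag': 37.1, 'MNIST': 5.9}
```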

The effect of k, selection (top-k vs. random), and network dimension (top-k vs. k-dimensional):

[Figure: Effect of k]

To achieve speedups on GPUs, we make a slight change that unifies the top-k pattern across the mini-batch. The original meProp produces a different top-k pattern for each example in a mini-batch, which requires sparse matrix multiplication. However, sparse matrix multiplication is less efficient than dense matrix multiplication on GPUs. By unifying the top-k pattern, we can extract the parts of the matrices that need computation (as dense matrices), compute the results, and reconstruct them to the appropriate size for further computation. This leads to actual speedups on GPUs, although we believe a better-designed method could achieve larger ones.
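The unified variant can be sketched as follows in plain Python. This is a rough illustration under stated assumptions, not the repository's code: the helper names are hypothetical, and scoring each column by the sum of absolute gradient values over the mini-batch is an assumed criterion. The sketch picks one shared set of k columns, extracts a small dense matrix, and scatters it back to the original width.

```python
def unified_top_k_indices(grad_batch, k):
    """Pick one set of k columns shared by every example in the mini-batch.

    Assumption: a column's score is the sum of absolute gradient values
    over the batch; the actual implementation may score columns differently.
    """
    n_cols = len(grad_batch[0])
    scores = [sum(abs(row[j]) for row in grad_batch) for j in range(n_cols)]
    top = sorted(range(n_cols), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


def extract_columns(matrix, cols):
    """Build the small dense matrix that holds only the selected columns."""
    return [[row[j] for j in cols] for row in matrix]


def scatter_columns(dense, cols, n_cols):
    """Place the dense columns back into a zero matrix with n_cols columns."""
    out = [[0.0] * n_cols for _ in dense]
    for r, row in enumerate(dense):
        for c, j in enumerate(cols):
            out[r][j] = row[c]
    return out


# Toy mini-batch of two output gradients with four dimensions each.
grad_batch = [
    [0.1, -2.0, 0.0, 0.4],
    [0.2, 1.0, -0.1, 3.0],
]
cols = unified_top_k_indices(grad_batch, k=2)
dense = extract_columns(grad_batch, cols)
restored = scatter_columns(dense, cols, n_cols=4)
print(cols)      # [1, 3]
print(dense)     # [[-2.0, 0.4], [1.0, 3.0]]
print(restored)  # [[0.0, -2.0, 0.0, 0.4], [0.0, 1.0, 0.0, 3.0]]
```

Because every example shares the same columns, the expensive products can run on the small dense matrix with ordinary dense kernels, at the cost of a selection that is no longer optimal for each individual example.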

See [pdf] for more details, experimental results, and analysis.

Usage

PyTorch

Requirements

  • Python 3.5
  • PyTorch v0.1.12 through v0.3.1
  • torchvision
  • CUDA 8.0

Dataset

MNIST: The code automatically downloads and processes the dataset (using torchvision). See the function get_mnist in the PyTorch code for more information.

Run

python3.5 main.py

The code runs unified meProp by default. To run meProp with sparse matrix multiplication instead, change the lines at the bottom of main.py, or pass the arguments on the command line.

usage: main.py [-h] [--n_epoch N_EPOCH] [--d_hidden D_HIDDEN]
               [--n_layer N_LAYER] [--d_minibatch D_MINIBATCH]
               [--dropout DROPOUT] [--k K] [--unified] [--no-unified]
               [--random_seed RANDOM_SEED]

optional arguments:
  -h, --help            show this help message and exit
  --n_epoch N_EPOCH     number of training epochs
  --d_hidden D_HIDDEN   dimension of hidden layers
  --n_layer N_LAYER     number of layers, including the output layer
  --d_minibatch D_MINIBATCH
                        size of minibatches
  --dropout DROPOUT     dropout rate
  --k K                 k in meProp (if invalid, e.g. 0, do not use meProp)
  --unified             use unified meProp
  --no-unified          do not use unified meProp
  --random_seed RANDOM_SEED
                        random seed
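For readers who want to see how such an interface fits together, here is a minimal argparse sketch that mirrors the usage text above. The flag names and help strings come from that text; every default value, the build_parser helper, and the argument types are assumptions and may differ from the real main.py.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the usage text; all defaults are placeholders."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--n_epoch", type=int, default=20,
                        help="number of training epochs")
    parser.add_argument("--d_hidden", type=int, default=500,
                        help="dimension of hidden layers")
    parser.add_argument("--n_layer", type=int, default=3,
                        help="number of layers, including the output layer")
    parser.add_argument("--d_minibatch", type=int, default=50,
                        help="size of minibatches")
    parser.add_argument("--dropout", type=float, default=0.1,
                        help="dropout rate")
    parser.add_argument("--k", type=int, default=30,
                        help="k in meProp (if invalid, e.g. 0, do not use meProp)")
    # --unified and --no-unified write to the same destination.
    parser.add_argument("--unified", dest="unified", action="store_true",
                        help="use unified meProp")
    parser.add_argument("--no-unified", dest="unified", action="store_false",
                        help="do not use unified meProp")
    parser.set_defaults(unified=True)  # unified meProp is the stated default
    parser.add_argument("--random_seed", type=int, default=12976,
                        help="random seed")
    return parser


# Example: meProp with k=30 using sparse matrix multiplication.
args = build_parser().parse_args(["--k", "30", "--no-unified", "--n_layer", "3"])
print(args.k, args.unified, args.n_layer)  # 30 False 3
```

The equivalent command line would be python3.5 main.py --k 30 --no-unified --n_layer 3.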

The results are written to stdout by default. To write them to a file instead, change the file argument when initializing TestGroup.

The code also supports simple unified meProp. Note that this code uses GPU 0 by default.

C#

Requirements

  • Targeting Microsoft .NET Framework 4.6.1+
  • Compatible versions of Mono should work fine (tested with Mono 5.0.1)
  • Developed with Microsoft Visual Studio 2017

Dataset

MNIST: Download from link. Extract the files and place them in the same directory as the executable.

Run

Compile the code first, or use the executable provided in releases.

Then run

nnmnist.exe <config file>

or

mono nnmnist.exe <config file>

where <config file> is a configuration file. There is an example configuration file in the source code. The example configuration file runs the baseline model. Change NetType to mlptop to experiment with meProp, and to mlpvar to experiment with meSimp. The output is written to a file in the same directory as the executable.

The code also supports random k selection.

Citation

bibtex:

@InProceedings{sun17meprop,
  title = 	 {me{P}rop: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting},
  author = 	 {Xu Sun and Xuancheng Ren and Shuming Ma and Houfeng Wang},
  booktitle = 	 {Proceedings of the 34th International Conference on Machine Learning},
  pages = 	 {3299--3308},
  year = 	 {2017},
  volume = 	 {70},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {International Convention Centre, Sydney, Australia}
}
Releases(v0.2.0)
Owner
LancoPKU
Language Computing and Machine Learning Group (Xu Sun's group) at Peking University