Mish: Self Regularized Non-Monotonic Activation Function

BMVC 2020 (Official Paper)

Notes:

A considerably faster CUDA-based version can be found here - Mish CUDA (all credit to Thomas Brandon)
A memory-efficient experimental version of Mish can be found here
Faster variants for Mish and H-Mish by Yashas Samaga can be found here - ConvolutionBuildingBlocks
Alternative (experimental improved) variant of H-Mish developed by Páll Haraldsson can be found here - H-Mish (Available in Julia)
Variance based initialization method for Mish (experimental) by Federico Andres Lois can be found here - Mish_init

Changelogs/ Updates:

[07/17] Mish added to OpenVino - Open-1187, Merged-1125
[07/17] Mish added to BetaML.jl
[07/17] Loss Landscape exploration progress in collaboration with Javier Ideami and Ajay Uppili Arasanipalai
[07/17] Poster accepted for presentation at DLRLSS hosted by MILA, CIFAR, Vector Institute and AMII
[07/20] Mish added to Google's AutoML - 502
[07/27] Mish paper accepted to 31st British Machine Vision Conference (BMVC), 2020. ArXiv version to be updated soon.
[08/13] New updated PyTorch benchmarks and pretrained models available on PyTorch Benchmarks.
[08/14] New updated Arxiv version of the paper is out.
[08/18] Mish added to Sony Nnabla - Merged-700
[09/02] Mish added to TensorFlow Swift APIs - Merged - 1068
[06/09] Official paper and presentation video for BMVC is released at this link.
[23/09] CSP-p7 + Mish (multi-scale) is currently the SOTA in Object Detection on MS-COCO test-dev while CSP-p7 + Mish (single-scale) is currently the 3rd best model in Object detection on MS-COCO test dev. Further details on paperswithcode leaderboards.
[11/11] Mish added to TFLearn - Merged 1159 (Follow up 1141)
[17/11] Mish added to MONAI - Merged 1235
[20/11] Mish added to plaidml - Merged 1566
[10/12] Mish added to Simd and Synet - Docs
[14/12] Mish added to OneFlow - Merged 3972
[24/12] Mish added to GPT-Neo
[21/04] Mish added to TensorFlow JS
[02/05] Mish added to Axon
[26/05] 🔥 Mish is added to PyTorch. Will be added in PyTorch 1.9. 🔥
[27/05] Mish is added to PyTorch YOLO v3
[09/06] 🔥 Mish is added to MXNet.
[03/07] Mish is added to TorchSharp.
[05/08] Mish is added to KotlinDL.

News/ Media Coverage:

(02/2020): Podcast episode on Mish at Machine Learning Café is out now.

(02/2020): Talk on Mish and Non-Linear Dynamics at Sicara is out now.

(07/2020): CROWN: A comparison of morphology for Mish, Swish and ReLU produced in collaboration with Javier Ideami.

(08/2020): Talk on Mish and Non-Linear Dynamics at Computer Vision Talks.

(12/2020): Talk on From Smooth Activations to Robustness to Catastrophic Forgetting at Weights & Biases Salon is out now.

(12/2020): Weights & Biases integration is now added 🔥. Get started.
(08/2021) Comprehensive hardware based computation performance benchmark for Mish has been conducted by Benjamin Warner. Blogpost.

MILA/ CIFAR 2020 DLRLSS

Contents:

1. Mish
   a. Loss landscape
2. ImageNet Scores
3. MS-COCO
4. Variation of Parameter Comparison
   a. MNIST
   b. CIFAR10
5. Significance Level
6. Results
   a. Summary of Results (Vision Tasks)
   b. Summary of Results (Language Tasks)
7. Try It!
8. Future Work
9. Acknowledgements
10. Cite this work

Mish:

$f(x) = x\tanh (softplus(x)) = x\tanh(\ln (1 + e^{x}))$

Minimum of f(x) is observed to be ≈-0.30884 at x≈-1.1924
Mish has a parametric order of continuity of: C^∞
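
As a quick sanity check of the definition and the quoted minimum, the following is a minimal sketch in plain Python (standard library only). The helper names `softplus`, `mish` and `argmin_on_grid`, the search interval and the grid resolution are illustrative choices, not part of the official implementations linked in this README.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable ln(1 + e^x)."""
    # max(x, 0) + log1p(exp(-|x|)) avoids overflow for large positive x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def argmin_on_grid(lo: float = -3.0, hi: float = 0.0, steps: int = 30_000):
    """Return (x, f(x)) for the smallest Mish value on an evenly spaced grid."""
    best_x, best_y = lo, mish(lo)
    for i in range(1, steps + 1):
        x = lo + (hi - lo) * i / steps
        y = mish(x)
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y


x_min, f_min = argmin_on_grid()
print(f"minimum of about {f_min:.5f} near x = {x_min:.4f}")
```

The grid search is deliberately naive; it should land close to the minimum quoted above, and it only illustrates that Mish is bounded below and non-monotonic for negative inputs.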

Derivative of Mish expressed in terms of Swish and the Δ(x) preconditioner:

$f'(x) = (sech^{2}(softplus(x)))(xsigmoid(x)) + \frac{f(x)}{x}$

Writing $\Delta(x) = sech^{2}(softplus(x))$ and $swish(x) = xsigmoid(x)$, this simplifies to:

$f'(x) = \Delta(x)swish(x) + \frac{f(x)}{x}$

Alternative derivative form:

$f'(x) = \frac{e^{x}\omega}{\delta^{2}}$

where:

$\omega = 4(x+1)+4e^{2x} +e^{3x} +e^{x}(4x+6)$

$\delta = 2e^{x} +e^{2x} +2$

We hypothesize that Δ(x) acts as a preconditioner, making the gradient smoother. Further details are provided in the paper.
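
The two derivative forms above can be cross-checked numerically. The sketch below (plain Python, standard library only) evaluates the Swish-based form with Δ(x) = sech²(softplus(x)), the closed form with ω and δ, and a central finite difference of f. All function names are illustrative, and the closed form is only evaluated for moderate x because e^{3x} overflows for very large inputs.

```python
import math


def softplus(x: float) -> float:
    # Stable ln(1 + e^x).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    # Stable logistic function for both signs of x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mish(x: float) -> float:
    return x * math.tanh(softplus(x))


def mish_grad_swish_form(x: float) -> float:
    """f'(x) = Delta(x) * swish(x) + f(x)/x, with Delta = sech^2(softplus(x))."""
    t = math.tanh(softplus(x))   # equals f(x)/x, and is well defined at x = 0
    precond = 1.0 - t * t        # sech^2(u) = 1 - tanh^2(u)
    return precond * x * sigmoid(x) + t


def mish_grad_closed_form(x: float) -> float:
    """f'(x) = e^x * omega / delta^2, using the omega and delta given above."""
    ex = math.exp(x)
    omega = 4 * (x + 1) + 4 * ex**2 + ex**3 + ex * (4 * x + 6)
    delta = 2 * ex + ex**2 + 2
    return ex * omega / delta**2


def central_difference(f, x: float, h: float = 1e-5) -> float:
    return (f(x + h) - f(x - h)) / (2 * h)


for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    a = mish_grad_swish_form(x)
    b = mish_grad_closed_form(x)
    c = central_difference(mish, x)
    print(f"x={x:+.1f}  swish form={a:.8f}  closed form={b:.8f}  finite diff={c:.8f}")
```

Computing tanh(softplus(x)) directly instead of f(x)/x avoids the division by zero at x = 0, where both forms give f'(0) = 0.6.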

Loss Landscape:

To visit the interactive Loss Landscape visualizer, click here.

Loss landscape visualizations for a ResNet-20 for CIFAR 10 using ReLU, Mish and Swish (from L-R) for 200 epochs training:

Mish provides much better accuracy, lower overall loss, and a smoother, better-conditioned, easier-to-optimize loss landscape than both Swish and ReLU. For all loss landscape visualizations please visit this readme.

We also investigate the output landscape of randomly initialized neural networks as shown below. Mish has a much smoother profile than ReLU.

ImageNet Scores:

To install the DarkNet framework, please refer to darknet (AlexeyAB)

For PyTorch based ImageNet scores, please refer to this readme

Network	Activation	Top-1 Accuracy	Top-5 Accuracy	cfg	Weights	Hardware
ResNet-50	Mish	74.244%	92.406%	cfg	weights	AWS p3.16x large, 8 Tesla V100
DarkNet-53	Mish	77.01%	93.75%	cfg	weights	AWS p3.16x large, 8 Tesla V100
DenseNet-201	Mish	76.584%	93.47%	cfg	weights	AWS p3.16x large, 8 Tesla V100
ResNext-50	Mish	77.182%	93.318%	cfg	weights	AWS p3.16x large, 8 Tesla V100

Network	Activation	Top-1 Accuracy	Top-5 Accuracy
CSPResNet-50	Leaky ReLU	77.1%	94.1%
CSPResNet-50	Mish	78.1%	94.2%

Pelee Net	Leaky ReLU	70.7%	90%
Pelee Net	Mish	71.4%	90.4%
Pelee Net	Swish	71.5%	90.7%

CSPPelee Net	Leaky ReLU	70.9%	90.2%
CSPPelee Net	Mish	71.2%	90.3%

Results on CSPResNext-50:

MixUp	CutMix	Mosaic	Blur	Label Smoothing	Leaky ReLU	Swish	Mish	Top-1 Accuracy	Top-5 Accuracy	cfg	weights
					✔️			77.9%(=)	94%(=)
✔️					✔️			77.2%(-)	94%(=)
	✔️				✔️			78%(+)	94.3%(+)
		✔️			✔️			78.1%(+)	94.5%(+)
			✔️		✔️			77.5%(-)	93.8%(-)
				✔️	✔️			78.1%(+)	94.4%(+)
						✔️		64.5%(-)	86%(-)
							✔️	78.9%(+)	94.5%(+)
	✔️	✔️		✔️	✔️			78.5%(+)	94.8%(+)
	✔️	✔️		✔️			✔️	79.8%(+)	95.2%(+)	cfg	weights

Results on CSPResNet-50:

CutMix	Mosaic	Label Smoothing	Leaky ReLU	Mish	Top-1 Accuracy	Top-5 Accuracy	cfg	weights
			✔️		76.6%(=)	93.3%(=)
✔️	✔️	✔️	✔️		77.1%(+)	94.1%(+)
✔️	✔️	✔️		✔️	78.1%(+)	94.2%(+)	cfg	weights

Results on CSPDarkNet-53:

CutMix	Mosaic	Label Smoothing	Leaky ReLU	Mish	Top-1 Accuracy	Top-5 Accuracy	cfg	weights
			✔️		77.2%(=)	93.6%(=)
✔️	✔️	✔️	✔️		77.8%(+)	94.4%(+)
✔️	✔️	✔️		✔️	78.7%(+)	94.8%(+)	cfg	weights

Results on SpineNet-49:

CutMix	Mosaic	Label Smoothing	ReLU	Swish	Mish	Top-1 Accuracy	Top-5 Accuracy	cfg	weights
			✔️			77%(=)	93.3%(=)	-	-
		✔️		✔️		78.1%(+)	94%(+)	-	-
✔️	✔️	✔️			✔️	78.3%(+)	94.6%(+)	-	-

MS-COCO:

For PyTorch based MS-COCO scores, please refer to this readme

Model	Mish	AP50...95	mAP50	CPU - 90 Watt - FP32 (Intel Core i7-6700K, 4GHz, 8 logical cores) OpenCV-DLIE, FPS	VPU-2 Watt- FP16 (Intel MyriadX) OpenCV-DLIE, FPS	GPU-175 Watt- FP32/16 (Nvidia GeForce RTX 2070) DarkNet-cuDNN, FPS
CSPDarkNet-53 (512 x 512)		42.4%	64.5%	3.5	1.23	43
CSPDarkNet-53 (512 x 512)	✔️	43%	64.9%	-	-	41
CSPDarkNet-53 (608 x 608)	✔️	43.5%	65.7%	-	-	26

Architecture	Mish	CutMix	Mosaic	Label Smoothing	Size	AP	AP50	AP75
CSPResNext50-PANet-SPP					512 x 512	42.4%	64.4%	45.9%
CSPResNext50-PANet-SPP		✔️	✔️	✔️	512 x 512	42.3%	64.3%	45.7%
CSPResNext50-PANet-SPP	✔️	✔️	✔️	✔️	512 x 512	42.3%	64.2%	45.8%

CSPDarkNet53-PANet-SPP		✔️	✔️	✔️	512 x 512	42.4%	64.5%	46%
CSPDarkNet53-PANet-SPP	✔️	✔️	✔️	✔️	512 x 512	43%	64.9%	46.5%

Credits to AlexeyAB, Wong Kin-Yiu and Glenn Jocher for all the help with benchmarking MS-COCO and ImageNet.

Variation of Parameter Comparison:

MNIST:

To observe how increasing the number of layers, with all other parameters held constant, affects test accuracy, fully connected networks of varying depths were trained on MNIST, with each layer having 500 neurons. Residual connections were not used because they enable the training of arbitrarily deep networks. BatchNorm was used to lessen the dependence on initialization, along with a dropout of 25%. The networks were optimized using SGD with a batch size of 128, and for a fair comparison the same learning rate was maintained for each activation function. In the experiments, all 3 activations maintained nearly the same test accuracy for the 15-layer network. Beyond 15 layers, test accuracy decreased sharply for Swish and ReLU, whereas Mish outperformed them both in large networks where optimization becomes difficult.

Mish also consistently provided better test top-1 accuracy than Swish and ReLU as the batch size was increased for a ResNet v2-20 trained on CIFAR-10 for 50 epochs, with all other network parameters held constant for a fair comparison.

Gaussian noise with varying standard deviation was added to the input for MNIST classification using a simple conv net, to observe how test top-1 accuracy degrades for Mish compared with ReLU and Swish. Mish mostly maintained a consistent lead over both Swish and ReLU (lower than ReLU in just 1 instance and lower than Swish in 3 instances). The trend for test loss was observed following the same procedure: Mish has better loss than both Swish and ReLU except in 1 instance.
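
The input-noise perturbation described above can be sketched as follows. This is only an illustration in plain Python: the function name `add_gaussian_noise`, the toy five-pixel input and the swept standard deviations are hypothetical, and the README does not state the exact noise levels or preprocessing used in the experiment.

```python
import random


def add_gaussian_noise(pixels, std, rng):
    """Return a copy of `pixels` with zero-mean Gaussian noise of the given std added."""
    return [p + rng.gauss(0.0, std) for p in pixels]


rng = random.Random(0)                 # fixed seed so the sweep is repeatable
image = [0.0, 0.25, 0.5, 0.75, 1.0]    # hypothetical flattened input in [0, 1]

for std in (0.0, 0.1, 0.5):            # hypothetical sweep of noise levels
    noisy = add_gaussian_noise(image, std, rng)
    print(std, [round(v, 3) for v in noisy])
```

In the actual experiment, each noise level would be applied to the inputs before evaluating test accuracy and loss for every activation function.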

CIFAR10:

Significance Level:

The P-values were computed for different activation functions in comparison to Mish in terms of top-1 test accuracy of a SqueezeNet model on CIFAR-10, trained for 50 epochs over 23 runs using the Adam optimizer at a learning rate of 0.001 and a batch size of 128. Mish beats most of the activation functions at a high significance level across the 23 runs; specifically, it beats ReLU at P < 0.0001. Mish also had a comparatively low standard deviation across the 23 runs, which indicates consistent performance.

Activation Function	Mean Accuracy	Mean Loss	Standard Deviation of Accuracy	P-value	Cohen's d Score	95% CI
Mish	87.48%	4.13%	0.3967	-	-	-
Swish-1	87.32%	4.22%	0.414	P = 0.1973	0.386	-0.3975 to 0.0844
E-Swish (β=1.75)	87.49%	4.156%	0.411	P = 0.9075	0.034444	-0.2261 to 0.2539
GELU	87.37%	4.339%	0.472	P = 0.4003	0.250468	-0.3682 to 0.1499
ReLU	86.66%	4.398%	0.584	P < 0.0001	1.645536	-1.1179 to -0.5247
ELU(α=1.0)	86.41%	4.211%	0.3371	P < 0.0001	2.918232	-1.2931 to -0.8556
Leaky ReLU(α=0.3)	86.85%	4.112%	0.4569	P < 0.0001	1.47632	-0.8860 to -0.3774
RReLU	86.87%	4.138%	0.4478	P < 0.0001	1.444091	-0.8623 to -0.3595
SELU	83.91%	4.831%	0.5995	P < 0.0001	7.020812	-3.8713 to -3.2670
SoftPlus(β = 1)	83.004%	5.546%	1.4015	P < 0.0001	4.345453	-4.7778 to -4.1735
HardShrink(λ = 0.5)	75.03%	7.231%	0.98345	P < 0.0001	16.601747	-12.8948 to -12.0035
Hardtanh	82.78%	5.209%	0.4491	P < 0.0001	11.093842	-4.9522 to -4.4486
LogSigmoid	81.98%	5.705%	1.6751	P < 0.0001	4.517156	-6.2221 to -4.7753
PReLU	85.66%	5.101%	2.2406	P = 0.0004	1.128135	-2.7715 to -0.8590
ReLU6	86.75%	4.355%	0.4501	P < 0.0001	1.711482	-0.9782 to -0.4740
CELU(α=1.0)	86.23%	4.243%	0.50941	P < 0.0001	2.741669	-1.5231 to -0.9804
Sigmoid	74.82%	8.127%	5.7662	P < 0.0001	3.098289	-15.0915 to -10.2337
Softshrink(λ = 0.5)	82.35%	5.4915%	0.71959	P < 0.0001	8.830541	-5.4762 to -4.7856
Tanhshrink	82.35%	5.446%	0.94508	P < 0.0001	7.083564	-5.5646 to -4.7032
Tanh	83.15%	5.161%	0.6887	P < 0.0001	7.700198	-4.6618 to -3.9938
Softsign	82.66%	5.258%	0.6697	P < 0.0001	8.761157	-5.1493 to -4.4951
Aria-2(β = 1, α=1.5)	81.31%	6.0021%	2.35475	P < 0.0001	3.655362	-7.1757 to -5.1687
Bent's Identity	85.03%	4.531%	0.60404	P < 0.0001	4.80211	-2.7576 to -2.1502
SQNL	83.44%	5.015%	0.46819	P < 0.0001	9.317237	-4.3009 to -3.7852
ELisH	87.38%	4.288%	0.47731	P = 0.4283	0.235784	-0.3643 to 0.1573
Hard ELisH	85.89%	4.431%	0.62245	P < 0.0001	3.048849	-1.9015 to -1.2811
SReLU	85.05%	4.541%	0.5826	P < 0.0001	4.883831	-2.7306 to -2.1381
ISRU (α=1.0)	86.85%	4.669%	0.1106	P < 0.0001	5.302987	-4.4855 to -3.5815
Flatten T-Swish	86.93%	4.459%	0.40047	P < 0.0001	1.378742	-0.7865 to -0.3127
SineReLU (ε = 0.001)	86.48%	4.396%	0.88062	P < 0.0001	1.461675	-1.4041 to -0.5924
Weighted Tanh (Weight = 1.7145)	80.66%	5.985%	1.19868	P < 0.0001	7.638298	-7.3502 to -6.2890
LeCun's Tanh	82.72%	5.322%	0.58256	P < 0.0001	9.551812	-5.0566 to -4.4642
Soft Clipping (α=0.5)	55.21%	18.518%	10.831994	P < 0.0001	4.210373	-36.8255 to -27.7154
ISRLU (α=1.0)	86.69%	4.231%	0.5788	P < 0.0001	1.572874	-1.0753 to -0.4856

Values are rounded, which may cause slight deviations when reproducing the statistics from these tests.
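
To illustrate how the effect size and confidence interval columns relate to the means and standard deviations, here is a minimal sketch in plain Python for the Mish versus ReLU row. It assumes an equal-variance two-sample comparison with 23 runs per activation, which is not stated explicitly in this README; the critical value `T_CRIT_DF44` is likewise an assumed constant (two-sided 95% Student's t with 44 degrees of freedom) rather than something taken from the source.

```python
import math


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d for two equal-sized groups, using the pooled standard deviation."""
    pooled_sd = math.sqrt((sd_a**2 + sd_b**2) / 2)
    return (mean_a - mean_b) / pooled_sd


def diff_ci95(mean_a, sd_a, mean_b, sd_b, n, t_crit):
    """95% CI for (mean_b - mean_a) with n samples per group."""
    standard_error = math.sqrt((sd_a**2 + sd_b**2) / n)
    diff = mean_b - mean_a
    return diff - t_crit * standard_error, diff + t_crit * standard_error


# Assumption: two-sided 95% critical value of Student's t with 2*23 - 2 = 44 df.
T_CRIT_DF44 = 2.0154

mish_stats = (87.48, 0.3967)   # mean accuracy (%), standard deviation, from the table
relu_stats = (86.66, 0.584)

d = cohens_d(*mish_stats, *relu_stats)
ci_low, ci_high = diff_ci95(*mish_stats, *relu_stats, n=23, t_crit=T_CRIT_DF44)
print(f"Cohen's d = {d:.3f}, 95% CI = {ci_low:.4f} to {ci_high:.4f}")
```

Because the table values are rounded, the results should agree with the ReLU row only to about two decimal places.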

Results:

News: Ajay Arasanipalai recently submitted a benchmark for CIFAR-10 training to the Stanford DAWN Benchmark using a custom ResNet-9 + Mish, which achieved 94.05% accuracy in just 10.7 seconds in 14 epochs on the HAL Computing Cluster. This is the current fastest training of CIFAR-10 on 4 GPUs and the 2nd fastest training of CIFAR-10 overall in the world.

Summary of Results (Vision Tasks):

Comparison is based on the highest-priority metric for each task: Top-1 accuracy for image classification, and the loss metric for generative networks and image segmentation. For the latter, Mish > Baseline therefore indicates better (lower) loss, and vice versa. For embeddings, the AUC metric is considered.

Activation Function	Mish > Baseline Model	Mish < Baseline Model
ReLU	55	20
Swish-1	53	22
SELU	26	1
Sigmoid	24	0
TanH	24	0
HardShrink(λ = 0.5)	23	0
Tanhshrink	23	0
PReLU(Default Parameters)	23	2
Softsign	22	1
Softshrink (λ = 0.5)	22	1
Hardtanh	21	2
ELU(α=1.0)	21	7
LogSigmoid	20	4
GELU	19	3
E-Swish (β=1.75)	19	7
CELU(α=1.0)	18	5
SoftPlus(β = 1)	17	7
Leaky ReLU(α=0.3)	17	8
Aria-2(β = 1, α=1.5)	16	2
ReLU6	16	8
SQNL	13	1
Weighted TanH (Weight = 1.7145)	12	1
RReLU	12	11
ISRU (α=1.0)	11	1
Le Cun's TanH	10	2
Bent's Identity	10	5
Hard ELisH	9	1
Flatten T-Swish	9	3
Soft Clipping (α=0.5)	9	3
SineReLU (ε = 0.001)	9	4
ISRLU (α=1.0)	9	4
ELisH	7	3
SReLU	7	6
Hard Sigmoid	1	0
Thresholded ReLU(θ=1.0)	1	0

Summary of Results (Language Tasks):

Comparison is done based on the best metric score (Test accuracy) across 3 runs.

Activation Function	Mish > Baseline Model	Mish < Baseline Model
Penalized TanH	5	0
ELU	5	0
Sigmoid	5	0
SReLU	4	0
TanH	4	1
Swish	3	2
ReLU	2	3
Leaky ReLU	2	3
GELU	1	2

Try It!

Torch	DarkNet	Julia	FastAI	TensorFlow	Keras	CUDA
Source	Source	Source	Source	Source	Source	Source

Future Work:

Comparison of Convergence Rates.
Normalizing constant for Mish to eliminate the use of Batch Norm.
Regularizing effect of the first derivative of Mish with respect to Swish.

Acknowledgments:

Thanks to all the people who have helped and supported me massively through this project who include:

And many more including the Fast AI community, Weights and Biases Community, TensorFlow Addons team, SpaCy/Thinc team, Sicara team, Udacity scholarships team to name a few. Apologies if I missed out anyone.

Cite this work:

@article{misra2019mish,
  title={Mish: A self regularized non-monotonic neural activation function},
  author={Misra, Diganta},
  journal={arXiv preprint arXiv:1908.08681},
  year={2019}
}

Official Repository for "Mish: A Self Regularized Non-Monotonic Neural Activation Function" [BMVC 2020]