[v1 (ISBI'21) + v2] MedMNIST: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification

Overview

MedMNIST

Project (Website) | Dataset (Zenodo) | Paper (arXiv) | MedMNIST v1 (ISBI'21)

Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni

We introduce MedMNIST v2, a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST v2 is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of 708,069 2D images and 10,214 3D images in total, could support numerous research / educational purposes in biomedical image analysis, computer vision and machine learning. We benchmark several baseline methods on MedMNIST v2, including 2D / 3D neural networks and open-source / commercial AutoML tools.

For more details, please refer to our paper:

MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification (arXiv)

Key Features

  • Diverse: It covers diverse data modalities, dataset scales (from 100 to 100,000), and tasks (binary/multi-class, multi-label, and ordinal regression). It is as diverse as the VDD and MSD, so it can fairly evaluate the generalizable performance of machine learning algorithms in different settings, while providing both 2D and 3D biomedical images.
  • Standardized: Each sub-dataset is pre-processed into the same format, which requires no background knowledge from users. As an MNIST-like dataset collection for classification tasks on small images, it primarily focuses on the machine learning part rather than the end-to-end system. Furthermore, we provide standard train-validation-test splits for all datasets in MedMNIST v2, so algorithms can be easily compared.
  • Lightweight: The small size of 28×28 (2D) or 28×28×28 (3D) makes it convenient to evaluate machine learning algorithms.
  • Educational: As an interdisciplinary research area, biomedical image analysis is difficult to get started with for researchers from other communities, as it requires background knowledge from computer vision, machine learning, biomedical imaging, and clinical science. Our data, released under Creative Commons (CC) Licenses, is easy to use for educational purposes.

Please note that this dataset is NOT intended for clinical use.

Code Structure

  • medmnist/:
    • dataset.py: PyTorch datasets and dataloaders of MedMNIST.
    • evaluator.py: Standardized evaluation functions.
    • info.py: Dataset information dict for each subset of MedMNIST.
  • examples/:
    • getting_started.ipynb: To explore the MedMNIST dataset in a Jupyter notebook. It is intended ONLY for quick exploration, i.e., it does not provide full training and evaluation functionality.
    • getting_started_without_PyTorch.ipynb: This notebook provides snippets about how to use MedMNIST data (the .npz files) without PyTorch.
  • setup.py: To install medmnist as a module.
  • [EXTERNAL] MedMNIST/experiments: training and evaluation scripts to reproduce both 2D and 3D experiments in our paper, including PyTorch, auto-sklearn, AutoKeras and Google AutoML Vision together with their weights ;)

Installation and Requirements

Set up the required environment and install medmnist as a standard Python package:

pip install --upgrade git+https://github.com/MedMNIST/MedMNIST.git

Check whether you have installed the latest version:

>>> import medmnist
>>> print(medmnist.__version__)

The code requires only common Python environments for machine learning. Basically, it was tested with

  • Python 3 (Anaconda 3.6.3 specifically)
  • PyTorch==1.3.1
  • numpy==1.18.5, pandas==0.25.3, scikit-learn==0.22.2, Pillow==8.0.1, fire

Higher (or lower) versions should also work (perhaps with minor modifications).

If you use PyTorch

  • Great! Our code is designed to work with PyTorch.

  • Explore the MedMNIST dataset in a Jupyter notebook (getting_started.ipynb), and train basic neural networks in PyTorch.

If you do not use PyTorch

  • Although our code is tested with PyTorch, you are free to parse the data files with your own code (without PyTorch or even without Python!), as they are only standard NumPy serialization files. It is simple to create a dataset without PyTorch.
  • Go to getting_started_without_PyTorch.ipynb, which provides snippets about how to use MedMNIST data (the .npz files) without PyTorch.
  • Simply change the super class of MedMNIST from torch.utils.data.Dataset to collections.abc.Sequence (spelled collections.Sequence before Python 3.10, where that alias was removed), and you will get a standard dataset without PyTorch. Check dataset_without_pytorch.py for more details.
  • You still have most functionality of our MedMNIST code ;)
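
The Sequence-based approach described above can be sketched roughly as follows. This is only an illustration, not the code in dataset_without_pytorch.py: the class name NpzSplit, the file name demo.npz, and the synthetic arrays are assumptions made so the snippet runs without downloading anything. Only the six key names and the array layout come from this README.

```python
import os
import tempfile
from collections.abc import Sequence

import numpy as np


class NpzSplit(Sequence):
    """Read-only view of one split of a MedMNIST-style .npz file.

    Hypothetical helper for illustration; not part of the medmnist package.
    """

    def __init__(self, npz_path, split="train"):
        if split not in ("train", "val", "test"):
            raise ValueError(f"unknown split: {split!r}")
        # Load both arrays into memory, then close the file.
        with np.load(npz_path) as data:
            self.images = data[f"{split}_images"]
            self.labels = data[f"{split}_labels"]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        # Returns an (image, label) pair; NumPy raises IndexError when
        # the index is out of range, which is what Sequence relies on.
        return self.images[index], self.labels[index]


# Build a small synthetic file with the same keys as a 2D gray-scale subset.
rng = np.random.default_rng(0)
arrays = {}
for split, n in (("train", 8), ("val", 2), ("test", 3)):
    arrays[f"{split}_images"] = rng.integers(0, 256, size=(n, 28, 28)).astype(np.uint8)
    arrays[f"{split}_labels"] = rng.integers(0, 2, size=(n, 1))

demo_path = os.path.join(tempfile.mkdtemp(), "demo.npz")
np.savez(demo_path, **arrays)

train = NpzSplit(demo_path, "train")
image, label = train[0]
print(len(train), image.shape, label.shape)
```

Because Sequence supplies iteration on top of __len__ and __getitem__, an object like this can be looped over directly or passed to code that expects a list-like container.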

Dataset

Please download the dataset(s) via Zenodo. You could also use our code to download automatically by setting download=True in dataset.py.

The MedMNIST dataset contains several subsets. Each subset (e.g., pathmnist.npz) comprises 6 keys: train_images, train_labels, val_images, val_labels, test_images and test_labels.

  • train_images / val_images / test_images: N × 28 × 28 for 2D gray-scale datasets, N × 28 × 28 × 3 for 2D RGB datasets, N × 28 × 28 × 28 for 3D datasets. N denotes the number of samples.
  • train_labels / val_labels / test_labels: N × L, where N denotes the number of samples and L the number of task labels. For single-label (binary/multi-class) classification, L=1 and the category labels are {0,1,2,3,..,C} (C=1 for binary). For multi-label classification, L≠1, e.g., L=14 for chestmnist.npz.
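
As a minimal sketch of the layout described above, the snippet below checks the six keys and reports the array shapes of each split. To stay self-contained it writes a small synthetic archive into memory instead of reading a real subset; the function names summarize and is_multi_label, and the sample counts, are assumptions for illustration. The label width of 14 mimics the chestmnist.npz example.

```python
import io

import numpy as np

EXPECTED_KEYS = (
    "train_images", "train_labels",
    "val_images", "val_labels",
    "test_images", "test_labels",
)


def summarize(npz_file):
    """Return {split: (image_shape, label_shape)} after checking the six keys."""
    with np.load(npz_file) as data:
        missing = [key for key in EXPECTED_KEYS if key not in data.files]
        if missing:
            raise KeyError(f"missing keys: {missing}")
        summary = {}
        for split in ("train", "val", "test"):
            images = data[f"{split}_images"]
            labels = data[f"{split}_labels"]
            # Images and labels of a split must describe the same N samples.
            if len(images) != len(labels):
                raise ValueError(f"{split}: image/label count mismatch")
            summary[split] = (images.shape, labels.shape)
    return summary


def is_multi_label(label_shape):
    """L = 1 means single-label; any other L means multi-label."""
    return label_shape[1] != 1


# Synthetic stand-in for a gray-scale, multi-label subset (L = 14).
rng = np.random.default_rng(0)
arrays = {}
for split, n in (("train", 6), ("val", 2), ("test", 3)):
    arrays[f"{split}_images"] = rng.integers(0, 256, size=(n, 28, 28)).astype(np.uint8)
    arrays[f"{split}_labels"] = rng.integers(0, 2, size=(n, 14))

buffer = io.BytesIO()
np.savez(buffer, **arrays)
buffer.seek(0)

shapes = summarize(buffer)
for split, (image_shape, label_shape) in shapes.items():
    print(split, image_shape, label_shape, is_multi_label(label_shape))
```

For a downloaded subset, the same function should accept a file path such as pathmnist.npz in place of the in-memory buffer.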

Command Line Tools

  • List all available datasets:

      python -m medmnist available
    
  • Download all available datasets:

      python -m medmnist download
    
  • Delete all downloaded npz files from root:

      python -m medmnist clean
    
  • Print the dataset details given a subset flag:

      python -m medmnist info --flag=xxxmnist
    
  • Save the dataset as standard figure and csv files, which could be used for AutoML tools, e.g., Google AutoML Vision:

      python -m medmnist save --flag=xxxmnist --folder=tmp/
    
  • Parse and evaluate a standard result file (refer to Evaluator.parse_and_evaluate for details):

      python -m medmnist evaluate --path=folder/{flag}_{split}@{run}.csv
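
The result-file naming pattern {flag}_{split}@{run}.csv used by the evaluate command can be taken apart with a few lines of Python. The sketch below is an assumption-laden illustration, not the logic of Evaluator.parse_and_evaluate: the helper parse_result_name and the ResultName tuple are hypothetical, and restricting the split to train, val, or test is inferred from the three splits described in the Dataset section.

```python
import os
import re
from typing import NamedTuple


class ResultName(NamedTuple):
    flag: str   # subset flag, e.g. "pathmnist"
    split: str  # "train", "val" or "test" (assumed set of splits)
    run: str    # free-form run identifier


# {flag}_{split}@{run}.csv
_RESULT_NAME = re.compile(
    r"^(?P<flag>.+)_(?P<split>train|val|test)@(?P<run>.+)\.csv$"
)


def parse_result_name(path):
    """Split a result file name into its flag, split and run parts."""
    filename = os.path.basename(path)
    match = _RESULT_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a standard result file name: {filename!r}")
    return ResultName(**match.groupdict())


parsed = parse_result_name("folder/pathmnist_test@run1.csv")
print(parsed.flag, parsed.split, parsed.run)
```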
    

Citation

If you find this project useful, please cite both the v1 and v2 papers as:

Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. "MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification". arXiv preprint arXiv:2110.14795, 2021.

Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis". IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.

or using the bibtex:

@article{medmnistv2,
    title={MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification},
    author={Yang, Jiancheng and Shi, Rui and Wei, Donglai and Liu, Zequan and Zhao, Lin and Ke, Bilian and Pfister, Hanspeter and Ni, Bingbing},
    journal={arXiv preprint arXiv:2110.14795},
    year={2021}
}
 
@inproceedings{medmnistv1,
    title={MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis},
    author={Yang, Jiancheng and Shi, Rui and Ni, Bingbing},
    booktitle={IEEE 18th International Symposium on Biomedical Imaging (ISBI)},
    pages={191--195},
    year={2021}
}

If you use any subset of MedMNIST, please also cite the corresponding paper of the source data, as listed on the project page.

LICENSE

The code is under Apache-2.0 License.

The datasets are under Creative Commons (CC) Licenses in general. Each subset keeps the same license as that of the source dataset.
