The pure and clear PyTorch Distributed Training Framework.

Last update: Dec 20, 2022

Overview

The pure and clear PyTorch Distributed Training Framework.

Introduction
Requirements and Usage
Baselines
Zombie processes problem
Acknowledgments
Citation

Introduction

Distribuuuu is a Distributed Classification Training Framework powered by native PyTorch.

Please check tutorial for detailed Distributed Training tutorials:

Single Node Single GPU Card Training [snsc.py]
Single Node Multi-GPU Cards Training (with DataParallel) [snmc_dp.py]
Multiple Nodes Multi-GPU Cards Training (with DistributedDataParallel)
- torch.distributed.launch [mnmc_ddp_launch.py]
- torch.multiprocessing [mnmc_ddp_mp.py]
- Slurm Workload Manager [mnmc_ddp_slurm.py]
ImageNet training example [imagenet.py]

For the complete training framework, please see distribuuuu.

Requirements and Usage

Dependency

Install PyTorch>= 1.6 (has been tested on 1.6, 1.7.1, 1.8 and 1.8.1)
Install other dependencies: pip install -r requirements.txt

Dataset

Download the ImageNet dataset and move validation images to labeled subfolders, using the script valprep.sh.

Expected datasets structure for ILSVRC

ILSVRC
|_ train
|  |_ n01440764
|  |_ ...
|  |_ n15075141
|_ val
|  |_ n01440764
|  |_ ...
|  |_ n15075141
|_ ...

Create a directory containing symlinks:

mkdir -p /path/to/distribuuuu/data

Symlink ILSVRC:

ln -s /path/to/ILSVRC /path/to/distribuuuu/data/ILSVRC

Basic Usage

Single Node with one task

# 1 node, 8 GPUs
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

Distribuuuu use yacs, a elegant and lightweight package to define and manage system configurations. You can setup config via a yaml file, and overwrite by other opts. If the yaml is not provided, the default configuration file will be used, please check distribuuuu/config.py.

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml \
    OUT_DIR /tmp \
    MODEL.SYNCBN True \
    TRAIN.BATCH_SIZE 256

# --cfg config/resnet18.yaml parse config from file
# OUT_DIR /tmp            overwrite OUT_DIR
# MODEL.SYNCBN True       overwrite MODEL.SYNCBN
# TRAIN.BATCH_SIZE 256    overwrite TRAIN.BATCH_SIZE

Single Node with two tasks

# 1 node, 2 task, 4 GPUs per task (8GPUs)
# task 1:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

# task 2:
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

Multiple Nodes Training

# 2 node, 8 GPUs per node (16GPUs)
# node 1:
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="10.198.189.10" \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

# node 2:
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr="10.198.189.10" \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

Slurm Cluster Usage

# see srun --help 
# and https://slurm.schedmd.com/ for details

# example: 64 GPUs
# batch size = 64 * 128 = 8192
# itertaion = 128k / 8192 = 156 
# lr = 64 * 0.1 = 6.4

srun --partition=openai-a100 \
     -n 64 \
     --gres=gpu:8 \
     --ntasks-per-node=8 \
     --job-name=Distribuuuu \
     python -u train_net.py --cfg config/resnet18.yaml \
     TRAIN.BATCH_SIZE 128 \
     OUT_DIR ./resnet18_8192bs \
     OPTIM.BASE_LR 6.4

Baselines

Baseline models trained by Distribuuuu:

We use SGD with momentum of 0.9, a half-period cosine schedule, and train for 100 epochs.
We use a reference learning rate of 0.1 and a weight decay of 5e-5 (1e-5 For EfficientNet).
The actual learning rate(Base LR) for each model is computed as (batch-size / 128) * reference-lr.
Only standard data augmentation techniques(RandomResizedCrop and RandomHorizontalFlip) are used.

PS: use other robust tricks(more epochs, efficient data augmentation, etc.) to get better performance.

Arch	Params(M)	Total batch	Base LR	[email protected]	[email protected]	model / config
resnet18	11.690	256 (32*8GPUs)	0.2	70.902	89.894	Drive / cfg
resnet18	11.690	1024 (128*8GPUs)	0.8	70.994	89.892
resnet18	11.690	8192 (128*64GPUs)	6.4	70.165	89.374
resnet18	11.690	16384 (256*64GPUs)	12.8	68.766	88.381
efficientnet_b0	5.289	512 (64*8GPUs)	0.4	74.540	91.744	Drive / cfg
resnet50	25.557	256 (32*8GPUs)	0.2	77.252	93.430	Drive / cfg
botnet50	20.859	256 (32*8GPUs)	0.2	77.604	93.682	Drive / cfg
regnetx_160	54.279	512 (64*8GPUs)	0.4	79.992	95.118	Drive / cfg
regnety_160	83.590	512 (64*8GPUs)	0.4	80.598	95.090	Drive / cfg
regnety_320	145.047	512 (64*8GPUs)	0.4	80.824	95.276	Drive / cfg

Zombie processes problem

Before PyTorch1.8, torch.distributed.launch will leave some zombie processes after using Ctrl + C, try to use the following cmd to kill the zombie processes. (fairseq/issues/487):

kill $(ps aux | grep YOUR_SCRIPT.py | grep -v grep | awk '{print $2}')

PyTorch >= 1.8 is suggested, which fixed the issue about zombie process. (pytorch/pull/49305)

Acknowledgments

Provided codes were adapted from:

I strongly recommend you to choose pycls, a brilliant image classification codebase and adopted by a number of projects at Facebook AI Research.

Citation

@misc{bigballon2021distribuuuu,
  author = {Wei Li},
  title = {Distribuuuu: The pure and clear PyTorch Distributed Training Framework},
  howpublished = {\url{https://github.com/BIGBALLON/distribuuuu}},
  year = {2021}
}

Feel free to contact me if you have any suggestions or questions, issues are welcome, create a PR if you find any bugs or you want to contribute. 🍰

The pure and clear PyTorch Distributed Training Framework.

Related tags

Overview

Introduction

Requirements and Usage

Dependency

Dataset

Basic Usage

Slurm Cluster Usage

Baselines

Zombie processes problem

Acknowledgments

Citation

Owner

WILL LEE

A PyTorch implementation for V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation

Referring Video Object Segmentation

Fully Convolutional DenseNet (A.K.A 100 layer tiramisu) for semantic segmentation of images implemented in TensorFlow.

Model that predicts the probability of a Twitter user being anti-vaccination.

Clustering is a popular approach to detect patterns in unlabeled data

Just-Now - This Is Just Now Login Friendlist Cloner Tools

Bayesian Image Reconstruction using Deep Generative Models

Expressive Power of Invariant and Equivaraint Graph Neural Networks (ICLR 2021)

A public available dataset for road boundary detection in aerial images

A fast model to compute optical flow between two input images.

MINOS: Multimodal Indoor Simulator

Code base for "On-the-Fly Test-time Adaptation for Medical Image Segmentation"

PyTorch Implementation of Google Brain's WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

Pytorch implementation of Depth-conditioned Dynamic Message Propagation forMonocular 3D Object Detection

Pytorch reimplementation of PSM-Net: "Pyramid Stereo Matching Network"

UltraGCN: An Ultra Simplification of Graph Convolutional Networks for Recommendation

Code for "Learning Graph Cellular Automata"

A clean implementation based on AlphaZero for any game in any framework + tutorial + Othello/Gobang/TicTacToe/Connect4 and more

A toolkit for developing and comparing reinforcement learning algorithms.

A Japanese Medical Information Extraction Toolkit