Secure Distributed Training at Scale

Last update: Jul 11, 2022

This repository contains the implementation of the experiments from the paper "Secure Distributed Training at Scale".

Authors: Eduard Gorbunov*, Alexander Borzunov*, Michael Diskin, Max Ryabinin

[PDF] arxiv.org

Overview

The code is organized as follows:

- ./resnet contains a setup for training ResNet18 on CIFAR-10 with simulated Byzantine attackers
- ./albert runs distributed training of ALBERT-large under Byzantine attacks on cloud instances
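
As a rough illustration of what a "simulated Byzantine attacker" means, here is a toy sketch in plain Python. It is not code from this repository and does not implement the paper's method: the sign-flip attack, the coordinate-wise median aggregator, and all function names are assumptions chosen only to show how one malicious peer can corrupt a naive average.

```python
from statistics import median

def mean_aggregate(grads):
    """Average gradient vectors coordinate by coordinate (not robust)."""
    n = len(grads)
    return [sum(column) / n for column in zip(*grads)]

def median_aggregate(grads):
    """Coordinate-wise median: a simple robust aggregator (illustrative swap-in)."""
    return [median(column) for column in zip(*grads)]

def sign_flip_attack(grad, scale=10.0):
    """Hypothetical attacker: send the negated, amplified gradient."""
    return [-scale * g for g in grad]

# Four honest peers report similar gradients; one peer is Byzantine.
honest = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.0, 2.0]]
byzantine = sign_flip_attack([1.0, 2.0])
received = honest + [byzantine]

naive = mean_aggregate(received)     # dragged in the wrong direction by the attacker
robust = median_aggregate(received)  # stays close to the honest gradients

print("mean:  ", naive)
print("median:", robust)
```

With a single attacker among five peers, the mean changes sign in both coordinates while the median stays at the honest value. The experiments in ./resnet and ./albert use their own attack and defense implementations.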

ResNet18

This setup uses torch.distributed for parallelism.

Requirements

- Python >= 3.7 (we recommend Anaconda Python 3.8)
- Dependencies: pip install jupyter "torch>=1.6.0" "torchvision>=0.7.0" tensorboard (the quotes keep the shell from treating >= as an output redirection)
- A machine with at least 16GB RAM and either a GPU with >24GB memory or 3 GPUs with at least 10GB memory each
- We tested the code on Ubuntu Server 18.04; it should work on all major Linux distributions. For Windows, we recommend using Docker (e.g. via Kitematic).

Running experiments: please open ./resnet/RunExperiments.ipynb and follow the instructions in that notebook. The learning curves are written to TensorBoard logs, which you can view with tensorboard --logdir btard/resnet.

ALBERT

This setup spawns distributed nodes that collectively train ALBERT-large on WikiText-103. It uses a modified version of the hivemind library in which some peers can be programmed to become Byzantine and perform various types of attacks on the training process.

Requirements

The experiments are optimized for 16 instances, each with a single T4 GPU.
- For your convenience, we provide a cost-optimized AWS starter notebook that can run the experiments (see below)
- While the setup can be simulated on a single node, doing so will require additional tuning depending on the number and type of GPUs available.
If you run the experiments manually, please install the core library on each machine:
- The code requires Python >= 3.7 (we recommend Anaconda Python 3.8)
- Install the library: cd ./albert/hivemind/ && pip install -e .
- If the installation succeeds, the library becomes available as import hivemind
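
One way to confirm the installation step on each machine is a short check like the sketch below. It relies only on the standard library; the helper name is_importable is ours and is not part of hivemind.

```python
import importlib.util

def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Prints False on a machine where the editable install has not been done yet.
print("hivemind importable:", is_importable("hivemind"))
```

Running this on every instance before starting an experiment catches a missing or failed pip install -e . early.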

Running experiments: the notebook ./albert/experiments/RunExperiments.ipynb runs a distributed ALBERT experiment in the AWS cloud using preemptible T4 instances. The learning curves are posted to the Wandb project specified during the notebook setup.

Expected cloud costs: a training experiment with 16 hosts costs approximately $60 per day on g4dn.xlarge and $90 per day on g4dn.2xlarge instances. A full training experiment can be expected to converge in ≈3 days. Once the model is trained, one can restart training from intermediate checkpoints and simulate attacks. One attack episode takes 4-5 hours, depending on cloud availability.
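
The figures above translate into a rough budget as follows. This is back-of-the-envelope arithmetic only; it assumes that an attack episode uses the same 16 hosts at the same daily rate as training, which the text does not state explicitly.

```python
# Approximate daily cost for all 16 hosts, taken from the figures above (USD).
DAILY_COST_USD = {"g4dn.xlarge": 60.0, "g4dn.2xlarge": 90.0}
TRAINING_DAYS = 3.0  # approximate time to convergence

def run_cost(instance_type: str, hours: float) -> float:
    """Cost of keeping the whole 16-host setup up for `hours` hours."""
    return DAILY_COST_USD[instance_type] * hours / 24.0

for instance_type in DAILY_COST_USD:
    training = run_cost(instance_type, TRAINING_DAYS * 24)
    attack_low = run_cost(instance_type, 4)   # 4-hour attack episode (assumed same rate)
    attack_high = run_cost(instance_type, 5)  # 5-hour attack episode (assumed same rate)
    print(f"{instance_type}: training ~${training:.2f}, "
          f"attack episode ~${attack_low:.2f}-${attack_high:.2f}")
```

Under these assumptions, a full training run comes to about $180 on g4dn.xlarge and $270 on g4dn.2xlarge, and each attack episode adds roughly $10-$12.50 or $15-$18.75 respectively.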

Owner

Yandex Research
