
Overview



COResets and Data Subset Selection


Reduce end-to-end training time from days to hours (or hours to minutes), and energy requirements/costs by an order of magnitude, using coresets and data selection.


What is CORDS?

CORDS is a COReset and Data Selection library for making machine learning time-, energy-, cost-, and compute-efficient. CORDS is built on top of PyTorch. Deep learning systems today are extremely compute-intensive, with long turnaround times, energy inefficiencies, high costs, and heavy resource requirements [1,2]. CORDS is an effort to make deep learning more energy-, cost-, resource-, and time-efficient without sacrificing accuracy. CORDS tries to achieve the following goals:

Data Efficiency

Reducing End-to-End Training Time

Reducing Energy Requirement

Faster Hyper-parameter tuning

Reducing Resource (GPU) Requirement and Costs

The primary purpose of CORDS is to select the right representative data subsets from massive datasets, and it does so iteratively. CORDS uses recent advances in data subset selection, particularly ideas from coresets and submodularity, to select such subsets. CORDS implements a number of state-of-the-art data subset selection and coreset algorithms, including GLISTER, GradMatch, CRAIG, and submodular selection strategies (all referenced in the comments below).

We are continuously incorporating newer and better algorithms into CORDS. Some of the features of CORDS include:

  • Reproducibility of SOTA in Data Selection and Coresets: Enable easy reproducibility of the SOTA results described above. We are also trying to add more algorithms, so if you have an algorithm you would like us to include, please let us know.
  • Benchmarking: We have benchmarked CORDS (and the algorithms present right now) on several datasets including CIFAR-10, CIFAR-100, MNIST, SVHN and ImageNet.
  • Ease of Use: One of the main goals of CORDS is to be easy to use and easy to extend. Feel free to contribute to CORDS!
  • Modular design: The data selection algorithms are decoupled from the training loop, enabling modular use across varied scenarios (see the sketch after this list).
  • Broad range of use cases: CORDS currently supports simple image classification tasks and hyperparameter tuning, but we are working on integrating a number of additional use cases, such as object detection, speech recognition, semi-supervised learning, AutoML, etc.
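
As a quick illustration of this modular design, here is a minimal, hypothetical sketch of a training loop driven by a CORDS adaptive dataloader. The GLISTERDataLoader name and the (inputs, targets, weights) batch format come from the example notebook quoted in the comments below; the exact constructor arguments vary by version, so treat this as a sketch rather than the definitive API:

    import torch
    import torch.nn as nn

    # Per-sample losses, so each sample can be weighted by the selector.
    criterion_nored = nn.CrossEntropyLoss(reduction='none')

    def train_epoch(model, dataloader, optimizer, device='cuda'):
        # `dataloader` is assumed to be a CORDS adaptive dataloader (e.g.
        # GLISTERDataLoader) wrapping the original train loader; it yields
        # a weight per selected sample alongside inputs and targets.
        model.train()
        for inputs, targets, weights in dataloader:
            inputs = inputs.to(device)
            targets = targets.to(device, non_blocking=True)
            weights = weights.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            losses = criterion_nored(outputs, targets)
            # Normalized weighted sum of per-sample losses.
            loss = torch.dot(losses, weights / weights.sum())
            loss.backward()
            optimizer.step()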

Installation

  1. To install the latest version of the CORDS package from PyPI:

    pip install -i https://test.pypi.org/simple/ cords
  2. To install from source:

    git clone https://github.com/decile-team/cords.git
    cd cords
    pip install -r requirements/requirements.txt
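
Either way, a bare import is a quick sanity check that the package is importable (for the source install, run this from the repository root):

    python -c "import cords"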

Next Steps

Tutorials

Documentation

The documentation for the latest version of CORDS can always be found here.

Comments
  • Logistic Regression support for Gradmatch

    The Logistic Regression model throws errors when we do backpropagation. The fix is perhaps to set freeze=False in the forward function of utils/models/logreg_net.py (a rough sketch of the idea follows).
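
    A minimal sketch of why such a flag breaks backpropagation (a hypothetical model; the actual logreg_net.py may differ):

        import torch
        import torch.nn as nn

        class LogRegNet(nn.Module):
            # Hypothetical stand-in for utils/models/logreg_net.py.
            def __init__(self, input_dim, num_classes):
                super().__init__()
                self.linear = nn.Linear(input_dim, num_classes)

            def forward(self, x, freeze=True):
                if freeze:
                    # Under no_grad the output carries no autograd graph, so
                    # a later loss.backward() raises an error -- hence the
                    # suggested freeze=False fix.
                    with torch.no_grad():
                        return self.linear(x)
                return self.linear(x)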

    opened by nlokeshiisc 4
  • [Bug] Got weight with same value when running examples.

    Hi, I tested the example with supervised learning and the GLISTER strategy: https://github.com/decile-team/cords/blob/main/examples/SL/image_classification/python_notebooks/CORDS_SL_CIFAR10_Custom_Train.ipynb But when I print the weights from the train loader, they are all 1.0. I believed that by using the GLISTER strategy we would get different weights.

    tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
            1., 1.], device='cuda:0')
    

    Is that a bug or something special? Thanks.

    opened by HaoKang-Timmy 3
  • Segmentation fault (core dumped)

    Hi,

    I was trying to deploy CORDS selection in my training, but this error popped up: Segmentation fault (core dumped).

    I imitated code from https://github.com/decile-team/cords/blob/main/examples/SL/image_classification/python_notebooks/CORDS_SL_CIFAR10_Custom_Train.ipynb.

    So basically I put my training and testing loaders into GLISTERDataLoader, and switched this part of my code:

    for _, (inputs, targets, weights) in enumerate(dataloader):
        inputs = inputs.to(device)
        targets = targets.to(device, non_blocking=True)
        weights = weights.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        losses = criterion_nored(outputs, targets)
        loss = torch.dot(losses, weights/(weights.sum()))
        loss.backward()

    My code was running fine before this modification, so I believe there is an error inside CORDS. My dataset is CIFAR10.

    Thanks

    opened by chengwuxinlin 2
  • Replace apricot with submodlib

    Fixes #16
    submodlib is now used for the CRAIG strategy/dataloader as well as the submodular strategy/dataloader. Please let me know if you have any feedback!

    Notes:

    • I am not sure if sum redundancy (a submodular function implemented in apricot) has an analogue in submodlib, so it is disabled as an option for now.
    • It doesn't seem like submodularselectionstrategy.py is used in the corresponding dataloader. This may be a good opportunity to refactor, so that behavior is consistent between the two.
    • Any existing code that specifies the "optimizer" (greedy algorithm) used by apricot will break, since the names used by submodlib differ from those used by apricot (e.g. 'LazyGreedy' instead of 'Lazy'). This includes configs that use this option; a possible migration helper is sketched below.
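
    A minimal sketch of such a config migration (the submodlib optimizer names 'NaiveGreedy', 'LazyGreedy', and 'StochasticGreedy' do exist in submodlib; the apricot-side keys here are assumptions):

        # Map old apricot-style optimizer names in configs to submodlib's
        # optimizer names.
        APRICOT_TO_SUBMODLIB = {
            'Lazy': 'LazyGreedy',
            'Naive': 'NaiveGreedy',
            'Stochastic': 'StochasticGreedy',
        }

        def migrate_optimizer_name(name: str) -> str:
            # Names already in submodlib form pass through unchanged.
            return APRICOT_TO_SUBMODLIB.get(name, name)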
    opened by ghost 1
  • Typo in cords_cifar10_glister_train.ipynb

    There is a typo in the cords_cifar10_glister_train.ipynb notebook (https://github.com/decile-team/cords/blob/main/examples/SL/image_classification/cords_cifar10_glister_train.ipynb): it uses

    glister_trn.configdata.train_args.print_every = 1
    glister_trn.configdata.train_args.device = 'cuda'
    glister_trn.configdata.dss_args.fraction = fraction
    

    instead of

    glister_trn.cfg.train_args.print_every = 1
    glister_trn.cfg.train_args.device = 'cuda'
    glister_trn.cfg.dss_args.fraction = fraction
    
    opened by eendee 1
  • Evaluation on ImageNet

    Hello, thanks for a very interesting and useful project.

    Would you mind providing an evaluation method for ImageNet? I tried adding a loader for ImageNet to custom_dataset.py, but failed due to a GPU memory issue during subset selection.

    Many thanks!

    opened by Hayoung93 1
  • For GRAD_MATCH method, the weights associated with each data point in X(subset of training set)

    1. For the GRAD-MATCH method, there are weights associated with each data point in X (the subset of the training set). Do the weights have a physical significance? For example, does a higher weight mean that the selected data point contributes more to the residual? (See the sketch of the GRAD-MATCH objective below.)
    2. During the iteration, the selected index is already in the selected indices, so the iteration breaks. Why does this happen?
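
    For context, paraphrasing the GRAD-MATCH paper (not this repository's documentation): the subset S and its weights w are chosen to approximately minimize the gradient-matching error below, so a larger w_i does mean that sample i's gradient contributes more to approximating the full-training-set gradient:

        \min_{\mathbf{w},\,S} \Bigl\| \sum_{i \in S} w_i \,\nabla_\theta L_i(\theta) \;-\; \nabla_\theta L_{\mathrm{full}}(\theta) \Bigr\|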
    opened by lishaguo 1
  • Questions about accuracy logging

    Hello! Thanks for your great work.

    I'm currently working on this code and I want to ask a question about accuracy logging.

    https://github.com/decile-team/cords/blob/ff629ff15fac911cd3b82394ffd278c42dacd874/train.py#L530-L541

    In line 541 of train.py, val_acc contains cumulative accuracies over input batches. For example, if the loader contains 4500 examples and the batch size is 1000, then tst_acc has 5 accuracies per evaluation (the first element of tst_acc will be the accuracy over the first 1000 examples).

    https://github.com/decile-team/cords/blob/ff629ff15fac911cd3b82394ffd278c42dacd874/train.py#L631-L633

    In line 633, it prints the best value in tst_acc. In this case, the resulting best accuracies over different algorithms and seeds might be values evaluated on different test samples.

    Is this what you intended? In my experience, evaluating algorithms on an identical test dataset is the convention (see the sketch below). In addition, are the test accuracies reported in the GRAD-MATCH paper the best values as above, or the final test accuracies?
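
    For reference, a minimal sketch of that convention (one accuracy over the whole test set; the names here are illustrative, not the CORDS API):

        import torch

        def evaluate(model, test_loader, device='cuda'):
            # One accuracy over the entire test set, so every algorithm and
            # seed is scored on identical samples.
            model.eval()
            correct, total = 0, 0
            with torch.no_grad():
                for inputs, targets in test_loader:
                    inputs, targets = inputs.to(device), targets.to(device)
                    preds = model(inputs).argmax(dim=1)
                    correct += (preds == targets).sum().item()
                    total += targets.size(0)
            return 100.0 * correct / total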

    Best, Jang-Hyun

    opened by Janghyun1230 1
  • CORDS gradient calculations for different loss functions

    a) Implement gradient calculations for squared loss, negative logistic loss, hinge loss, and a general loss-function gradient computation (a rough sketch for two of these follows below).

    b) Integrate the new gradient calculations with the different selection strategies.
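
    A rough sketch of closed-form gradients with respect to the model scores for two of the losses above (squared loss and binary hinge); the shapes and conventions here are assumptions, not CORDS code:

        import torch

        def squared_loss_grad(scores, targets):
            # L = 0.5 * ||scores - targets||^2  =>  dL/dscores = scores - targets
            return scores - targets

        def hinge_loss_grad(scores, labels):
            # Binary hinge with labels in {-1, +1}: L = max(0, 1 - y*s)
            # dL/ds = -y where the margin is violated (1 - y*s > 0), else 0.
            violated = (1 - labels * scores) > 0
            return torch.where(violated, (-labels).to(scores.dtype),
                               torch.zeros_like(scores))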

    enhancement 
    opened by krishnatejakk 1
  • Refactor the folders in the repo

    • Add a folder called benchmarks that has all the results/benchmarks for the various cases. We should remove the results from the main README and point to that folder instead. Also, add the notebooks used to reproduce the benchmark results.
    • Rename notebooks to tutorials. Add different tutorials based on use cases (NLP, vision, SSL, hyperparameter tuning, NAS, etc.).
    opened by rishabhk108 0
  • Inquiry about performance of gradmatch

    Hello, I ran some experiments with GradMatch and random online selection, and found that the two actually reach similar performance after 300 epochs (around 93%). Is there something important to note for reproducing the reported results? Thanks for your help!

    opened by pipilurj 0
  • Implement faster version of OMP

    Implement the following versions of OMP (a plain OMP baseline is sketched after this list for reference):

    1. FNNOMP (https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7012095)
    2. SNNOMP (https://hal.univ-lorraine.fr/hal-01585253/document)
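
    For reference, a plain OMP baseline under standard assumptions (A is the dictionary matrix, b the target vector, k the sparsity budget); the FNNOMP/SNNOMP variants above add non-negativity constraints and speed-ups on top of this:

        import numpy as np

        def omp(A, b, k):
            # Greedy OMP: pick the atom most correlated with the residual,
            # then refit the coefficients on all selected atoms.
            n = A.shape[1]
            residual = b.astype(float)
            selected, coef = [], np.zeros(n)
            sol = np.zeros(0)
            for _ in range(k):
                j = int(np.argmax(np.abs(A.T @ residual)))
                if j in selected:
                    break  # no new atom reduces the residual
                selected.append(j)
                sol, *_ = np.linalg.lstsq(A[:, selected], b, rcond=None)
                residual = b - A[:, selected] @ sol
            coef[selected] = sol
            return coef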
    high priority in progress 
    opened by krishnatejakk 0
  • Gradmatch Data subset selection method making training slow

    I tried to run some experiments as follows:

    • Ran full CIFAR-10 without any subset selection method to train ResNet-50, which took around 32m 31s.
    • Ran GradMatch CIFAR-10 subset selection with a 0.1 fraction, which took longer than full CIFAR-10, i.e. 22h 48m 40s.
    • Ran GradMatch CIFAR-10 subset selection with a 0.3 fraction, which took longer than the 0.1 GradMatch run.

    I am using scaled-resolution CIFAR-10 images, i.e. 224x224, and defined the ResNet-50 architecture accordingly. Can you let me know how to speed up experiments 2 and 3? In general, a subset selection method should speed up the whole training process, right?

    opened by animesh-007 9
  • Implement CRUST Algorithm

    1. Implement the CRUST strategy in the supervised learning setting.
    2. Create the CRUST data loader class, building it on top of the adaptive_dataloader class.
    enhancement 
    opened by krishnatejakk 0
Releases (v0.0.1)
  • v0.0.1(Mar 24, 2022)

    What's Changed

    • Selcon sahasra by @sahasrarjn in https://github.com/decile-team/cords/pull/73
    • Selcon sahasra by @sahasrarjn in https://github.com/decile-team/cords/pull/74

    New Contributors

    • @sahasrarjn made their first contribution in https://github.com/decile-team/cords/pull/73

    Full Changelog: https://github.com/decile-team/cords/compare/v0.0.0...v0.0.1

  • v0.0.0(Mar 4, 2022)

    Pre-release of CORDS

    What's Changed

    • Dev by @krishnatejakk in https://github.com/decile-team/cords/pull/9
    • CONFIG Files Pull by @krishnatejakk in https://github.com/decile-team/cords/pull/10
    • New Gradient Computation Code by @krishnatejakk in https://github.com/decile-team/cords/pull/11
    • Feature: add support for hyperparameter tuning with subset selection by @savan77 in https://github.com/decile-team/cords/pull/12
    • Added checkpoints to save the model and updated documentation by @dheerajnbhat in https://github.com/decile-team/cords/pull/15
    • test CI and dual tests by @noilreed in https://github.com/decile-team/cords/pull/29
    • Dual CI flow merge to main by @noilreed in https://github.com/decile-team/cords/pull/30
    • Refactor/data loader by @krishnatejakk in https://github.com/decile-team/cords/pull/36
    • Refactor/data loader by @krishnatejakk in https://github.com/decile-team/cords/pull/40
    • Refactor/data loader by @krishnatejakk in https://github.com/decile-team/cords/pull/66

    New Contributors

    • @krishnatejakk made their first contribution in https://github.com/decile-team/cords/pull/9
    • @savan77 made their first contribution in https://github.com/decile-team/cords/pull/12
    • @dheerajnbhat made their first contribution in https://github.com/decile-team/cords/pull/15
    • @noilreed made their first contribution in https://github.com/decile-team/cords/pull/29

    Full Changelog: https://github.com/decile-team/cords/commits/v0.0.0

Owner
decile-team
DECILE: Data EffiCient machIne LEarning