decile-team / cords

Reduce end-to-end training time from days to hours (or hours to minutes), and energy requirements/costs by an order of magnitude using coresets and data selection.

Home Page: https://cords.readthedocs.io/en/latest/

License: MIT License

Python 45.44% Jupyter Notebook 54.56%
energy machine-learning deep-learning energy-requirements compute-efficient-ml speedups-training

cords's Introduction


            

COResets and Data Subset Selection


Reduce end-to-end training time from days to hours (or hours to minutes), and energy requirements/costs by an order of magnitude using coresets and data selection.


What is CORDS?

CORDS is a COReset and Data Selection library for making machine learning time-, energy-, cost-, and compute-efficient. CORDS is built on top of PyTorch. Today, deep learning systems are extremely compute-intensive, with significant turnaround times, energy inefficiencies, high costs, and large resource requirements [7, 8]. CORDS is an effort to make deep learning more energy-, cost-, resource-, and time-efficient while not sacrificing accuracy. The following are the goals CORDS tries to achieve:

Data Efficiency

Reducing End-to-End Training Time

Reducing Energy Requirement

Faster Hyper-parameter tuning

Reducing Resource (GPU) Requirement and Costs

The primary purpose of CORDS is to select suitable, representative data subsets from massive datasets, and it does so iteratively. CORDS uses recent advances in data subset selection, particularly ideas from coresets and submodularity, to select such subsets. CORDS implements several state-of-the-art data subset/coreset selection algorithms for efficient supervised learning (SL) and semi-supervised learning (SSL).
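
To give a flavor of the submodularity idea mentioned above, here is a minimal, illustrative sketch of greedy submodular maximization with a facility-location objective, the kind of selection step that submodularity-based strategies build on. This is a toy example for intuition only, not CORDS's actual implementation (the strategies below typically work with gradients and model losses rather than a precomputed similarity matrix):

import numpy as np

def marginal_gain(sim, current_max, candidate):
    # Facility-location gain of adding `candidate`: improvement in each
    # point's best similarity to the selected set, summed over all points.
    return np.maximum(sim[:, candidate], current_max).sum() - current_max.sum()

def greedy_subset(sim, budget):
    # sim:    (n, n) pairwise similarity matrix for the full dataset.
    # budget: number of points to select.
    n = sim.shape[0]
    selected = []
    current_max = np.zeros(n)  # best similarity of each point to the subset
    for _ in range(budget):
        candidates = [j for j in range(n) if j not in selected]
        gains = [marginal_gain(sim, current_max, j) for j in candidates]
        selected.append(candidates[int(np.argmax(gains))])
        current_max = np.maximum(current_max, sim[:, selected[-1]])
    return selected

# Example: pick 10 representatives out of 100 points.
# subset = greedy_subset(np.random.rand(100, 100), 10)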

Some of the algorithms currently implemented in CORDS include:

For Efficient and Robust Supervised Learning:

  • GLISTER [4]
  • GRAD-MATCH [3]
  • CRAIG [5]

For Efficient and Robust Semi-supervised Learning:

  • RETRIEVE [2]

We are continuously incorporating newer and better algorithms into CORDS. Some of the features of CORDS include:

  • Reproducibility of SOTA in Data Selection and Coresets: Enable easy reproducibility of the SOTA results described above. We are also trying to add more algorithms, so if you have an algorithm you would like us to include, please let us know.
  • Benchmarking: We have benchmarked CORDS (and the algorithms present right now) on several datasets, including CIFAR-10, CIFAR-100, MNIST, SVHN, and ImageNet.
  • Ease of Use: One of the main goals of CORDS is to be easy to use and easy to extend. Feel free to contribute to CORDS!
  • Modular design: The data selection algorithms are directly incorporated into data loaders, allowing one to use their own training loop for varied utility scenarios.
  • A broad range of use cases: CORDS currently supports simple image classification tasks and hyperparameter tuning, but we are working on integrating several additional use cases like Auto-ML, object detection, speech recognition, semi-supervised learning, etc.

Highlights

  • 3x to 5x speedups, cost reduction, and energy reductions in the training of deep models in supervised learning
  • 3x+ speedups, cost/energy reduction for deep model training in semi-supervised learning
  • 3x to 30x speedups and cost/energy reduction for Hyper-parameter tuning using subset selection with SOTA schedulers (Hyperband and ASHA) and algorithms (TPE, Random)

Starting with CORDS

Pip Installation

To install the latest version of the CORDS package using PyPI:

pip install cords

From Git Repository

To install using the source:

git clone https://github.com/decile-team/cords.git
cd cords
pip install -r requirements/requirements.txt

First Steps

To better understand CORDS's functionality, we have provided example Jupyter notebooks and Python code in the examples folder, which can be easily executed using Google Colab. We also provide simple SL, SSL, and HPO training loops that run experiments using a provided configuration file. To run these loops, you can look at the following code examples:

Using subset selection based data loaders

Create a subset selection based data loader at train time and use it with your own training loop.

Because the subset selection strategies are integrated directly into the data loaders, using a strategy is straightforward: simply instantiate its corresponding subset selection data loader.

Below is an example showing how the subset selection process reduces to calling a data loader in the supervised learning setting:

import torch
from dotmap import DotMap
from cords.utils.data.dataloader.SL.adaptive import GLISTERDataLoader

# Pass the necessary arguments for GLISTERDataLoader
dss_args = dict(model=model,
                loss=criterion_nored,
                eta=0.01,
                num_classes=10,
                num_epochs=300,
                device='cuda',
                fraction=0.1,
                select_every=20,
                kappa=0,
                linear_layer=False,
                selection_type='SL',
                greedy='Stochastic')
dss_args = DotMap(dss_args)

# Create the GLISTER subset selection dataloader
dataloader = GLISTERDataLoader(trainloader, 
                                valloader, 
                                dss_args, 
                                logger, 
                                batch_size=20, 
                                shuffle=True,
                                pin_memory=False)
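
The snippet above assumes that a model, train/validation loaders, a logger, and a no-reduction loss already exist. For instance, the no-reduction loss referenced as criterion_nored can be created with a standard PyTorch loss:

import torch.nn as nn

# Per-sample losses (reduction='none') so that the subset loader's
# per-sample weights can be applied before reducing to a scalar.
criterion_nored = nn.CrossEntropyLoss(reduction='none')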

for epoch in range(num_epochs):
    for _, (inputs, targets, weights) in enumerate(dataloader):
        # Standard PyTorch training loop using a weighted loss.
        #
        # This loop differs from a standard PyTorch training loop in that,
        # along with the data samples and their target labels, the subset
        # data loader also returns per-sample weights. These are combined
        # with a no-reduction PyTorch loss function to compute the weighted
        # loss for gradient descent.
        inputs = inputs.to('cuda')
        targets = targets.to('cuda', non_blocking=True)
        weights = weights.to('cuda')
        optimizer.zero_grad()
        outputs = model(inputs)
        losses = criterion_nored(outputs, targets)
        loss = torch.dot(losses, weights / weights.sum())
        loss.backward()
        optimizer.step()

In the current version, subset selection data loaders are available in both the supervised learning and semi-supervised learning settings.

Using the default supervised training loop:

from train_sl import TrainClassifier
from cords.utils.config_utils import load_config_data

config_file = '/content/cords/configs/SL/config_glister_cifar10.py'
cfg = load_config_data(config_file)
clf = TrainClassifier(cfg)
clf.train()

Using the default semi-supervised training loop:

from train_ssl import TrainClassifier
from cords.utils.config_utils import load_config_data

config_file = '/content/cords/configs/SSL/config_retrieve-warm_vat_cifar10.py'
cfg = load_config_data(config_file)
clf = TrainClassifier(cfg)
clf.train()

You can use the default configurations that we have provided in the configs folder, or you can create a custom configuration. To write your own configuration file for training, please refer to the CORDS Configuration File Documentation.
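
As a rough illustration only (the field names below are indicative and mirror the GLISTER example above; the authoritative schema is in the CORDS Configuration File Documentation), a custom configuration file is a Python file defining a config object along these lines:

# Hypothetical sketch of a custom configuration file, e.g. my_config.py.
config = dict(
    setting="SL",
    dataset=dict(name="cifar10", datadir="./data", type="image"),
    model=dict(architecture="ResNet18", numclasses=10),
    dss_args=dict(type="GLISTER",
                  fraction=0.1,
                  select_every=20,
                  kappa=0,
                  linear_layer=False,
                  selection_type="SL",
                  greedy="Stochastic"),
    train_args=dict(num_epochs=300, device="cuda"),
)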

Applications

Efficient Hyper-parameter Optimization (HPO)

The subset selection strategies for efficient supervised learning in CORDS allow one to train models faster. This faster subset-based training can, in turn, be used for quicker configuration evaluations in hyper-parameter tuning. A detailed pipeline figure of efficient hyper-parameter tuning using subset-based training for faster configuration evaluations can be seen below:



Any existing data subset selection strategy in CORDS can be combined with existing hyper-parameter search and scheduling algorithms. We currently use the Ray Tune library for hyper-parameter tuning and search algorithms.
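
As a hedged sketch of this pipeline (train_subset_and_eval is a hypothetical helper standing in for subset-based training and evaluation; only the Ray Tune calls, shown in its function-based API, are real library API), one can wrap subset-based training in a trainable and let a scheduler such as ASHA stop poor configurations early:

from ray import tune
from ray.tune.schedulers import ASHAScheduler

def trainable(config):
    # Hypothetical helper: trains on a CORDS-selected subset with the
    # sampled hyper-parameters and returns validation accuracy.
    for epoch in range(config["num_epochs"]):
        val_acc = train_subset_and_eval(lr=config["lr"],
                                        fraction=config["fraction"])
        tune.report(val_acc=val_acc)  # lets the scheduler prune early

analysis = tune.run(
    trainable,
    config={"lr": tune.loguniform(1e-4, 1e-1),
            "fraction": 0.1,
            "num_epochs": 30},
    num_samples=20,
    scheduler=ASHAScheduler(metric="val_acc", mode="max"),
)
print(analysis.get_best_config(metric="val_acc", mode="max"))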

Please refer to the tutorial notebook explaining the usage of CORDS subset selection strategies for efficient hyper-parameter optimization.

Speedups achieved using CORDS

To achieve significantly faster training, one can use the subset selection data loaders from CORDS while keeping the training algorithm the same. The speedups achievable with the subset selection data loaders from CORDS are shown below:

Speedups in Supervised Learning



Speedups in Semi-supervised Learning



Speedups in Hyperparameter Tuning



Tutorials

We have added example Python code and tutorial notebooks under the examples folder. See this link.

Documentation

The documentation for the latest version of CORDS can always be found here.

Contributing to CORDS

We value and encourage contributions from the open-source community to enhance the CORDS library. Here are some guidelines for contributing:

  1. Report issues: If you come across any bugs or have suggestions for improvements, please raise an issue on our GitHub repository. Provide detailed information about the problem or feature request, including steps to reproduce the issue if applicable.

  2. Feature requests: If you have ideas for new features or enhancements, feel free to submit a feature request on GitHub. Clearly describe the proposed functionality and how it aligns with the goals of the CORDS library.

  3. Code contributions: We welcome code contributions to improve CORDS. If you plan to contribute code, please follow these steps:

    • Fork the CORDS repository on GitHub.
    • Create a new branch for your work based on the develop branch.
    • Make your changes and ensure they are well-documented and tested.
    • Submit a pull request, providing a clear explanation of the changes made and their purpose.
  4. Code style: When contributing code, please adhere to the existing code style and formatting conventions used in the CORDS library. Consistency in code style helps maintain readability and makes it easier to review and merge contributions.

  5. Testing: Ensure that your code changes pass the existing tests.

Mailing List

To receive updates about CORDS and to be a part of the community, join the Decile_CORDS_Dev group.

https://groups.google.com/forum/#!forum/Decile_CORDS_Dev/join 

Acknowledgment

This library takes inspiration from, builds upon, and uses pieces of code from several open-source codebases. These include Teppei Suzuki's consistency-based SSL repository and Richard Liaw's Tune repository. CORDS also uses submodlib for submodular optimization.

Team

CORDS is created and maintained by Krishnateja Killamsetty, Dheeraj N Bhat, Rishabh Iyer, and Ganesh Ramakrishnan. We look forward to making CORDS more community-driven. Please use it and contribute to it for your efficient learning research, and feel free to use it for your commercial projects. We will add the major contributors here.

Resources

Blog Articles

Publications

[1]: Krishnateja Killamsetty, Guttu Sai Abhishek, Aakriti, Alexandre V. Evfimievski, Lucian Popa, Ganesh Ramakrishnan, Rishabh Iyer, “AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning”. arXiv [cs.LG], 2022. arXiv:2203.08212.

[2]: Krishnateja Killamsetty, Xujiang Zhao, Feng Chen, and Rishabh Iyer, “RETRIEVE: Coreset Selection for Efficient and Robust Semi-Supervised Learning”. In Advances in Neural Information Processing Systems, NeurIPS 2021.

[3]: Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Abir De, Rishabh Iyer. “GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training”. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, 139:5464–5474. Proceedings of Machine Learning Research. PMLR, 2021.

[4]: Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Rishabh Iyer. “GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning”. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Event, February 2-9, 2021, 8110–8118. AAAI Press, 2021.

[5]: Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. “Coresets for Data-efficient Training of Machine Learning Models”. In International Conference on Machine Learning (ICML), July 2020.

[6]: Vishal Kaushal, Rishabh Iyer, Suraj Kothiwade, Rohan Mahadev, Khoshrav Doctor, and Ganesh Ramakrishnan, “Learning From Less Data: A Unified Data Subset Selection and Active Learning Framework for Computer Vision”. 7th IEEE Winter Conference on Applications of Computer Vision (WACV), 2019 Hawaii, USA

[7]: Schwartz, Roy, et al. "Green AI." arXiv preprint arXiv:1907.10597 (2019).

[8]: Strubell, Emma, Ananya Ganesh, and Andrew McCallum. “Energy and policy considerations for deep learning in NLP.” In ACL 2019.

[9]: Kai Wei, Rishabh Iyer, Jeff Bilmes, “Submodularity in Data Subset Selection and Active Learning”. International Conference on Machine Learning (ICML), 2015.

[10]: Wei, Kai, et al. “Submodular Subset Selection for Large-Scale Speech Training Data”. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.

cords's People

Contributors

aakriti28, atul04, dduan97, dheerajnbhat, dssresearch, durgas16, gsaiabhishek, krishnatejakk, pizhn, rishabhk108, sahasrarjn, savan77, t-kkillamset


cords's Issues

For GRAD-MATCH method, the weights associated with each data point in X (subset of training set)

  1. For the GRAD-MATCH method, there are weights associated with each data point in X (a subset of the training set). Do the weights have physical significance? For example, if the value of a weight is higher, does the corresponding selected data point contribute more to the residual?
  2. During the iteration, the selected index is already among the selected indices, so the iteration breaks. Why does this happen?
    Thanks @krishnatejakk

Evaluation on ImageNet

Hello, thanks for a very interesting and useful project.

Would you mind providing an evaluation method for ImageNet?
I tried adding a loader for ImageNet to custom_dataset.py, but failed due to a GPU memory issue during subset selection.

Many thanks!

Can't install on MACOS

I'm trying to install CORDS on macOS, but there is a dependency conflict. It looks like there is no torchtext==0.10.1 for Mac. Is there any tutorial to make it work on macOS?

Synthetic data experiments and tutorials

  1. Perform subset selection experiments on synthetic data with a detailed visualization of the subsets selected and the testing performance of the models when trained on these subsets.

Models and Examples for Tabular Data

Hi, I'm interested in using cords for tabular data, and I noticed planned work on this page which has not been realized yet. Is there any plan regarding this? How could I contribute?
Thanks!

Refactor the folders in the repo

  • Add a folder called benchmarks which has all the results/benchmarks for the various cases. We should remove the results from the main readme and point to that folder. Also, add the notebooks to reproduce the benchmark results.
  • Rename notebooks to tutorials. Add different tutorials based on use cases (NLP, vision, SSL, hyper-parameter tuning, NAS, etc.)

Logistic Regression support for Gradmatch

The Logistic Regression model throws errors when we do backpropagation. The fix for this is perhaps setting freeze=False in the forward function of utils/models/logreg_net.py.

Possible bug calculating "trn_loss" and "tst_loss"

Hello,

I have noticed a potential bug in the calculation of trn_loss and test_loss. The trn_loss is currently computed on the entire train dataset using train_eval_loader. This data loader has a batch size that is 20 times larger than that of trainloader. Consequently, when calculating the trn_loss with the train_eval_loader, it is necessary to use the batch size of train_eval_loader rather than the batch size of trainloader.

Likewise, when calculating the test_loss, we should use the batch size of test_eval_loader instead of the batch size of testloader.

trn_loss += (loss.item() * trainloader.batch_size)

tst_loss += (loss.item() * testloader.batch_size)
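
In other words, the accumulation should presumably use the evaluation loaders' batch sizes, i.e. something like:

trn_loss += (loss.item() * train_eval_loader.batch_size)

tst_loss += (loss.item() * test_eval_loader.batch_size)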

[Bug] Got weight with same value when running examples.

Hi, I tested the example with Supervised learning and Glister strategy.
https://github.com/decile-team/cords/blob/main/examples/SL/image_classification/python_notebooks/CORDS_SL_CIFAR10_Custom_Train.ipynb
But when I print the weights from the train loader, they are all 1.0. I believe that with the GLISTER strategy, we should get different weights.

tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1.], device='cuda:0')

Is that a bug or something special?
Thanks.

Noisy Label experiments

  1. Perform a detailed analysis of the performance of the subset selection strategies in the presence of noisy labels.

Implement CRUST Algorithm

  1. Implement the CRUST strategy in the supervised learning setting.
  2. Create the CRUST data loader class, building it on top of the adaptive_dataloader class.

Documentation Improvement

  1. Hyperparameter tuning results and the colab tutorial

  2. Make sure all the tutorial links are working

  3. One document explaining all configurable parameters

  4. New main page in documentation with the current results for CORDS

  5. Remove models doc in documentation with a list of all available models

  6. Go over the documentation and make sure everything is okay

Typo in cords_cifar10_glister_train.ipynb

There is a typo in the cords_cifar10_glister_train.ipynb notebook :
https://github.com/decile-team/cords/blob/main/examples/SL/image_classification/cords_cifar10_glister_train.ipynb

glister_trn.configdata.train_args.print_every = 1
glister_trn.configdata.train_args.device = 'cuda'
glister_trn.configdata.dss_args.fraction = fraction

instead of

glister_trn.cfg.train_args.print_every = 1
glister_trn.cfg.train_args.device = 'cuda'
glister_trn.cfg.dss_args.fraction = fraction

Update Documentation

  1. Update readthedocs with the new documentation and release the latest version of CORDS

Segmentation fault (core dumped)

Hi,

I was trying to deploy CORDS selection to my training, but this error popped out Segmentation fault (core dumped).

I imitated code from https://github.com/decile-team/cords/blob/main/examples/SL/image_classification/python_notebooks/CORDS_SL_CIFAR10_Custom_Train.ipynb.

So basically I put my training and testing loader into GLISTERDataLoader, and switched this part into my code

for _, (inputs, targets, weights) in enumerate(dataloader):
    inputs = inputs.to(device)
    targets = targets.to(device, non_blocking=True)
    weights = weights.to(device)
    optimizer.zero_grad()
    outputs = model(inputs)
    losses = criterion_nored(outputs, targets)
    loss = torch.dot(losses, weights/(weights.sum()))
    loss.backward()

Before the modification, my code was running fine, so I believe there is an error inside CORDS. My dataset is CIFAR10.

Thanks

Detectron2 usage

Hi, how can I add sampling hooks to Detectron2 using this library?

Inquiry about performance of gradmatch

Hello, I ran some experiments with gradmatch and randomonline, and found that the two actually reach similar performance (around 93%) after 300 epochs. Is there something important to note for reproducing the results? Thanks for your help!

Questions about accuracy logging

Hello! Thanks for your great work.

I'm currently working on this code and I want to ask a question about accuracy logging.

cords/train.py

Lines 530 to 541 in ff629ff

for batch_idx, (inputs, targets) in enumerate(testloader):
    # print(batch_idx)
    inputs, targets = inputs.to(self.configdata['train_args']['device']), targets.to(self.configdata['train_args']['device'], non_blocking=True)
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    tst_loss += loss.item()
    tst_losses.append(tst_loss)
    if "tst_acc" in print_args:
        _, predicted = outputs.max(1)
        tst_total += targets.size(0)
        tst_correct += predicted.eq(targets).sum().item()
        tst_acc.append(tst_correct/tst_total)

In line 541 of train.py, tst_acc contains cumulative accuracies over input batches. For example, if the loader contains 4500 examples and the batch size is 1000, then tst_acc has 5 accuracies per evaluation (the first element of tst_acc will be the accuracy over the first 1000 examples).

cords/train.py

Lines 631 to 633 in ff629ff

if "tst_loss" in print_args:
if "tst_acc" in print_args:
print("Test Data Loss and Accuracy: ", tst_loss, np.array(tst_acc).max())

In line 633, it prints the best value in tst_acc. In this case, the resulting best accuracies over different algorithms and seeds might be values evaluated on different subsets of the test samples.

Is this what you intended? In my experience, evaluating algorithms on an identical test dataset is the convention.
In addition, are the reported test accuracies in the GRAD-MATCH paper the best values as above, or the final test accuracies?

Best,
Jang-Hyun

Gradmatch Data subset selection method making training slow

I tried to run some experiments as follows:

  • Ran full CIFAR10 without any subset selection method to train ResNet50, which took around 32m 31s.
  • Ran GradMatch CIFAR10 subset selection with a 0.1 fraction, which took longer than full CIFAR10, i.e., 22h 48m 40s.
  • Ran GradMatch CIFAR10 subset selection with a 0.3 fraction, which took even longer than the 0.1 GradMatch selection.

I am using scaled-resolution images of CIFAR10, i.e., 224x224 resolution, and a correspondingly defined ResNet50 architecture.
Can you let me know how to speed up experiments 2 and 3? In general, a subset selection method should make the whole training process faster, right?
