analysiscenter / batchflow

BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflows even for datasets that do not fit into memory.

Home Page: https://analysiscenter.github.io/batchflow/

License: Apache License 2.0

Python 79.96% Jupyter Notebook 20.04%
python3 data-science machine-learning python pipeline-framework pipeline workflow workflow-engine

batchflow's Introduction

BatchFlow

For more details see the documentation and tutorials.

Main features:

  • flexible batch generation
  • deterministic and stochastic pipelines
  • dataset and pipeline joins and merges
  • data processing actions
  • flexible model configuration
  • within-batch parallelism
  • batch prefetching
  • ready-to-use ML models and proven NN architectures
  • convenient layers and helper functions to build custom models
  • a powerful research engine with parallel model training and extended experiment logging.

Basic usage

my_workflow = (my_dataset.pipeline()
               .load('/some/path')
               .do_something()
               .do_something_else()
               .some_additional_action()
               .save('/to/other/path'))

The trick here is that all the processing actions are lazy. They are not executed until their results are needed, e.g. when you request a preprocessed batch:

my_workflow.run(BATCH_SIZE, shuffle=True, n_epochs=5)

or

for batch in my_workflow.gen_batch(BATCH_SIZE, shuffle=True, n_epochs=5):
    # only now are the actions fired and the data changed by the workflow defined earlier
    # actions are executed one by one and here you get a fully processed batch
    ...

or

NUM_ITERS = 1000
for i in range(NUM_ITERS):
    processed_batch = my_workflow.next_batch(BATCH_SIZE, shuffle=True, n_epochs=None)
    # only now the actions are fired and data is changed with the workflow defined earlier
    # actions are executed one by one and here you get a fully processed batch

Train a neural network

BatchFlow includes ready-to-use proven architectures like VGG, Inception, ResNet and many others. To apply them to your data, choose a model, specify the inputs (like the number of classes or the image shape) and call train_model. You can also choose a loss function, an optimizer and many other parameters if you want.

from batchflow import B
from batchflow.models.torch import ResNet34

my_workflow = (my_dataset.pipeline()
               .init_model('model', ResNet34, config={'loss': 'ce', 'classes': 10})
               .load('/some/path')
               .some_transform()
               .another_transform()
               # the name must match the one passed to init_model above
               .train_model('model', inputs=B.images, targets=B.labels)
               .run(BATCH_SIZE, shuffle=True))

For more advanced cases and detailed API see the documentation.

Installation

The BatchFlow module is in beta. Your suggestions and improvements are very welcome.

BatchFlow supports Python 3.6 or higher.

Stable python package

With poetry

poetry add batchflow

With old-fashioned pip

pip3 install batchflow

Development version

With poetry

poetry add --editable git+https://github.com/analysiscenter/batchflow

With old-fashioned pip

pip install --editable git+https://github.com/analysiscenter/batchflow

Extras

Some batchflow functions and classes require additional dependencies. To use that functionality you might need to install batchflow with extras (e.g. batchflow[nn]):

  • image - working with image datasets and plotting
  • nn - for neural networks (includes torch, torchvision, ...)
  • datasets - loading standard datasets (MNIST, CIFAR, ...)
  • profile - performance profiling
  • jupyter - utility functions for notebooks
  • research - multiprocess research
  • telegram - for monitoring pipelines via a telegram bot
  • dev - batchflow development (pylint, pytest, ...)

You can install several extras at once, like batchflow[image,nn,research].

Projects based on BatchFlow

Citing BatchFlow

Please cite BatchFlow in your publications if it helps your research.

Roman Khudorozhkov et al. BatchFlow library for fast ML workflows. 2017. doi:10.5281/zenodo.1041203
@misc{roman_kh_2017_1041203,
  author       = {Khudorozhkov, Roman and others},
  title        = {BatchFlow library for fast ML workflows},
  year         = 2017,
  doi          = {10.5281/zenodo.1041203},
  url          = {https://doi.org/10.5281/zenodo.1041203}
}

batchflow's People

Contributors

a-arefina, akoryagin, alexanderkuvaev, alexeykozhevin, anastasiia-prog, annaaltynova, anton-br, bulatvakhitov, cdtn, dependabot[bot], dimonovez, dmylzenova, dpodvyaznikov, evgeniys99, gregoryivanov, hollowprincess, kirillemelyanov, nikita-klsh, pennyroyall, roman-kh, sergeytsimfer

batchflow's Issues

Refactor batch data containers

Batch data could be

  • a simple data type (e.g. list, array, etc.)
  • a structured container for named components

A components container should support multiple data types for internal data storage:

  • tuple
  • dict

Updating batch data should be possible with

  • a simple data type for one component
  • a tuple, a dict or a similar structure (such as pandas.DataFrame), or a components container for multiple components

Batch data should support preloaded dataset data via slicing on the first reference.

Can't use list for preloaded data

Hi!
It looks like batches are generated as lists, so it is not worth making extra conversions to numpy and vice versa.

Please consider fixing https://github.com/analysiscenter/dataset/blob/master/dataset/batch.py#L372
I've fixed it with:
try:
    res = tuple(data_item[self.get_pos(data, comp, index)] if data_item is not None else None
                for comp, data_item in zip(comps, _data))
except TypeError:
    res = tuple(list(data_item[i] for i in self.get_pos(data, comp, index)) if data_item is not None else None
                for comp, data_item in zip(comps, _data))
Quite ugly

Below you can find the example.
PS Let me know whether you prefer pull requests or issues. Thank you in advance.

import sys
import numpy as np
import matplotlib.pyplot as plt  # needed by show_images below

sys.path.append("../..")
from dataset import Dataset, DatasetIndex, ImagesBatch

def show_images(batch):
    img = np.concatenate(batch.images, axis=1).reshape(-1, batch.images.shape[1] * len(batch))
    fig, ax = plt.subplots(1, figsize=(10, 4))
    ax.axis('off')
    ax.imshow(img, cmap="gray")
    plt.show()

S = 20
def gen_data(num_items, shape):
    index = np.arange(num_items)
    # data = np.random.randint(0, 255, size=num_items * shape[0] * shape[1])
    # data = data.reshape(num_items, shape[0], shape[1]).astype('uint8')
    # preloaded data is a plain list of 2D arrays, not a single numpy array
    data = list(np.random.randint(0, 255, size=shape[0] * shape[1]).reshape(shape[0], shape[1]) for i in range(num_items))
    ds = Dataset(index=index, batch_class=ImagesBatch, preloaded=data)
    return ds, data

# Create a dataset
print("Generating...")
dataset, images = gen_data(5, (S, S))

pipeline = (dataset.pipeline()
            .rotate(p=.5, angle=60, reshape=False)
)
for b in pipeline.gen_batch(2, shuffle=False):
    show_images(b)
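
The idea behind the workaround in this issue can be sketched independently of the library. The helper name take and the sample data are hypothetical; the sketch only shows why indexing with a list of positions works for numpy arrays but raises TypeError for plain lists, and how a per-item fallback covers that case.

```python
def take(data, positions):
    """Select several items from `data` by position (hypothetical helper)."""
    try:
        # Works for containers that accept a list of positions, e.g. numpy arrays.
        return data[positions]
    except TypeError:
        # Plain lists and tuples reject list indices, so pick items one by one.
        return [data[i] for i in positions]


preloaded = ['img0', 'img1', 'img2', 'img3']  # stand-in for preloaded list data
selected = take(preloaded, [0, 2])
```

The fallback always returns a list, so callers that expect an array would still need to handle that difference.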

The model trains for only one epoch

Here is my code. I don't know why my model trains for only one epoch, even though I have set up a loop:

import sys
import numpy as np
from radio import dataset as ds
from radio import CTImagesMaskedBatch as CTIMB
from radio.pipelines import combine_crops
from radio.models import Keras3DUNet
from radio.models.keras.losses import dice_loss
from tqdm import tqdm

# Todo: Pre data
DIR_CANCER = 'data/cancer/*'
DIR_NCANCER = 'data/ncancer/*'
train_time = 1
my_epoch = 20
bs = 4
loss_his = []
cix = ds.FilesIndex(path=DIR_CANCER, dirs=True)
ncix = ds.FilesIndex(path=DIR_NCANCER, dirs=True)
cancerset = ds.Dataset(index=cix, batch_class=CTIMB)
ncancerset = ds.Dataset(index=ncix, batch_class=CTIMB)

print("Len:", len(cancerset), len(ncancerset))

# Todo: Build & Train model
unet_config = dict(
    input_shape=(1, 32, 64, 64),
    num_targets=1,
    loss=dice_loss)

from radio.dataset import F, V

train_unet_pipeline = (
    combine_crops(cancerset, ncancerset, batch_sizes=(bs, bs))
    .init_variable('loss_acc', 0)
    .init_variable('current_loss', 0)
    .init_variable('loss_history', init_on_each_run=list)
    .init_variable('cancer_len', len(cancerset))
    .init_variable('ncancer_len', len(ncancerset))
    .init_model(
        name='3dunet', model_class=Keras3DUNet,
        config=unet_config, mode='static'
    )
    .train_model(
        name='3dunet', fetches=[V('loss_acc'), V('cancer_len'), V('ncancer_len')], save_to=V('loss_acc'),
        x=F(CTIMB.unpack, component='images', data_format='channels_first'),
        y=F(CTIMB.unpack, component='masks', data_format='channels_first')
    )
    # .run(batch_size=bs, n_epochs=4, drop_last=True, bar=True)
    .print("loss and acc is:", V('loss_acc'))
    .update_variable('loss_history', value=V('loss_acc'), mode='a')
    # Notice: here we use train_on_batch to train our model, train_on_batch return 2 metrics ['loss', 'acc'].
)

for i in range(my_epoch):
    print(f"epoch {i}")
    t = train_unet_pipeline.run(epoch=1, batch_size=bs)


# for i in tqdm(range(my_epoch)):
#     train_unet_pipeline.run(batch_size=bs, n_epochs=1, shuffle=True)  # cancer + ncancer = 8
#     # here epoch is nothing, loop train_pip.run to get multi epoch
#     loss_his.append(train_unet_pipeline.get_variable('loss_history'))  # list: [ [loss1, acc1], [loss2, acc2], ... ]

# keras_unet = train_unet_pipeline.get_model_by_name('3dunet')
# keras_unet.save('data/weighsts_'+str(train_time))
# loss_his = np.array(loss_his)
# np.save('data/loss_his'+str(train_time)+'.npy', loss_his)

Tutorials don't work

After the transition to PIL images, actions don't work at all.
Try examples/tutorials/02_pipeline_operations.ipynb

Update conv_block layers

  • rename DropBlock to B
  • get back AlphaDropout as D
  • add depth-wise only separable conv
  • reconcile local and tf separable layers

Add scikit-learn tutorial

Add a tutorial for scikit-learn models, e.g. logistic regression, SGDClassifier, etc., covering fit and partial_fit.

There is no current event loop in thread

There is a problem on Windows with the radio library:

13:02:01 ERROR: ('Failed parallelizing. Some of the workers failed with following errors: ', [RuntimeError("There is no current event loop in thread 'Thread-7'.",)])

The way to fix it is described here:
https://stackoverflow.com/questions/46727787/runtimeerror-there-is-no-current-event-loop-in-thread-in-async-apscheduler?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

In short, we must call asyncio.set_event_loop(loop) after loop = asyncio.new_event_loop()
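
A minimal standalone sketch of that fix, using only the standard library. The names work, worker and results are hypothetical and the snippet is not taken from radio or BatchFlow; it only shows creating a loop inside a worker thread and registering it as that thread's current loop.

```python
import asyncio
import threading

results = []


async def work():
    await asyncio.sleep(0)
    return 'done'


def worker():
    # A worker thread has no current event loop by default, so create one
    # and register it as the current loop for this thread.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        results.append(loop.run_until_complete(work()))
    finally:
        loop.close()


thread = threading.Thread(target=worker)
thread.start()
thread.join()
```

Once the loop is registered, code running in that thread that asks for the current event loop receives this loop instead of raising RuntimeError.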

Incorrect installation via pip

Hey folks,
I've just installed the latest version of dataset using pip (python 3.5.2), so that I could try this example. However, when I tried to import best_practice, it failed with something like ImportError: cannot import enable_best_practice. I only managed to solve the problem by installing the package as a submodule, which seems fishy. You probably forgot to add something somewhere - maybe setup.py.

Also, when I managed to successfully install dataset, I've encountered the same problem as #224, but that's a whole different story

issue with next_batch()

Is there a current workaround?
Error:
Traceback (most recent call last):
  File "/Users/matthewmcquaigue/PycharmProjects/LungCancerDetection/driver.py", line 51, in
    batch = combine_pipeline.next_batch()
  File "/Users/matthewmcquaigue/anaconda3/envs/cv/lib/python3.6/site-packages/radio/batchflow/batchflow/pipeline.py", line 1239, in next_batch
    batch_res = self.next_batch(*args, **kwargs)
  File "/Users/matthewmcquaigue/anaconda3/envs/cv/lib/python3.6/site-packages/radio/batchflow/batchflow/pipeline.py", line 1244, in next_batch
    batch_res = next(self._batch_generator)
StopIteration
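
The traceback ends with next(self._batch_generator) raising StopIteration, which is what any exhausted Python generator does. The following is a generic sketch of guarding against that in calling code; batch_generator and the sample data are hypothetical, and this is not the radio or BatchFlow API, so whether it applies depends on how the pipeline was configured.

```python
def batch_generator(items, batch_size):
    # Hypothetical stand-in for a finite batch source.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


gen = batch_generator([1, 2, 3], 2)
seen = []
while True:
    try:
        seen.append(next(gen))
    except StopIteration:
        # The generator is exhausted: stop instead of crashing.
        break

# Alternatively, pass a default to next() to avoid the exception.
exhausted = next(gen, None)
```

Either approach lets the caller decide what to do when the data runs out, for example ending the loop or creating a new generator.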
