analysiscenter / batchflow

BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflows even for datasets that do not fit into memory.

Home Page: https://analysiscenter.github.io/batchflow/

License: Apache License 2.0

Python 79.96% Jupyter Notebook 20.04%

python3 data-science machine-learning python pipeline-framework pipeline workflow workflow-engine

batchflow's Introduction

BatchFlow

For more details see the documentation and tutorials.

Main features:

flexible batch generation
deterministic and stochastic pipelines
dataset and pipeline joins and merges
data processing actions
flexible model configuration
within-batch parallelism
batch prefetching
ready to use ML models and proven NN architectures
convenient layers and helper functions to build custom models
a powerful research engine with parallel model training and extended experiment logging.

Basic usage

my_workflow = (my_dataset.pipeline()
               .load('/some/path')
               .do_something()
               .do_something_else()
               .some_additional_action()
               .save('/to/other/path'))

The trick here is that all the processing actions are lazy. They are not executed until their results are needed, e.g. when you request a preprocessed batch:

my_workflow.run(BATCH_SIZE, shuffle=True, n_epochs=5)

for batch in my_workflow.gen_batch(BATCH_SIZE, shuffle=True, n_epochs=5):
    # only now the actions are fired and data is being changed with the workflow defined earlier
    # actions are executed one by one and here you get a fully processed batch
    ...

NUM_ITERS = 1000
for i in range(NUM_ITERS):
    processed_batch = my_workflow.next_batch(BATCH_SIZE, shuffle=True, n_epochs=None)
    # only now the actions are fired and data is changed with the workflow defined earlier
    # actions are executed one by one and here you get a fully processed batch
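
The lazy behaviour described above can be illustrated with a toy sketch. This is not BatchFlow's implementation: `LazyPipeline`, `add` and the simplified `gen_batch` below are hypothetical names, used only to show that chained actions are merely recorded until a batch is requested.

```python
class LazyPipeline:
    """Toy pipeline (hypothetical, not the BatchFlow class): records actions, runs them per batch."""

    def __init__(self, items):
        self.items = list(items)
        self.actions = []          # queued callables, nothing is executed yet

    def add(self, func):
        self.actions.append(func)  # only remember the action
        return self                # returning self allows method chaining

    def gen_batch(self, batch_size):
        for start in range(0, len(self.items), batch_size):
            batch = self.items[start:start + batch_size]
            for action in self.actions:   # actions fire only now
                batch = [action(x) for x in batch]
            yield batch


calls = []

def double(x):
    calls.append(x)   # track when the action really runs
    return 2 * x

workflow = LazyPipeline(range(5)).add(double).add(lambda x: x + 1)
assert calls == []                      # nothing has been executed so far
batches = list(workflow.gen_batch(2))   # requesting batches triggers the actions
```

Defining the workflow costs nothing in this sketch; the work happens only while iterating over `gen_batch`.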

Train a neural network

BatchFlow includes ready-to-use proven architectures like VGG, Inception, ResNet and many others. To apply them to your data, just choose a model, specify the inputs (like the number of classes or the image shape) and call train_model. You can also choose a loss function, an optimizer and many other parameters if you want.

from batchflow import B
from batchflow.models.torch import ResNet34

my_workflow = (my_dataset.pipeline()
               .init_model('model', ResNet34, config={'loss': 'ce', 'classes': 10})
               .load('/some/path')
               .some_transform()
               .another_transform()
               # the model name must match the one registered in init_model
               .train_model('model', inputs=B.images, targets=B.labels)
               .run(BATCH_SIZE, shuffle=True))

For more advanced cases and detailed API see the documentation.

Installation

The BatchFlow module is in the beta stage. Your suggestions and improvements are very welcome.

BatchFlow supports Python 3.6 or higher.

Stable python package

With poetry

poetry add batchflow

With old-fashioned pip

pip3 install batchflow

Development version

With poetry

poetry add --editable git+https://github.com/analysiscenter/batchflow

With old-fashioned pip

pip3 install --editable git+https://github.com/analysiscenter/batchflow#egg=batchflow

Extras

Some batchflow functions and classes require additional dependencies. To use that functionality, you might need to install batchflow with extras (e.g. batchflow[nn]):

image - working with image datasets and plotting
nn - for neural networks (includes torch, torchvision, ...)
datasets - loading standard datasets (MNIST, CIFAR, ...)
profile - performance profiling
jupyter - utility functions for notebooks
research - multiprocess research
telegram - for monitoring pipelines via a telegram bot
dev - batchflow development (pylint, pytest, ...)

You can install several extras at once, like batchflow[image,nn,research].

Projects based on BatchFlow

SeismiQB - ML for seismic interpretation
SeismicPro - ML for seismic processing
PyDEns - DL Solver for ODE and PDE
RadIO - ML for CT imaging
CardIO - ML for heart signals

Citing BatchFlow

Please cite BatchFlow in your publications if it helps your research.

Roman Khudorozhkov et al. BatchFlow library for fast ML workflows. 2017. doi:10.5281/zenodo.1041203

@misc{roman_kh_2017_1041203,
  author       = {Khudorozhkov, Roman and others},
  title        = {BatchFlow library for fast ML workflows},
  year         = 2017,
  doi          = {10.5281/zenodo.1041203},
  url          = {https://doi.org/10.5281/zenodo.1041203}
}

batchflow's Issues

Update torch.Unet with EncoderDecoder

Do this after #345

Create within-batch parallelism

Add DenseNet to torch

Make tests for _assemble

Refactor batch data containers

Batch data could be

a simple data type (e.g. list, array, etc)
a structured container for named components

A components container should support multiple data types for internal data storage:

tuple
dict

Updating batch data should be possible with

a simple data type for one component
a tuple, a dict (etc like pandas.DataFrame) or a components container for multiple components

Batch data should support preloaded dataset data via slicing on the first reference.
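
One possible shape for such a container is sketched below. It only illustrates the requirements listed in this issue: the class name `Components` and its methods are hypothetical and do not refer to an existing BatchFlow API. The sketch stores data in a dict internally and accepts tuples as input, which is just one reading of the "tuple / dict" requirement.

```python
class Components:
    """Hypothetical container for named batch components backed by a dict."""

    def __init__(self, names, data=None):
        self.names = tuple(names)
        self._data = {name: None for name in self.names}
        if data is not None:
            self.update(data)

    def update(self, data):
        """Accept a dict, another Components or a tuple/list ordered like `names`."""
        if isinstance(data, Components):
            data = data._data
        if isinstance(data, dict):
            unknown = set(data) - set(self.names)
            if unknown:
                raise KeyError(f"unknown components: {sorted(unknown)}")
            self._data.update(data)
        elif isinstance(data, (tuple, list)):
            if len(data) != len(self.names):
                raise ValueError("expected one item per component")
            self._data.update(zip(self.names, data))
        else:
            raise TypeError("use set() to update a single component")

    def set(self, name, value):
        """Update one component with a simple data type."""
        if name not in self._data:
            raise KeyError(name)
        self._data[name] = value

    def __getitem__(self, name):
        return self._data[name]

    def as_tuple(self):
        return tuple(self._data[name] for name in self.names)


batch_data = Components(("images", "labels"), data=([10, 20, 30], [0, 1, 0]))
batch_data.set("labels", [1, 1, 0])            # one component, simple data type
batch_data.update({"images": [11, 21, 31]})    # several components via a dict
```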

Can't use list for preloaded data

Hi!
It looks like batches are generated as lists, so it is not worth making extra conversions to numpy and vice versa.

Please consider fixing https://github.com/analysiscenter/dataset/blob/master/dataset/batch.py#L372
I've fixed it with:
try:
    res = tuple(data_item[self.get_pos(data, comp, index)] if data_item is not None else None
                for comp, data_item in zip(comps, _data))
except TypeError:
    res = tuple(list(data_item[i] for i in self.get_pos(data, comp, index)) if data_item is not None else None
                for comp, data_item in zip(comps, _data))
Quite ugly
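
The idea behind this workaround can be shown in isolation. The helper `take` below is a hypothetical name, not part of the library. It first tries direct indexing, which works for a single integer position or for array-like containers that accept a sequence of positions, and falls back to item-by-item selection for a plain list, which cannot be indexed by a sequence.

```python
def take(data_item, pos):
    """Select items at `pos` from `data_item`, falling back for plain lists."""
    if data_item is None:
        return None
    try:
        return data_item[pos]                  # works for int/slice, or array-like containers
    except TypeError:
        return [data_item[i] for i in pos]     # plain list indexed by a sequence of positions


images = ["img0", "img1", "img2", "img3"]
single = take(images, 2)
several = take(images, [0, 3])
missing = take(None, [0, 1])
```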

Below you might find the example.
PS Let me know whether you prefer pull requests or issues. Thank you in advance.

import sys
import numpy as np
import matplotlib.pyplot as plt
sys.path.append("../..")
from dataset import Dataset, DatasetIndex, ImagesBatch

def show_images(batch):
    img = np.concatenate(batch.images, axis=1).reshape(-1, batch.images.shape[1] * len(batch))
    fig, ax = plt.subplots(1, figsize=(10, 4))
    ax.axis('off')
    ax.imshow(img, cmap="gray")
    plt.show()

S = 20
def gen_data(num_items, shape):
    index = np.arange(num_items)
    # data = np.random.randint(0, 255, size=num_items * shape[0] * shape[1])
    # data = data.reshape(num_items, shape[0], shape[1]).astype('uint8')
    data = list(np.random.randint(0, 255, size=shape[0] * shape[1]).reshape(shape[0], shape[1]) for i in range(num_items))
    ds = Dataset(index=index, batch_class=ImagesBatch, preloaded=data)
    return ds, data

# Create a dataset
print("Generating...")
dataset, images = gen_data(5, (S, S))

pipeline = (dataset.pipeline()
            .rotate(p=.5, angle=60, reshape=False)
)
for b in pipeline.gen_batch(2, shuffle=False):
    show_images(b)

The model trains for just one epoch

The following is my code, but I don't know why my model trains for just one epoch, even though I have set up a loop:

import sys
import numpy as np
from radio import dataset as ds
from radio import CTImagesMaskedBatch as CTIMB
from radio.pipelines import combine_crops
from radio.models import Keras3DUNet
from radio.models.keras.losses import dice_loss
from tqdm import tqdm

# Todo: Pre data
DIR_CANCER = 'data/cancer/*'
DIR_NCANCER = 'data/ncancer/*'
train_time = 1
my_epoch = 20
bs = 4
loss_his = []
cix = ds.FilesIndex(path=DIR_CANCER, dirs=True)
ncix = ds.FilesIndex(path=DIR_NCANCER, dirs=True)
cancerset = ds.Dataset(index=cix, batch_class=CTIMB)
ncancerset = ds.Dataset(index=ncix, batch_class=CTIMB)

print("Len:", len(cancerset), len(ncancerset))

# Todo: Build & Train model
unet_config = dict(
    input_shape=(1, 32, 64, 64),
    num_targets=1,
    loss=dice_loss)

from radio.dataset import F, V

train_unet_pipeline = (
    combine_crops(cancerset, ncancerset, batch_sizes=(bs, bs))
    .init_variable('loss_acc', 0)
    .init_variable('current_loss', 0)
    .init_variable('loss_history', init_on_each_run=list)
    .init_variable('cancer_len', len(cancerset))
    .init_variable('ncancer_len', len(ncancerset))
    .init_model(
        name='3dunet', model_class=Keras3DUNet,
        config=unet_config, mode='static'
    )
    .train_model(
        name='3dunet', fetches=[V('loss_acc'), V('cancer_len'), V('ncancer_len')], save_to=V('loss_acc'),
        x=F(CTIMB.unpack, component='images', data_format='channels_first'),
        y=F(CTIMB.unpack, component='masks', data_format='channels_first')
    )
    # .run(batch_size=bs, n_epochs=4, drop_last=True, bar=True)
    .print("loss and acc is:", V('loss_acc'))
    .update_variable('loss_history', value=V('loss_acc'), mode='a')
    # Notice: here we use train_on_batch to train our model, train_on_batch return 2 metrics ['loss', 'acc'].
)

for i in range(my_epoch):
    print(f"epoch {i}")
    t = train_unet_pipeline.run(epoch=1, batch_size=bs)


# for i in tqdm(range(my_epoch)):
#     train_unet_pipeline.run(batch_size=bs, n_epochs=1, shuffle=True)  # cancer + ncancer = 8
#     # here epoch is nothing, loop train_pip.run to get multi epoch
#     loss_his.append(train_unet_pipeline.get_variable('loss_history'))  # list: [ [loss1, acc1], [loss2, acc2], ... ]

# keras_unet = train_unet_pipeline.get_model_by_name('3dunet')
# keras_unet.save('data/weighsts_'+str(train_time))
# loss_his = np.array(loss_his)
# np.save('data/loss_his'+str(train_time)+'.npy', loss_his)

Make tests for model loading / saving

Add UNet++

https://arxiv.org/abs/1807.10165

Tutorials don't work

After the transition to PIL images, actions don't work at all.
Try examples/tutorials/02_pipeline_operations.ipynb

Add PSPNet

https://arxiv.org/abs/1612.01105v2

DatasetIndex is hard-coded in cv_split

Thus, when you split a FilesIndex, its train part becomes a DatasetIndex rather than a FilesIndex.
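
A minimal sketch of the problem and one common way around it is given below. The classes are simplified stand-ins that only share their names with the library's classes, and `split_hardcoded` and `split_preserving` are hypothetical methods. The actual cv_split signature is not reproduced here.

```python
class DatasetIndex:
    """Simplified stand-in for an index class (for illustration only)."""

    def __init__(self, items):
        self.items = list(items)

    def split_hardcoded(self, n_train):
        # the subclass is lost: both parts are always DatasetIndex
        return DatasetIndex(self.items[:n_train]), DatasetIndex(self.items[n_train:])

    def split_preserving(self, n_train):
        # type(self) keeps the class of the index being split
        cls = type(self)
        return cls(self.items[:n_train]), cls(self.items[n_train:])


class FilesIndex(DatasetIndex):
    """Simplified stand-in for a file-based index."""


index = FilesIndex(["a.png", "b.png", "c.png"])
train_bad, _ = index.split_hardcoded(2)
train_ok, test_ok = index.split_preserving(2)
```

With `type(self)`, any subclass of the index keeps its own type after the split without the base class having to know about it.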

Update EncoderDecoder in torch

It seems that the default filters argument in the head of vnet is missing

https://github.com/analysiscenter/dataset/blob/8fee515c97b8a85e7ba9ccd2c8d769e8c0a1c3c1/dataset/models/tf/vnet.py#L40

It raises ValueError: filters cannot be None or 0 if layout includes convolutional layers

Update conv_block layers

rename DropBlock to B
get back AlphaDropout as D
add depth-wise only separable conv
reconcile local and tf separable layers

Move to TF 1.14 and 2.0

Replace all deprecated layers from tf.layers with tf.keras.layers

Allow for block chaining in TFModel and TorchModel

All the model blocks (initial_block, body, head) could be configured as a list of dicts. Each list item then adds one block, in order.
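
The proposal could look roughly like the sketch below. `build_chain` and `describe_block` are hypothetical helpers, and the layout strings and filter counts are placeholder values. None of this is the TFModel or TorchModel API; it only shows how a config entry that is either a dict or a list of dicts could be expanded into a sequence of blocks.

```python
def build_chain(block_config, make_block):
    """Build blocks sequentially from a dict (one block) or a list of dicts (a chain)."""
    if isinstance(block_config, dict):
        block_config = [block_config]
    return [make_block(**item) for item in block_config]


def describe_block(layout, filters):
    # stand-in for real layer construction: just report what would be built
    return f"block(layout={layout!r}, filters={filters})"


config = {
    "body": [dict(layout="cna", filters=32), dict(layout="cna", filters=64)],  # chained blocks
    "head": dict(layout="c", filters=10),                                      # single block
}
body = build_chain(config["body"], describe_block)
head = build_chain(config["head"], describe_block)
```

Accepting both forms keeps existing single-dict configs valid while allowing chains.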

Add scikit-learn tutorial

Add a tutorial for scikit-learn models, e.g. Logistic regression, SGDClassifier, etc, with fit and partial_fit.

Add Xception to torch

Add new activations

Like https://arxiv.org/abs/1908.08681v1, https://arxiv.org/abs/1812.06247, https://arxiv.org/abs/1901.05894

Add a tutorial on how to work with ordinary tf and torch models

There is no current event loop in thread

There is a problem on Windows with the radio library:

13:02:01 ERROR: ('Failed parallelizing. Some of the workers failed with following errors: ', [RuntimeError("There is no current event loop in thread 'Thread-7'.",)])

The way to fix it is described here:
https://stackoverflow.com/questions/46727787/runtimeerror-there-is-no-current-event-loop-in-thread-in-async-apscheduler?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

In short, we must call asyncio.set_event_loop(loop) after loop = asyncio.new_event_loop()
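
The fix can be tried out in a standalone script. The sketch below uses only the standard library and is independent of radio and BatchFlow; the `worker` function and `results` list are illustrative names. It creates a loop inside a non-main thread, registers it as that thread's current loop, and then runs a trivial coroutine on it.

```python
import asyncio
import threading

results = []

def worker():
    # A non-main thread has no event loop by default, so create one and register it.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        # get_event_loop() now returns the loop registered for this thread
        current = asyncio.get_event_loop()
        results.append(current is loop)
        results.append(current.run_until_complete(asyncio.sleep(0, result="done")))
    finally:
        loop.close()
        asyncio.set_event_loop(None)

thread = threading.Thread(target=worker)
thread.start()
thread.join()
```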

Make torch.PyramidNet

Make tests for tutorials

Incorrect installation via pip

Hey folks,
I've just installed the latest version of dataset using pip (python 3.5.2), so that I could try this example. However, when I tried to import best_practice, it failed with something like ImportError: cannot import enable_best_practice. I only managed to solve the problem by installing the package as a submodule, which seems fishy. You probably forgot to add something somewhere - maybe setup.py.

Also, when I managed to successfully install dataset, I've encountered the same problem as #224, but that's a whole different story

TF Model should be able to load with and w/o graph (i.e. with and w/o building)

Update tf.Vnet with EncoderDecoder

Create inter-batch parallelism

Update torch.Vnet with EncoderDecoder

Do this after #345

Add RegressionMetrics

and tests

Update tf.Unet with EncoderDecoder

Add open datasets Imagenette and Imagewoof

https://github.com/fastai/imagenette

Add tests for FilesIndex

esp. for create_subset, create_batch and so on

Make batch pre-fetching

Image examples from dataset/examples/simple_but_ugly/ don't work out of the box

Hi!
I've tried to run several examples from the directory. For example, trying to run https://github.com/analysiscenter/dataset/tree/master/examples/simple_but_ugly fails with a bunch of errors. It looks like several files with definitions (random_scale, random_rotate, convert_to_pil, etc.) are missing.

Add MobileNet to torch

issue with next_batch()

Is there a current workaround?
error:
Traceback (most recent call last):
  File "/Users/matthewmcquaigue/PycharmProjects/LungCancerDetection/driver.py", line 51, in <module>
    batch = combine_pipeline.next_batch()
  File "/Users/matthewmcquaigue/anaconda3/envs/cv/lib/python3.6/site-packages/radio/batchflow/batchflow/pipeline.py", line 1239, in next_batch
    batch_res = self.next_batch(*args, **kwargs)
  File "/Users/matthewmcquaigue/anaconda3/envs/cv/lib/python3.6/site-packages/radio/batchflow/batchflow/pipeline.py", line 1244, in next_batch
    batch_res = next(self._batch_generator)
StopIteration
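
The traceback ends in next(self._batch_generator), which is how Python reports that a generator has no more items. A toy model of this behaviour is sketched below. The `gen_batch` function here is a simplified, hypothetical stand-in written for illustration, not the library's code. It assumes, as in the usage examples earlier in this README, that n_epochs=None means iterating without an epoch limit.

```python
import itertools


def gen_batch(items, batch_size, n_epochs=1):
    """Yield batches for n_epochs passes over items; n_epochs=None never stops."""
    epochs = itertools.count() if n_epochs is None else range(n_epochs)
    for _ in epochs:
        for start in range(0, len(items), batch_size):
            yield items[start:start + batch_size]


finite = gen_batch([1, 2, 3], batch_size=2, n_epochs=1)
first, second = next(finite), next(finite)
try:
    next(finite)                      # the single epoch is over
    exhausted = False
except StopIteration:
    exhausted = True

endless = gen_batch([1, 2, 3], batch_size=2, n_epochs=None)
five = [next(endless) for _ in range(5)]   # keeps cycling through the data
```

In this toy model, the caller either catches StopIteration when the data runs out or requests an unlimited number of epochs.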