
analysiscenter / batchflow

198.0 16.0 46.0 155.26 MB

BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflows even for datasets that do not fit into memory.

Home Page: https://analysiscenter.github.io/batchflow/

License: Apache License 2.0

Python 80.01% Jupyter Notebook 19.99%
python3 data-science machine-learning python pipeline-framework pipeline workflow workflow-engine

batchflow's People

Contributors

a-arefina, akoryagin, alexanderkuvaev, alexeykozhevin, anastasiia-prog, annaaltynova, anton-br, bulatvakhitov, cdtn, dependabot[bot], dimonovez, dpodvyaznikov, evgeniys99, gregoryivanov, hollowprincess, kirillemelyanov, nikita-klsh, pennyroyall, roman-kh, sergeytsimfer


batchflow's Issues

Add scikit-learn tutorial

Add a tutorial for scikit-learn models (e.g. logistic regression, SGDClassifier, etc.) that covers both fit and partial_fit.

There is no current event loop in thread

There is a problem with the radio library on Windows:

13:02:01 ERROR: ('Failed parallelizing. Some of the workers failed with following errors: ', [RuntimeError("There is no current event loop in thread 'Thread-7'.",)])

The fix is described here:
https://stackoverflow.com/questions/46727787/runtimeerror-there-is-no-current-event-loop-in-thread-in-async-apscheduler?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

In short, we must call asyncio.set_event_loop(loop) right after loop = asyncio.new_event_loop().
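A minimal sketch of that fix outside of any library, assuming the failing code runs in a worker thread that has no event loop of its own. The names `work` and `worker` are hypothetical and are not part of radio or batchflow:

```python
import asyncio
import threading


async def work():
    # Stand-in for whatever coroutine the worker thread needs to run.
    await asyncio.sleep(0)
    return "done"


def worker(results):
    # Threads other than the main one have no event loop by default,
    # so create one and register it as current for this thread.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        results.append(loop.run_until_complete(work()))
    finally:
        loop.close()
        asyncio.set_event_loop(None)


results = []
thread = threading.Thread(target=worker, args=(results,))
thread.start()
thread.join()
print(results)
```

Without the `set_event_loop` call, code inside the thread that asks for the current loop is what raises the RuntimeError quoted above.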

Can't use list for preloaded data

Hi!
It looks like batches are generated as lists, so it is not worth making extra conversions to numpy and vice versa.

Please consider fixing https://github.com/analysiscenter/dataset/blob/master/dataset/batch.py#L372
I've fixed it with:
    try:
        res = tuple(data_item[self.get_pos(data, comp, index)] if data_item is not None else None
                    for comp, data_item in zip(comps, _data))
    except TypeError:
        res = tuple(list(data_item[i] for i in self.get_pos(data, comp, index)) if data_item is not None else None
                    for comp, data_item in zip(comps, _data))
It's quite ugly.

Below you will find an example.
PS Let me know whether you prefer pull requests or issues. Thank you in advance.

import sys
import numpy as np
import matplotlib.pyplot as plt
sys.path.append("../..")
from dataset import Dataset, DatasetIndex, ImagesBatch

def show_images(batch):
    # Put all images of the batch side by side and show them as one picture
    img = np.concatenate(batch.images, axis=1).reshape(-1, batch.images.shape[1] * len(batch))
    fig, ax = plt.subplots(1, figsize=(10, 4))
    ax.axis('off')
    ax.imshow(img, cmap="gray")
    plt.show()

S = 20
def gen_data(num_items, shape):
    index = np.arange(num_items)
    # data = np.random.randint(0, 255, size=num_items * shape[0] * shape[1])
    # data = data.reshape(num_items, shape[0], shape[1]).astype('uint8')
    # Preloaded data is a plain list of 2D arrays rather than a single numpy array
    data = list(np.random.randint(0, 255, size=shape[0] * shape[1]).reshape(shape[0], shape[1]) for i in range(num_items))
    ds = Dataset(index=index, batch_class=ImagesBatch, preloaded=data)
    return ds, data

# Create a dataset
print("Generating...")
dataset, images = gen_data(5, (S, S))

pipeline = (dataset.pipeline()
                .rotate(p=.5, angle=60, reshape=False)
)
for b in pipeline.gen_batch(2, shuffle=False):
    show_images(b)
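The fallback proposed in this issue can be sketched in isolation as follows. This is only an illustration of the try/except idea under the assumption that positions arrive as an integer array; `take_items` is a hypothetical helper, not a function from the dataset package:

```python
import numpy as np


def take_items(data_item, positions):
    """Select the items at `positions` from a numpy array or a plain list."""
    if data_item is None:
        return None
    try:
        # Fast path: numpy arrays support indexing with an array of positions.
        return data_item[positions]
    except TypeError:
        # Plain lists do not, so fall back to picking items one by one.
        return [data_item[i] for i in positions]


positions = np.array([0, 2])
from_array = take_items(np.arange(10, 15), positions)
from_list = take_items(["a", "b", "c"], positions)
print(from_array, from_list)
```

The list branch returns a list, so downstream code has to accept both containers.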

Incorrect installation via pip

Hey folks,
I've just installed the latest version of dataset using pip (python 3.5.2), so that I could try this example. However, when I tried to import best_practice, it failed with something like ImportError: cannot import enable_best_practice. I only managed to solve the problem by installing the package as a submodule, which seems fishy. You probably forgot to add something somewhere - maybe setup.py.

Also, once I managed to install dataset successfully, I ran into the same problem as #224, but that's a whole different story.

Refactor batch data containers

Batch data could be

  • a simple data type (e.g. list, array, etc)
  • a structured container for named components

A components container should support multiple data types for internal data storage:

  • tuple
  • dict

Updating batch data should be possible with

  • a simple data type for one component
  • a tuple, a dict (or a similar structure such as pandas.DataFrame), or a components container for multiple components

Batch data should support preloaded dataset data via slicing on the first reference.
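One possible reading of these requirements, as a rough sketch only. `ComponentsContainer` and its methods are hypothetical names invented for illustration and do not reflect the actual batchflow design; pandas.DataFrame input and slicing of preloaded data are left out:

```python
class ComponentsContainer:
    """Hypothetical container for named components, stored as a tuple or a dict."""

    def __init__(self, names, data):
        self.names = tuple(names)
        self._data = data  # internal storage is kept in the form it was given

    def get(self, name):
        if isinstance(self._data, dict):
            return self._data[name]
        return self._data[self.names.index(name)]

    def update(self, value, name=None):
        # Normalize the update into a {component name: value} mapping.
        if name is not None:
            new = {name: value}  # a simple data type for one component
        elif isinstance(value, ComponentsContainer):
            new = {n: value.get(n) for n in value.names}
        elif isinstance(value, dict):
            new = dict(value)
        else:
            new = dict(zip(self.names, value))  # a tuple, in component order
        # Write the update back in the same storage type.
        if isinstance(self._data, dict):
            self._data = {**self._data, **new}
        else:
            self._data = tuple(new.get(n, old) for n, old in zip(self.names, self._data))


names = ("images", "labels")
tuple_backed = ComponentsContainer(names, ([1, 2], [0, 1]))
tuple_backed.update([3, 4], name="images")
tuple_backed.update({"labels": [1, 1]})

dict_backed = ComponentsContainer(names, {"images": [], "labels": []})
dict_backed.update(tuple_backed)
```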

The model just train one epoch

The following is my code. I don't know why my model trains for just one epoch, even though I have set up a loop:

import sys
import numpy as np
from radio import dataset as ds
from radio import CTImagesMaskedBatch as CTIMB
from radio.pipelines import combine_crops
from radio.models import Keras3DUNet
from radio.models.keras.losses import dice_loss
from tqdm import tqdm

# Todo: Pre data
DIR_CANCER = 'data/cancer/*'
DIR_NCANCER = 'data/ncancer/*'
train_time = 1
my_epoch = 20
bs = 4
loss_his = []
cix = ds.FilesIndex(path=DIR_CANCER, dirs=True)
ncix = ds.FilesIndex(path=DIR_NCANCER, dirs=True)
cancerset = ds.Dataset(index=cix, batch_class=CTIMB)
ncancerset = ds.Dataset(index=ncix, batch_class=CTIMB)

print("Len:", len(cancerset), len(ncancerset))

# Todo: Build & Train model
unet_config = dict(
    input_shape=(1, 32, 64, 64),
    num_targets=1,
    loss=dice_loss)

from radio.dataset import F, V

train_unet_pipeline = (
    combine_crops(cancerset, ncancerset, batch_sizes=(bs, bs))
    .init_variable('loss_acc', 0)
    .init_variable('current_loss', 0)
    .init_variable('loss_history', init_on_each_run=list)
    .init_variable('cancer_len', len(cancerset))
    .init_variable('ncancer_len', len(ncancerset))
    .init_model(
        name='3dunet', model_class=Keras3DUNet,
        config=unet_config, mode='static'
    )
    .train_model(
        name='3dunet', fetches=[V('loss_acc'), V('cancer_len'), V('ncancer_len')], save_to=V('loss_acc'),
        x=F(CTIMB.unpack, component='images', data_format='channels_first'),
        y=F(CTIMB.unpack, component='masks', data_format='channels_first')
    )
    # .run(batch_size=bs, n_epochs=4, drop_last=True, bar=True)
    .print("loss and acc is:", V('loss_acc'))
    .update_variable('loss_history', value=V('loss_acc'), mode='a')
    # Notice: here we use train_on_batch to train our model, train_on_batch return 2 metrics ['loss', 'acc'].
)

for i in range(my_epoch):
    print(f"epoch {i}")
    t = train_unet_pipeline.run(epoch=1, batch_size=bs)


# for i in tqdm(range(my_epoch)):
#     train_unet_pipeline.run(batch_size=bs, n_epochs=1, shuffle=True)  # cancer + ncancer = 8
#     # here epoch is nothing, loop train_pip.run to get multi epoch
#     loss_his.append(train_unet_pipeline.get_variable('loss_history'))  # list: [ [loss1, acc1], [loss2, acc2], ... ]

# keras_unet = train_unet_pipeline.get_model_by_name('3dunet')
# keras_unet.save('data/weighsts_'+str(train_time))
# loss_his = np.array(loss_his)
# np.save('data/loss_his'+str(train_time)+'.npy', loss_his)

issue with next_batch()

Is there a current workaround?
Error:
Traceback (most recent call last):
  File "/Users/matthewmcquaigue/PycharmProjects/LungCancerDetection/driver.py", line 51, in
    batch = combine_pipeline.next_batch()
  File "/Users/matthewmcquaigue/anaconda3/envs/cv/lib/python3.6/site-packages/radio/batchflow/batchflow/pipeline.py", line 1239, in next_batch
    batch_res = self.next_batch(*args, **kwargs)
  File "/Users/matthewmcquaigue/anaconda3/envs/cv/lib/python3.6/site-packages/radio/batchflow/batchflow/pipeline.py", line 1244, in next_batch
    batch_res = next(self._batch_generator)
StopIteration
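The traceback ends in a bare `next()` call on a generator, which is how Python signals that a generator has nothing left to yield. The sketch below shows that general behaviour and one way to handle it with a plain generator; `batches` is a hypothetical stand-in and this is not the batchflow API:

```python
def batches(items, batch_size):
    # Plain generator that yields consecutive slices of `items`.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


gen = batches(list(range(5)), 2)
collected = []
while True:
    try:
        batch = next(gen)
    except StopIteration:
        # The generator is exhausted; stop asking for more batches.
        break
    collected.append(batch)

print(collected)
```

Whether the pipeline should be reset or rebuilt after exhaustion depends on the library version, which the issue does not state.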

Update conv_block layers

  • rename DropBlock to B
  • get back AlphaDropout as D
  • add depth-wise only separable conv
  • reconcile local and tf separable layers

Tutorials don't work

After the transition to PIL images, actions don't work at all.
Try examples/tutorials/02_pipeline_operations.ipynb
