graykode / matorage

Home Page: https://matorage.readthedocs.io

License: Apache 2.0

deep-learning storage-manager pytorch tensorflow
matorage's Introduction


An efficient way to store, load, and manage datasets, models, and optimizers for deep learning with matorage!

Matorage is a tensor (multidimensional matrix) object storage manager for deep learning frameworks (PyTorch, TensorFlow V2, Keras).

Features

  • Ready-made (boilerplate) data pipelines for datasets, models, and optimizers.
  • High-performance tensor storage.

For researchers who need to focus on model training:

  • Store data as pre-processed tensors (multidimensional matrices), eliminating pre-processing time during training.
  • Reduce storage space through multiple compression methods.
  • Manage data and models while training.

For AI developers who need to focus on creating data pipelines:

  • Concurrent data saving & loading.
  • Compatible with object storage such as MinIO and S3.
  • Generate pipelines from user endpoint data.

Quick Start with a PyTorch Example

For a TensorFlow example, refer to the detailed documentation. If you want to see the full code, see below.

0. Install matorage with pip

$ pip install matorage

1. Set up a MinIO server with Docker

Quick start with NAS (network-attached storage) using Docker. The server can be managed through the web at http://127.0.0.1:9000/, and access is secured with MINIO_ACCESS_KEY and MINIO_SECRET_KEY.

$ mkdir ~/shared # create NAS storage folder
$ docker run -it -p 9000:9000 \
    --restart always \
    -e "MINIO_ACCESS_KEY=minio" \
    -e "MINIO_SECRET_KEY=miniosecretkey" \
    -v ~/shared:/container/vol \
    minio/minio gateway nas /container/vol
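
To check that the server is reachable before saving data, you can list its buckets. A minimal sketch, assuming the minio Python client (pip install minio) is available:

from minio import Minio

client = Minio(
    '127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    secure=False  # the local gateway above serves plain HTTP
)
print(client.list_buckets())  # an empty list on a fresh server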

2. Save pre-processed dataset

First, import matorage and create a DataConfig. This example pre-processes MNIST and stores it in distributed storage. additional is a free-form dict, and the shape and type of each tensor to be stored are recorded in attributes.

from matorage import DataConfig

traindata_config = DataConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    dataset_name='mnist',
    additional={
        "mode": "train",
        "framework" : "pytorch",
        ...
        "blah" : "blah"
    },
    attributes=[
        ('image', 'float32', (1, 28, 28)),
        ('target', 'int64', (1,))  # shapes are tuples, so a trailing comma is needed
    ]
)

Now do a simple pre-processing and save the data.

from torch.utils.data import DataLoader
from tqdm import tqdm

from matorage import DataSaver

traindata_saver = DataSaver(config=traindata_config)
# dataset: a pre-processed MNIST dataset (e.g. torchvision.datasets.MNIST)
train_loader = DataLoader(dataset, batch_size=60, num_workers=8)
for (image, target) in tqdm(train_loader):
    # image shape : torch.Size([60, 1, 28, 28])
    # target shape : torch.Size([60])
    traindata_saver({
        'image': image,
        'target': target
    })
traindata_saver.disconnect()

3. Load dataset from matorage

Now, when training, fetch data iteratively from storage using the same config that was used to save the dataset.

from torch.utils.data import DataLoader
from matorage.torch import Dataset

train_dataset = Dataset(config=traindata_config, clear=True)
train_loader = DataLoader(
    train_dataset, batch_size=64, num_workers=8, shuffle=True
)

for batch_idx, (image, target) in enumerate(tqdm(train_loader)):
    image, target = image.to(device), target.to(device)

A single index can also be fetched through lazy loading:

train_dataset = Dataset(config=traindata_config, clear=True)
print(train_dataset[0], len(train_dataset))

4. Save & Load Model when training

During training, you can save and load models at specific steps or epochs to and from distributed storage, in memory. First, create the model config in the same way as the dataset config.

from matorage import ModelConfig
from matorage.torch import ModelManager

model_config = ModelConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    model_name='mnist_simple_training',
    additional={
        "version" : "1.0.1",
        ...
        "blah" : "blah"
    }
)

model_manager = ModelManager(config=model_config)
print(model_manager.get_metadata)
model_manager.save(model, epoch=1)
print(model_manager.get_metadata)

When an empty model is loaded for a specific step or epoch, the appropriate weights are filled into the model.

print(model.state_dict())
model_manager.load(model, epoch=1)
print(model.state_dict())
# load a single layer's weights.
print(model_manager.load('net1.0.weight', step=0))

5. Save & Load Optimizer when training

Saving and loading an optimizer is similar to managing a model.

from matorage import OptimizerConfig
from matorage.torch import OptimizerManager

optimizer_config = OptimizerConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    optimizer_name='adam',
    additional={
        "model" : "1.0.1",
        ...
        "blah" : "blah"
    }
)

optimizer_manager = OptimizerManager(config=optimizer_config)
print(optimizer_manager.get_metadata)
# The optimizer contains information about the step.
optimizer_manager.save(optimizer)
print(optimizer_manager.get_metadata)

When an empty optimizer is loaded at a specific step, the appropriate state is filled into the optimizer.

import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=0.01)
optimizer_manager.load(optimizer, step=938)

Unittest

$ git clone https://github.com/graykode/matorage && cd matorage
$ python -m tests.test_suite

Framework Requirements

  • torch(>=1.0.0), torchvision(>=0.2.2)
  • tensorflow(>=2.2), tensorflow_io(>=0.13)

Author

Tae Hwan Jung (@graykode). We are looking for contributors.

matorage's People

Contributors: dependabot[bot], graykode

matorage's Issues

Create a context manager for `DataSaver`

Calling the disconnect function manually after every save is error-prone.
Therefore, we plan to support Python's context manager protocol as follows. #18

with DataSaver(...) as datasaver:
   datasaver(...)
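
Until then, a minimal sketch of such a wrapper using only the existing DataSaver API (the wrapper itself is hypothetical; native with-support is what this issue proposes):

import contextlib
from matorage import DataSaver

@contextlib.contextmanager
def data_saver(config):
    saver = DataSaver(config=config)
    try:
        yield saver
    finally:
        saver.disconnect()  # always disconnect, even if saving raises

with data_saver(traindata_config) as saver:
    saver({'image': image, 'target': target})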

metadata file change from FileStorage to Database

Currently, metadata attributes are managed as a JSON file. However, as a long-term plan, we will move them to a database so that they can be updated concurrently.

e.g.: data

{
    "additional": {
        "framework": "pytorch",
        "mode": "test"
    },
    "attributes": [
        {
            "itemsize": 0,
            "name": "image",
            "shape": [
                1,
                28,
                28
            ],
            "type": "float32"
        },
        {
            "itemsize": 0,
            "name": "target",
            "shape": [
                1
            ],
            "type": "int64"
        }
    ],
    "compressor": {
        "complevel": 0,
        "complib": "zlib"
    },
    "dataset_name": "mnist",
    "endpoint": "/Users/graykode/shared",
    "filetype": [],
    "indexer": {
        "3335": {
            "length": 3335,
            "name": "tmpuoetuutie1ec9bdf4cb142e8.h5"
        },
        "6670": {
            "length": 3335,
            "name": "tmpzzv9w4r94aac98a99ee74d52.h5"
        },
        "10000": {
            "length": 3330,
            "name": "tmp3qvp1bbtbf74db88d9a0499c.h5"
        }
    }
}
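
For illustration only (the schema below is an assumption, not a settled design), the indexer portion of this JSON could map onto a relational table like this:

import sqlite3

conn = sqlite3.connect('metadata.db')
conn.execute('''
    CREATE TABLE IF NOT EXISTS indexer (
        end_index INTEGER PRIMARY KEY,  -- cumulative sample count (the JSON keys)
        length    INTEGER NOT NULL,     -- number of samples in this shard
        name      TEXT    NOT NULL      -- shard file name (*.h5)
    )''')
conn.execute(
    'INSERT OR REPLACE INTO indexer VALUES (?, ?, ?)',
    (3335, 3335, 'tmpuoetuutie1ec9bdf4cb142e8.h5')
)
conn.commit()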

Error when I change the attributes of DataSaver with refresh=True

When I added attributes with DataSaver refresh=True, it gives an error like:

Traceback (most recent call last):
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/ssd2/works/kyulux/kyumlm/tddft/ann/feature.py", line 205, in <module>
    store_to_matorage(ds)
  File "/mnt/ssd2/works/kyulux/kyumlm/tddft/ann/feature.py", line 198, in store_to_matorage
    save_data_list(train_list, "train")
  File "/mnt/ssd2/works/kyulux/kyumlm/tddft/ann/feature.py", line 190, in save_data_list
    data_saver({
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/saver.py", line 317, in __call__
    self._check_data_numpytype()
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/saver.py", line 254, in _check_data_numpytype
    self._check_attr_name(name=name)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/saver.py", line 229, in _check_attr_name
    raise KeyError("attribute name {} is not exist!".format(name))
KeyError: 'attribute name inchikey is not exist!'

It seems the saver should check the config attributes along with the other params, or reset the attributes when refresh=True.

AssertionError(assert len(self._object_file_mapper) == (len(self.merged_indexer) + len(self.merged_filetype)))

Can I ask what this error means?

Traceback (most recent call last):
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 202, in run
    self.setup()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 190, in setup
    self.set_dataloaders()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 134, in set_dataloaders
    trainset, valset = self.set_datasets()
  File "/home/jinserk/kyu/kyumlm/tddft/ann/workers.py", line 65, in set_datasets
    dataset = MatorageAnnDataset(trainset_config, clear=True)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/torch/dataset.py", line 73, in __init__
    super(Dataset, self).__init__(config, **kwargs)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/data.py", line 80, in __init__
    self._init_download()
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/data.py", line 189, in _init_download
    assert len(self._object_file_mapper) == (len(self.merged_indexer) + len(self.merged_filetype))
AssertionError

show bucket list for each type (dataset, model and optimizer)

matorage users must remember the metadata used when saving and supply it again when loading.
Therefore, implement a function that, given only the storage endpoint, lists the datasets, models, and optimizers currently stored.

from matorage import StorageConfig

if __name__ == '__main__':
    config = StorageConfig(
        endpoint='127.0.0.1:9000',
        access_key='minio',
        secret_key='miniosecretkey',
    )
    print(config.get_datasets())
    print(config.get_models())
    print(config.get_optimizers())

A few questions for the usage

It's really fantastic! Thank you so much for sharing this project.
I had a quick test with minio docker process and confirmed it works really well as expected.
I'd like to ask a few questions about the usage:

  • Can I save any objects other than tensors, i.e. a tuple of tensors, a dict, or a sparse tensor? In this case, how do I define the attributes?
  • If I add more data samples to an existing dataset (in the case that data samples are added periodically, so I would have to refresh all the datasets with the added samples), will it be okay to append to the dataset and save it?
  • If I use this in distributed training, will each dataloader hold its own copy of the dataset? For example, if I use this with PyTorch DDP, each process will have its own DataLoader and will load data samples during training -- so I wonder whether the loaded dataset size will be a multiple of the original dataset size (one copy per process) or not.
  • It looks like only PyTorch and TensorFlow are supported now, but how about numpy arrays or matrices for scikit-learn or XGBoost? Can I store numpy objects as well?

Save & load for scheduler

TensorFlow couples the optimizer and scheduler, so the scheduler is saved together with the optimizer, but in PyTorch they are independent.
For example, in TensorFlow the scheduler is attached when the optimizer is created:

import tensorflow as tf

if __name__ == '__main__':
    lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
        initial_learning_rate=0.0,
        decay_steps=100,
        end_learning_rate=0.001,
    )
    optimizer = tf.keras.optimizers.Adam(
        learning_rate=lr_schedule, beta_1=0.99, beta_2=0.1, epsilon=1e-6
    )
    print(lr_schedule.get_config())
    print(optimizer.get_config())

{'initial_learning_rate': 0.0, 'decay_steps': 100, 'end_learning_rate': 0.001, 'power': 1.0, 'cycle': False, 'name': None}
{'name': 'Adam', 'learning_rate': {'class_name': 'PolynomialDecay', 'config': {'initial_learning_rate': 0.0, 'decay_steps': 100, 'end_learning_rate': 0.001, 'power': 1.0, 'cycle': False, 'name': None}}, 'decay': 0.0, 'beta_1': 0.99, 'beta_2': 0.1, 'epsilon': 1e-06, 'amsgrad': False}

To implement this in PyTorch, modify the existing optimizer save and load code to be used as follows:

>>> optimizer_manager.save(optimizer, scheduler)
>>> optimizer_manager.load_with_scheduler(optimizer, scheduler, step=938)
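
For reference, PyTorch already exposes scheduler state separately from the optimizer; a minimal sketch (standard PyTorch APIs only) of the two state dicts the proposed save/load_with_scheduler would have to persist:

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(4, 2)  # toy model
optimizer = optim.Adam(model.parameters(), lr=0.01)
scheduler = StepLR(optimizer, step_size=100, gamma=0.5)

state = {
    'optimizer': optimizer.state_dict(),  # step counts and moment buffers
    'scheduler': scheduler.state_dict(),  # last_epoch, step_size, gamma, ...
}
# restoring later:
optimizer.load_state_dict(state['optimizer'])
scheduler.load_state_dict(state['scheduler'])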

dataloader raises pickle error when it is used in a process

Another bug(?) report here. I'm using PyTorch DDP, so I used matorage.torch.Dataset in a forked process (multiprocessing forkserver is set). Of course, the dataset was initialized in the run() function to avoid unexpected pickle errors. However, I still get an error related to pickling:

Process TrainProcess-1:
Traceback (most recent call last):
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 212, in run
    train_loss_per_target, ave_train_loss = self.train_epoch(epoch)
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 282, in train_epoch
    for i, data in enumerate(self.train_loader):
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 291, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 737, in __init__
    w.start()
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/context.py", line 291, in _Popen
    return Popen(process_obj)
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/popen_forkserver.py", line 47, in _launch
    reduction.dump(process_obj, buf)
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'WeakValueDictionary.__init__.<locals>.remove'

I have no idea what the WeakValueDictionary means. Did you test the use of Dataset and the resulting DataLoader in a multiprocessing environment?

Thank you again!

support inference of large models such as gpt-3 through storage calculation.

In deep learning, the popularity of large models (GPT-3, T5, Megatron-LM) is growing. However, this is intensifying the concentration of wealth in AI.

As a striking example, take GPT-3, currently a very hot topic. GPT-2 was 6GB on disk with 1.5B parameters. GPT-3, however, has 175B parameters, so its weights alone are assumed to occupy about 700GB (175B parameters × 4 bytes each in fp32).

To train or run inference through existing frameworks, all weights must be loaded into memory. In the case of GPT-3, however, it is difficult to provide 700GB of memory on a general PC.

But matorage can solve this problem. The philosophy of matorage's model storage is not to store a model as a single file, but to store it layer-wise. Therefore, matorage can solve this problem by fetching only the submodel weights acceptable to the PC, loading them into memory, and storing the calculated values in file storage. It has a similar philosophy to pydata/numexpr.

The implementation of this feature is reflected in 0.3.0. In addition, we will implement only forward (inference) operations rather than backward (training) ones, released first in the PyTorch version.
Once again, I hope that the future of AI will not be centralized by wealth, but decentralized by collective intelligence.
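
A rough sketch of that philosophy, reusing the layer-wise load API shown in the README (the layer names and shapes are hypothetical, model_manager is the one configured earlier, and load is assumed to return the layer's weight tensor, as the README example suggests; a real implementation also needs biases and activations):

import torch

hidden = torch.randn(1, 784)  # a hypothetical input batch
for layer_name in ['net1.0', 'net2.0']:  # hypothetical layer names
    # fetch only this layer's weights from storage,
    weight = model_manager.load('{}.weight'.format(layer_name), step=0)
    # run its forward computation,
    hidden = hidden @ weight.t()
    # and free the weights before fetching the next layer,
    # keeping peak memory at a single-layer footprint.
    del weight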

If you want to know more, please refer to the issue :
#openai/gpt-3/issues/1
#huggingface/transformers/issues/4658

Note
This issue does not use the official GPT-3 weights. The test is run by randomly initializing a model under the same conditions as shown in the image below.
[image]

attribute with string type

Hi @graykode!

One question about saving a dataset that includes string-type vectors. I have a dataset with string-type identifiers. These identifiers aren't used in training or in the prediction calculation, but they are used in analysis of the predictions, especially for outliers. Of course, I could create an integer vector and map each integer id to its string identifier, but I found that DataAttribute supports a string type, so I tried to use it. Here is the code I wrote:

        data_config = nas.DataConfig(
            endpoint="127.0.0.1:9000",
            access_key="...",
            secret_key="...",
            dataset_name="...",
            additional={
                "dataset": "train",
                "len": len(d_list),
                "framework": "pytorch",
                "dttm": tz.localtime().strftime("%Y%m%d_%H%M%S%z")
            },
            compressor={
                "complevel" : 9,
                "complib" : "zlib",
            },
            attributes=[
                nas.DataAttribute('id', 'string', (1, ), 30),
                nas.DataAttribute('fp', 'bool', (IN_DIM, )),
                nas.DataAttribute('target', 'float32', (OUT_DIM, )),
            ]
        )

        data_saver = nas.DataSaver(config=data_config, refresh=True)

        for x in tqdm(d_list, dynamic_ncols=True, desc=d_name):
            key = x.get('id')
            feat = x.get('fp')
            target = x.get('target')
            data_saver({
                "id": key,
                "fp": feat,
                "target": target,
            })

but I got an error as:

Traceback (most recent call last):
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/ssd2/works/kyulux/kyumlm/tddft/ann/feature.py", line 204, in <module>
    store_to_matorage(ds)
  File "/mnt/ssd2/works/kyulux/kyumlm/tddft/ann/feature.py", line 197, in store_to_matorage
    save_data_list(train_list, "train")
  File "/mnt/ssd2/works/kyulux/kyumlm/tddft/ann/feature.py", line 190, in save_data_list
    data_saver({
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/saver.py", line 317, in __call__
    self._check_data_numpytype()
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/saver.py", line 257, in _check_data_numpytype
    raise TypeError("I suspect you need to set the filetype.")
TypeError: I suspect you need to set the filetype.

I have no idea what the filetype means here, so I'd like to ask for your help. Could you let me know how to use a string-type vector as part of my dataset?
Thank you in advance!

How to save and load .pt pytorch model (to and from Minio) using your library?

Hello,
Thank you for sharing this library. I would appreciate it if you could guide me with a simple example of how to save a .pt model to MinIO and then load it from MinIO to do prediction using your library. The existing examples only explain how to save and test the model during training. Can I use the library to perform prediction only, by loading the model from MinIO? I have already trained my model and just need the library to simplify loading it from MinIO and running prediction. Thank you.
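
Not an official answer, but based on the ModelManager API shown in the README, a minimal inference-only sketch could look like this (Net and epoch=1 are placeholders for your own trained model architecture and saved checkpoint):

from matorage import ModelConfig
from matorage.torch import ModelManager

model_config = ModelConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    model_name='mnist_simple_training',
    additional={"version": "1.0.1"}
)
model_manager = ModelManager(config=model_config)

model = Net()                       # same architecture as at training time
model_manager.load(model, epoch=1)  # weights are filled in from MinIO
model.eval()                        # switch to inference mode; then call model(x)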

no metadata dir in a compressed bucket

Hi again,

Sorry for bothering you with several questions and bug reports, but this looks critical.
I made a compressed data bucket and it seems to store well, but when I retrieve the dataset, it has zero length, as follows:

Traceback (most recent call last):
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 202, in run
    self.setup()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 190, in setup
    self.set_dataloaders()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 134, in set_dataloaders
    trainset, valset = self.set_datasets()
  File "/home/jinserk/kyu/kyumlm/tddft/ann/workers.py", line 88, in set_datasets
    print(dataset[0])
  File "/home/jinserk/kyu/kyumlm/tddft/ann/dataset.py", line 35, in __getitem__
    x = super().__getitem__(index)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/torch/dataset.py", line 81, in __getitem__
    return self._get_item_with_download(idx)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/torch/dataset.py", line 89, in _get_item_with_download
    _objectname, _relative_index = self._find_object(idx)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/data.py", line 128, in _find_object
    _key = self.end_indices[_key_idx]
IndexError: list index out of range

I've checked briefly and found that the bucket has no metadata from which to read the dataset's meta info.
Can you fix this error? I have installed the latest master branch code.

Feature for storing not only numpy arrays but also file formats required for training

Many deep learning tasks require files as well as numpy arrays to evaluate the trained model.
These formats are applicable:

  • .txt
  • .jpg
  • .png
    ....

Therefore, it is implemented so that it can be used through the data saver as follows:

data_config = DataConfig(
    endpoint="127.0.0.1:9000",
    access_key="minio",
    secret_key="miniosecretkey",
    dataset_name="mnist",
    attributes=[
        ("input_ids", "int64", (384,)),
    ],
)
data_saver = DataSaver(config=data_config, refresh=True)

# general data saver for saving numpy arrays
data_saver({
      "input_ids": batch[0],
}, filetype=False)
data_saver.disconnect()

# file data saver
data_saver({
      "key": `file path`,
}, filetype=True)
data_saver.disconnect()

ToDo

  • Add local cache mapper for filetype dataset.
