
A few questions for the usage about matorage (OPEN, 7 comments)

jinserk commented on June 14, 2024

from matorage.

Comments (7)

graykode commented on June 14, 2024

@jinserk

I understand that there is currently no official HDF5 support for sparse matrices. (This is not impossible to implement; there are existing implementations such as https://github.com/appier/h5sparse.) For this reason, the official PyTables documentation also recommends compression for sparse matrices.

In fact, many sources recommend using compression with the HDF5 format (https://stackoverflow.com/a/25678471/5350490). In that example, a sparse matrix of 512 MB compresses down to about 4.5 KB. So, could you experiment with your 400 GB of data and report the final compressed size? Please try compression='gzip' with level=9 and let me know how small it gets!
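As a rough illustration of why gzip helps so much for sparse data, here is a small sketch using h5py directly (not matorage's own API; the array shape and the handful of non-zero entries are invented for the demo):

```python
import os
import tempfile

import h5py
import numpy as np

# A mostly-zero array stands in for a sparse matrix: long runs of
# zeros are exactly what gzip compresses away.
data = np.zeros((2048, 2048), dtype=np.float32)
data[::64, ::64] = 1.0  # a few non-zero entries

tmpdir = tempfile.mkdtemp()
raw_path = os.path.join(tmpdir, "raw.h5")
gz_path = os.path.join(tmpdir, "gzip9.h5")

with h5py.File(raw_path, "w") as f:
    f.create_dataset("x", data=data)  # no compression

with h5py.File(gz_path, "w") as f:
    # gzip level 9: slowest to write, smallest on disk
    f.create_dataset("x", data=data, compression="gzip", compression_opts=9)

raw_size = os.path.getsize(raw_path)
gz_size = os.path.getsize(gz_path)
print(f"raw: {raw_size / 2**20:.1f} MiB, gzip-9: {gz_size / 2**10:.1f} KiB")
```

On dense data with little redundancy the gain would of course be far smaller; the dramatic ratios only appear when most entries are identical.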

In addition, apart from this, we will add a sparse-matrix mechanism to our long-term plans!!


jinserk commented on June 14, 2024

Thanks for the suggestion, @graykode! I will try that, since I don't know how well the compression will work. I once tried to store the fixed (dense) tensors, and the serialized file was almost 400 GB (using torch.save), while the file with sparse tensors was only 4 GB. I still hope that storing sparse tensors will be supported in matorage soon. :)


graykode commented on June 14, 2024

@jinserk Thank you for your interest in the project!

  • Can I save objects other than tensors, i.e. a tuple of tensors, a dict, or a sparse tensor? In that case, how do I specify the attributes?: As the existing examples show, tensors in the form of a tuple or dict can also be stored. We do not define a new operation for sparse tensors, but they can still be saved; a dedicated sparse format will go on the long-term plan. Below is an example of storing data as a dict:
attributes=[
        ('image', 'float32', (1, 28, 28)),
        ('target', 'int64', (1,))
    ]

traindata_saver({
        'image': image,
        'target': target
    })
  • If I add more data samples to an existing dataset (periodically, so the dataset has to be refreshed with the added samples), will it be okay to add to the dataset and save it?: Simply appending more data is no problem; just save using the existing config. Refreshing the data, however, is not currently implemented. If you want to refresh a dataset (that is, remove its buckets), you should use the minio web UI or minio's mc command (mc rb --force --dangerous local/<bucket_name>). I will implement a refresh method by adding a new option to the data saver, like this:
traindata_saver({
        'image': image,
        'target': target
    }, refresh=True) # I will add this argument
  • If I use this in distributed training, will each dataloader hold the whole dataset? For example, with PyTorch DDP each process has its own DataLoader and loads samples during training, so will the loaded dataset size be a multiple of the original dataset size per process?: I understood this question as follows: with distributed data parallel (DDP), when N data items are replicated across M nodes, does that produce N*M items in total? Yes, that's right. When a matorage torch dataset is initialized, it runs logic that pre-downloads the data corresponding to the config (https://github.com/graykode/matorage/blob/master/matorage/data/data.py#L80). This is very inefficient, so we recommend the network-attached storage (NAS) option when using DDP. With NAS no new downloads are made, so the N items are kept intact and DDP training can proceed.
  • It looks like only PyTorch and TensorFlow are supported now, but what about numpy arrays for scikit-learn or XGBoost? Can I store numpy objects as well?: When PyTorch (torch.Tensor) and TensorFlow (tf.Tensor) tensors are saved, they are converted to numpy first. (As with TFRecord, tensors could be encoded in protobuf format, but that lacks universality, so the far more widespread numpy format was used.) Since the storage format is a numpy array, you can therefore store data for scikit-learn or XGBoost as well. Of course, the data loaders would need additional implementation. See how a numpy array is saved: https://github.com/graykode/matorage/blob/master/tests/test_datasaver.py#L247
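The per-process sharding that makes DDP work without multiplying reads can be sketched in plain Python (this mirrors the round-robin scheme of torch.utils.data.DistributedSampler; matorage itself is not involved):

```python
# Minimal sketch: with DDP, each of M processes reads a disjoint 1/M
# slice of the N samples, so no process needs more than N/M items.
def shard_indices(n_samples, num_replicas, rank):
    # round-robin assignment, the same scheme DistributedSampler uses
    return list(range(rank, n_samples, num_replicas))

N, M = 10, 2
shards = [shard_indices(N, M, rank) for rank in range(M)]
print(shards)  # [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
```

Together the shards cover every sample exactly once, which is why shared storage (NAS) lets N items serve all M processes without duplication.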
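Because everything ends up as plain numpy on disk, data for scikit-learn or XGBoost round-trips naturally. A small sketch, using numpy's own .npz container as a stand-in for the HDF5-backed store (not matorage's API):

```python
import os
import tempfile

import numpy as np

# Feature matrix and labels, as a scikit-learn/XGBoost pipeline would use them.
X = np.random.rand(100, 8).astype(np.float32)
y = (X.sum(axis=1) > 4.0).astype(np.int64)

path = os.path.join(tempfile.mkdtemp(), "dataset.npz")
np.savez(path, X=X, y=y)      # save as raw numpy arrays

loaded = np.load(path)        # load back, framework-agnostic
assert np.array_equal(loaded["X"], X)
assert np.array_equal(loaded["y"], y)
print(loaded["X"].shape, loaded["y"].shape)  # (100, 8) (100,)
```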

Best regards,
Tae Hwan


jinserk commented on June 14, 2024

@graykode Thank you very much for the detailed answers! It's really helpful, and very impressive.

My first question was actually about heterogeneously shaped tensors: the case where, in your MNIST example, the image size changes from sample to sample. In practice I'm working on a chemistry problem (molecule classification in chemistry and pharma companies), where the input features are graphs whose sizes vary with the molecule. I know this cannot simply be expressed with attributes, which is why I asked about sparse-matrix support. I hope this can be implemented and usable soon! :)


graykode commented on June 14, 2024

@jinserk
Matrices with irregular shapes are difficult to store regardless of sparsity. Sparsity itself is not hard to support, since it is already handled through scipy (https://github.com/appier/h5sparse). However, storing a tensor with an undefined shape in HDF5 is very difficult.
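A fixed-shape sparse matrix is straightforward precisely because CSR reduces to three dense 1-D arrays, which is what h5sparse writes as ordinary HDF5 datasets under the hood. A minimal sketch with scipy:

```python
import numpy as np
from scipy import sparse

# CSR keeps only (data, indices, indptr); each is a plain 1-D array,
# so each can be stored as an ordinary fixed-shape HDF5 dataset.
m = sparse.csr_matrix(np.array([[0, 0, 3],
                                [4, 0, 0]]))
print(m.data, m.indices, m.indptr)  # [3 4] [2 0] [0 1 2]

# Rebuilding from the three arrays recovers the matrix exactly.
rebuilt = sparse.csr_matrix((m.data, m.indices, m.indptr), shape=m.shape)
assert (rebuilt != m).nnz == 0
```

An irregularly shaped tensor has no such fixed decomposition, which is why it is the harder problem here.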

I have a question: all inputs to a PyTorch model must have the same shape, so I am curious how a tensor of heterogeneous shape can be fed to a model.


jinserk commented on June 14, 2024

Good question. Basically I use a fixed shape for the model input: during training I pad the heterogeneous inputs up to a fixed shape given by the maximum dimension values. I ran a quick test storing my whole dataset as padded dense matrices, and the stored file was almost 100 times bigger, which is totally impractical.
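The blow-up from padding is easy to see with toy numbers (the graph sizes below are invented for illustration):

```python
import numpy as np

# Three "graphs" of different sizes, as adjacency-like square matrices.
graphs = [np.ones((3, 3)), np.ones((7, 7)), np.ones((2, 2))]
max_dim = max(g.shape[0] for g in graphs)

# Pad every sample with zeros up to the maximum dimension.
padded = np.zeros((len(graphs), max_dim, max_dim), dtype=np.float32)
for i, g in enumerate(graphs):
    n = g.shape[0]
    padded[i, :n, :n] = g

dense_cells = padded.size                 # 3 * 7 * 7 = 147
real_cells = sum(g.size for g in graphs)  # 9 + 49 + 4 = 62
print(f"padded cells: {dense_cells}, real cells: {real_cells}")
```

Every sample pays for the largest one, so the ratio worsens as the size distribution gets more skewed, which is consistent with the ~100x growth reported above.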


graykode commented on June 14, 2024

If so, how about storing in matorage the fixed tensor itself that goes into the model's input? This is the core idea of matorage.
Also, a high compression level (7-9) helps sparse matrices store more compactly.

