JOSS - Functionality about mmappickle (CLOSED)

unine-chyn commented on June 19, 2024
JOSS - Functionality

from mmappickle.

Comments (6)

lfasnacht commented on June 19, 2024

Hello!
Thank you very much for agreeing to review. Here's a quick answer; I can go into more detail if required (though I first need to get to know joblib better). Let me know.

joblib shares several goals with mmappickle. Here are what I think are the main differences, as far as I know (feel free to correct me if needed):

  • joblib uses one file per array. This can create issues when working with a lot of arrays simultaneously (limits on the number of open files).
  • joblib seems to focus mainly on computation. mmappickle, on the other hand, also tries to ease data storage and distribution. Typically, having all the data in one file makes it simple to share (no need to "zip" files). In addition, the file uses the standard pickle protocol version 4, meaning that it can be loaded by pickle.load. This is, in my opinion, important when sharing and archiving data, since individual projects, even open source ones, tend to evolve or disappear, while it seems pretty likely that Python pickle will stay quite stable.
  • Concerning performance, mmappickle has a small, nearly constant overhead over numpy.memmap. Since all the alternatives (including joblib) also use numpy.memmap, performance is usually similar. The overhead depends linearly on the number of keys in the dictionary and is cached, so it is usually not an issue unless the dictionary has a large number (thousands) of keys and changes frequently.
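
To make the pickle-compatibility point concrete, here is a minimal sketch that uses only the standard library. It does not use mmappickle itself; the payload and the file name are made up for illustration. It only shows that a file written with pickle protocol 4 needs nothing beyond pickle.load to be read back.

```python
import os
import pickle
import tempfile

# Hypothetical payload standing in for a dictionary of metadata and arrays.
data = {"exposure_ms": 20, "wavelengths": [450.0, 550.0, 650.0]}

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "archive.pkl")
with open(path, "wb") as f:
    pickle.dump(data, f, protocol=4)

# A protocol 4 stream starts with the PROTO opcode (0x80)
# followed by the protocol version byte.
with open(path, "rb") as f:
    header = f.read(2)

# No third-party library is needed to read the file back.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(header, restored == data)
```

Whether a given file stays readable over the years still depends on the classes referenced inside it, which is the versioning concern raised later in this thread.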

I hope that makes sense!

glemaitre commented on June 19, 2024

joblib uses one file per array. This can create issues when working with a lot of arrays simultaneously (limits on the number of open files).

I think this is no longer the case (since 0.10.0).

joblib seems to focus mainly on computation. mmappickle, on the other hand, also tries to ease data storage and distribution. Typically, having all the data in one file makes it simple to share (no need to "zip" files).

Basically, joblib.load and joblib.dump are intended for this usage (no longer with multiple files). Compression comes for free, on the fly, when dumping:

https://joblib.readthedocs.io/en/latest/persistence.html

and in addition the file uses the standard pickle protocol version 4, meaning that it can be loaded by pickle.load. This is, in my opinion, important when sharing and archiving data, since individual projects, even open source ones, tend to evolve or disappear, while it seems pretty likely that Python pickle will stay quite stable.

This is a complicated matter, IMO. Versioning will always be troublesome when dealing with pickle (Python version, pickle protocol version, library version, i.e. the way that data structures are stored). However, it is true that being compatible with the pickle library is a plus, at least at time t :)

concerning performance, mmappickle has a small, nearly constant overhead over numpy.memmap. Since all the alternatives (including joblib) also use numpy.memmap, performance is usually similar. The overhead depends linearly on the number of keys in the dictionary and is cached, so it is usually not an issue unless the dictionary has a large number (thousands) of keys and changes frequently.

IMO, this is the most important point. I will go through the code, but I would expect a benchmark of pickling/unpickling. It could be run on arrays and data structures of different sizes.

We could start with the LFW dataset, using the gist available here:
https://gist.github.com/aabadie/2ba94d28d68f19f87eb8916a2238a97c

lfasnacht commented on June 19, 2024

Basically, joblib.load and joblib.dump are intended for this usage (no longer with multiple files). Compression comes for free, on the fly, when dumping:

https://joblib.readthedocs.io/en/latest/persistence.html

Sorry, I was not aware of that. However, mmappickle is made to store a dictionary of things. If I try to do the same with joblib, let's say I have the following case:

import joblib
import numpy

# A dict mixing two plain arrays with one masked array
d = {'a': numpy.array([1, 2, 3]), 'b': numpy.array([4, 5, 6]),
     'c': numpy.ma.array([7, 8, 9], mask=[False, True, False])}
joblib.dump(d, '/tmp/out.pkl')
x = joblib.load('/tmp/out.pkl', mmap_mode='r+')  # reload memory-mapped

x['a'] and x['b'] are memmaps, but x['c'] is not (this would work with mmappickle).

Now, how do I add a new key to the dictionary stored in /tmp/out.pkl?

Regarding the benchmark, it is very far from the classical use case of mmappickle, but I'll adapt it nevertheless. Give me a few days ;-)

glemaitre commented on June 19, 2024

x['a'] and x['b'] are memmaps, but x['c'] is not (this would work with mmappickle).

Masked arrays are not supported in joblib, that is true.
https://github.com/joblib/joblib/blob/master/joblib/test/test_numpy_pickle.py#L275

Now, how do I add a new key to the dictionary stored in /tmp/out.pkl?

You would need to dump the dict again.

Regarding the benchmark, it is very far from the classical use case of mmappickle, but I'll adapt it nevertheless. Give me a few days ;-)

Could you give more information regarding the use case? Apparently, having a dictionary is useful for your use case. Does it allow dumping and memmapping on the fly?

lfasnacht commented on June 19, 2024

Now, how do I add a new key to the dictionary stored in /tmp/out.pkl?

You would need to dump the dict again.

That doesn't seem very efficient, especially as all the matrices have to be written to disk again.

Could you give more information regarding the use case? Apparently, having a dictionary is useful for your use case. Does it allow dumping and memmapping on the fly?

Exactly.

Here's the simplest use case I can think of: hyperspectral time lapses. At every time step an image is captured, but we don't know how many images will be captured in total. An image is usually a 3D numpy array of a few hundred megabytes, and its size is known beforehand (it consists of multiple camera frames). There are many advantages to holding all the images in the same file instead of having many different files (for example, the common metadata is stored only once), and it is not possible to hold all the images in RAM simultaneously (we usually use a laptop to capture data).

From the mmappickle point of view, here's what happens:

  1. at first, the file is created, and the metadata is written.
  2. at the beginning of each scan, a new key is added to the dict. This doesn't take much time, as only the matrix structure and a "hole" are written to the file.
  3. each frame is written to the (now memmap'ed) array (filling the "hole")
  4. then go back to step 2, until stopped by the user.
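
The reserve-then-fill pattern of steps 2 and 3 can be sketched with plain numpy. This is only an analogy under stated assumptions, not how mmappickle is implemented: it uses one .npy file per scan via numpy.lib.format.open_memmap instead of a single pickle file, and the dimensions and file name are hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical dimensions, kept tiny so the sketch runs instantly.
N_FRAMES, HEIGHT, WIDTH = 4, 3, 5

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "scan_0.npy")

# Step 2: write only the array header and reserve space (the "hole").
scan = np.lib.format.open_memmap(
    path, mode="w+", dtype=np.float32, shape=(N_FRAMES, HEIGHT, WIDTH)
)

# Step 3: fill the hole one camera frame at a time.
for i in range(N_FRAMES):
    scan[i] = np.full((HEIGHT, WIDTH), i, dtype=np.float32)
    scan.flush()  # push the frame to disk

    # A second, read-only mapping (it could live in another process)
    # already sees the frames written so far.
    view = np.load(path, mmap_mode="r")
    assert view[i, 0, 0] == i

# First pixel of every frame, read back through the read-only mapping.
filled = [int(v) for v in view[:, 0, 0]]
print(filled)
```

The file reaches its full size as soon as the space is reserved, so memory use stays bounded by the frame currently being written rather than by the whole scan.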

The nice thing about using mmappickle is that it is possible to view the content of the file while it is being written, or even to use it from another program (for example to check that everything looks right, or to run an algorithm on the first captured images while still capturing new data). This kind of simultaneous access is not considered by other libraries.

That's why the common "pickling all data at once" benchmark doesn't really make sense for mmappickle. The common use case is random access (add/delete/memmap) to the dict.

Does that make sense to you?
