Comments (6)

dwf commented on June 18, 2024

Yeah, it is a careful dance we have to do: mutating self.classvar in place
(through its methods, or += on a mutable object) affects the shared class
attribute, but assigning to it creates an instance attribute, which is
annoying. Maybe we just need to adopt a convention of always using
self.__class__ so these errors don't creep in again.
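
To make the pitfall concrete, here is a minimal sketch. The class name Dataset, the method names, and the file name are made up for illustration; only the attribute name ref_counts comes from this thread.

```python
class Dataset:
    # Class attribute shared by every instance.
    ref_counts = {}

    def mutate(self, key):
        # In-place mutation through self changes the shared dict.
        self.ref_counts[key] = self.ref_counts.get(key, 0) + 1

    def rebind(self):
        # Assignment through self creates an instance attribute that
        # shadows the class attribute for this instance only.
        self.ref_counts = {}

    def rebind_on_class(self):
        # Going through self.__class__ rebinds the shared attribute.
        self.__class__.ref_counts = {}


a, b = Dataset(), Dataset()
a.mutate("data.hdf5")
# b sees the entry that a added, because both use the same dict.
shared_after_mutation = b.ref_counts == {"data.hdf5": 1}

a.rebind()
# a now has its own empty dict; b still sees the class-level one.
shadowed = "ref_counts" in vars(a) and b.ref_counts == {"data.hdf5": 1}
```

After a.rebind(), any further a.mutate(...) call silently updates a's private dict rather than the shared one, which is the kind of error the self.__class__ convention is meant to prevent.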

from fuel.

dwf commented on June 18, 2024

Oh oops, I misread.

Maybe we need a proxy for the state that is not the HDF5 file handle itself.

jfsantos commented on June 18, 2024

What about keeping the handle together with the counter in a tuple inside the ref_counts dict and using handle.id as the state? I think that would actually make the class share file handles. As the class is written today, two streams accessing the same dataset each get a different file handle: they call open, which calls _out_of_memory_open, which currently always creates a new handle by calling h5py.File.

EDIT: Sent a PR with a tentative implementation for this.

Would you mind commenting on this, @vdumoulin?

bartvm commented on June 18, 2024

h5py.File doesn't always create a new handle. If you pass it an existing handle (as is done in the current implementation), it will simply reuse it, so currently classes already share a handle.

Passing around the file id as state instead of the handle itself sounds like a good solution though.

jfsantos commented on June 18, 2024

Right, somehow I overlooked this detail. However, h5py.h5f.FileID is not picklable either, which is why I decided to use the file path as state instead of the handle. I changed a few things to make this work, but it's kind of messy. We could instead keep everything else as it was (ref_counts storing handles as keys and counts as values) and just add a _get_handle method that finds the right handle among the keys based on the file path, if it exists. One drawback is that the lookup has to happen every time, but that is what _get_file_id was doing anyway. What do you think?
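
A rough sketch of what such a lookup could look like. FakeHandle is a hypothetical stand-in for an open HDF5 handle, and the assumption that a handle exposes its path as a filename attribute is mine; the body of _get_handle is a guess at the idea, not the code from the PR.

```python
import os


class FakeHandle:
    """Hypothetical stand-in for an open HDF5 file handle."""

    def __init__(self, path):
        # Assumption: the handle remembers the full path it was opened from.
        self.filename = os.path.abspath(path)


# Handles as keys, reference counts as values, as in the existing layout.
ref_counts = {}


def _get_handle(path):
    """Return the open handle for `path`, or None if there is none."""
    full_path = os.path.abspath(path)
    # Linear scan over the keys: this is the lookup that has to be
    # repeated on every call.
    for handle in ref_counts:
        if handle.filename == full_path:
            return handle
    return None


handle = FakeHandle("train.hdf5")
ref_counts[handle] = 1
found = _get_handle("train.hdf5")      # the handle registered above
missing = _get_handle("other.hdf5")    # None, nothing is open for this path
```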

vdumoulin commented on June 18, 2024

@Thrandis This is your ticket this week. Here's a quick recap of the issue:

In order to enable multiple instances of a dataset to exist at the same time (e.g. a training set and a validation set), we have to juggle file handles a bit, since h5py doesn't play well with a file being opened multiple times.

The solution we implemented is to keep a reference counter as a class attribute, H5PYDataset.ref_counts, which is a dict mapping FileID instances to a reference count. The first time a given file is opened, it is opened as usual by passing a path to the h5py.File constructor. The FileID instance corresponding to this opened file is added to H5PYDataset.ref_counts with a reference count of 1. On subsequent opens, this FileID object is retrieved and is passed to h5py.File instead of the path. When H5PYDataset.close is called, the reference count is decreased and if it falls to 0, the file handle is properly closed.

In parallel to that, when a Dataset subclass opens a file from disk, it may need to maintain a state for that particular file handle (e.g. where we're at when reading a file sequentially). This is the state object that is returned by Dataset.open and that is taken as argument for Dataset.get_data. For H5PYDataset we use the file handle as state.

This is where the problem resides: some classes such as DataStream will pickle this state object, which fails for h5py file handles. We need to use something that's picklable as the state while still retaining the ability to have many open copies of the same file.
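
The contrast can be illustrated without h5py. In the sketch below an ordinary open file stands in for the HDF5 handle; this is only an analogy (plain file objects also refuse to pickle), and the file name is made up.

```python
import os
import pickle
import tempfile

# An ordinary open file stands in for an h5py handle here; this is an
# analogy, not h5py itself.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as handle:
        try:
            pickle.dumps(handle)
            handle_picklable = True
        except TypeError:
            # Open file objects cannot be pickled.
            handle_picklable = False

    # The path is a plain string, so it round-trips through pickle.
    restored_path = pickle.loads(pickle.dumps(path))
```

A state that is just a string survives whatever pickling DataStream does, as long as there is a way to map it back to an open handle afterwards.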

Here's what I suggest:

  • The state passed around is a file path (using the full path for uniqueness).
  • We maintain a private _ref_counts class attribute which is a dict mapping file paths to reference counts.
  • We also maintain a private _file_handles class attribute which is a dict mapping file paths to HDF5 file handles.
  • We implement get_file_handle and release_file_handle class methods which are responsible for maintaining the state of _ref_counts and _file_handles.
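
A minimal sketch of the four points above, under stated assumptions: HandleRegistry is a hypothetical container name (in Fuel these attributes and methods would live on H5PYDataset), and the opener parameter is an invention that keeps the example independent of h5py. Only _ref_counts, _file_handles, get_file_handle and release_file_handle come from the list.

```python
import os


class HandleRegistry:
    """Sketch of path-keyed reference counting for file handles."""

    _ref_counts = {}    # full path -> number of active users
    _file_handles = {}  # full path -> open handle

    # Assumption: `opener` is any callable that takes a path and returns
    # an object with a close() method (h5py.File in the real class).
    @classmethod
    def get_file_handle(cls, path, opener=open):
        key = os.path.abspath(path)
        if key not in cls._file_handles:
            # First user of this file: actually open it.
            cls._file_handles[key] = opener(key)
            cls._ref_counts[key] = 0
        cls._ref_counts[key] += 1
        return cls._file_handles[key]

    @classmethod
    def release_file_handle(cls, path):
        key = os.path.abspath(path)
        cls._ref_counts[key] -= 1
        if cls._ref_counts[key] == 0:
            # Last user is gone: close the file and forget it.
            cls._file_handles.pop(key).close()
            del cls._ref_counts[key]
```

With this layout the state handed to DataStream is just the path string, and get_data would look the handle up in _file_handles when it needs to read. Both dicts are only ever mutated in place through cls, so the instance-attribute pitfall from the top of the thread does not arise.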
