Comments (6)
Yeah, it is a careful dance we have to do: mutating a self.classvar with
methods or += works, but assigning to it makes it an instance attribute,
which is annoying. Maybe we just need to adopt a convention of always using
self.__class__ so these errors don't creep in again.
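To make the pitfall concrete, here is a minimal sketch, not taken from the fuel code base. The class name Demo and its methods are hypothetical and only illustrate how mutating a shared dict differs from assigning to it through self:

```python
class Demo(object):
    # Class attribute: one dict shared by every instance.
    ref_counts = {}

    def bump(self, key):
        # Item assignment mutates the shared dict in place.
        self.ref_counts[key] = self.ref_counts.get(key, 0) + 1

    def reset_wrong(self):
        # Plain assignment creates an *instance* attribute that
        # shadows the class attribute; other instances are unaffected.
        self.ref_counts = {}

    def reset_right(self):
        # Going through the class rebinds the shared attribute.
        self.__class__.ref_counts = {}


a, b = Demo(), Demo()
a.bump('data.hdf5')
b.bump('data.hdf5')
print(Demo.ref_counts)            # both bumps landed in the shared dict

a.reset_wrong()
print('ref_counts' in vars(a))    # a now carries its own shadowing dict
print(b.ref_counts)               # b still sees the shared counts
```

Note that self.__class__ resolves to the subclass when the method is called on a subclass instance, so naming the base class explicitly may be preferable if subclasses are expected to share the counter.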
from fuel.
Oh oops, I misread.
Maybe we need a proxy for the state that is not the HDF5 file handle itself.
What about keeping the handle together with the counter in a tuple inside the ref_counts dict and using handle.id as the state? I think that would actually cause the class to share file handles. As the class is written today, two streams accessing the same dataset each get a different file handle: they call open, which calls _out_of_memory_open, which currently always creates a new handle by calling h5py.File.
EDIT: Sent a PR with a tentative implementation for this.
Would you mind commenting on this, @vdumoulin?
h5py.File doesn't always create a new handle. If you pass it an existing handle (as is done in the current implementation) it will simply return this, so currently classes already share a handle.
Passing around the file id as state instead of the handle itself sounds like a good solution though.
Right, somehow I overlooked this detail. However, h5py.h5f.FileID is also not picklable, which is why I decided to use the file path as state instead of the handle. I changed a few things to make this work, but it's kind of messy. We could instead keep everything else as it was (ref_counts storing handles as keys and counts as values) and just add a method, _get_handle, that finds the right handle among the keys based on the file path (if it exists). One drawback is that the lookup would have to happen every time, but this is what _get_file_id was already doing anyway. What do you think?
@Thrandis This is your ticket this week. Here's a quick recap of the issue:
In order to enable multiple instances of a dataset to exist at the same time (e.g. having a training set and a validation set), we have to juggle file handles a bit, since h5py doesn't play well with having a file opened multiple times.
The solution we implemented is to keep a reference counter as a class attribute, H5PYDataset.ref_counts, which is a dict mapping FileID instances to a reference count. The first time a given file is opened, it is opened as usual by passing a path to the h5py.File constructor. The FileID instance corresponding to this opened file is added to H5PYDataset.ref_counts with a reference count of 1. On subsequent opens, this FileID object is retrieved and passed to h5py.File instead of the path. When H5PYDataset.close is called, the reference count is decreased, and if it falls to 0, the file handle is properly closed.
In parallel to that, when a Dataset subclass opens a file from disk, it may need to maintain a state for that particular file handle (e.g. where we're at when reading a file sequentially). This is the state object that is returned by Dataset.open and that is taken as an argument by Dataset.get_data. For H5PYDataset, we use the file handle as state.
This is where the problem resides: some classes, such as DataStream, will pickle this state object, which fails for h5py file handles. We need to use something that's picklable as the state while still retaining the ability to have many open copies of the same file.
Here's what I suggest:
- The state passed around is a file path (using the full path for uniqueness).
- We maintain a private _ref_counts class attribute which is a dict mapping file paths to reference counts.
- We also maintain a private _file_handles class attribute which is a dict mapping file paths to HDF5 file handles.
- We implement get_file_handle and release_file_handle class methods which are responsible for maintaining the state of _ref_counts and _file_handles.
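A rough sketch of how this bookkeeping could look, under stated assumptions: the class name H5PYDatasetSketch and the _open_file helper are hypothetical stand-ins (not the actual fuel implementation), and the signatures of get_file_handle and release_file_handle are one possible reading of the suggestion above.

```python
import os


class H5PYDatasetSketch(object):
    """Hypothetical stand-in showing only the handle bookkeeping."""
    _ref_counts = {}    # absolute file path -> number of open references
    _file_handles = {}  # absolute file path -> open HDF5 file handle

    @staticmethod
    def _open_file(path):
        # Assumption: the real class would open the file with h5py here.
        # Imported lazily so the bookkeeping can be exercised without it.
        import h5py
        return h5py.File(path, mode='r')

    @classmethod
    def get_file_handle(cls, path):
        # Use the full path as the key so each file has a unique entry.
        path = os.path.abspath(path)
        if path not in cls._file_handles:
            # First open: create the handle and start counting.
            cls._file_handles[path] = cls._open_file(path)
            cls._ref_counts[path] = 0
        cls._ref_counts[path] += 1
        return cls._file_handles[path]

    @classmethod
    def release_file_handle(cls, path):
        path = os.path.abspath(path)
        cls._ref_counts[path] -= 1
        if cls._ref_counts[path] == 0:
            # Last reference gone: close the file and forget the path.
            cls._file_handles.pop(path).close()
            del cls._ref_counts[path]
```

With this layout, open would call get_file_handle and return the absolute path as state, which is a plain string and therefore picklable, while get_data would look the handle up in _file_handles from that path.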