Comments (4)
Here's an idea: we could require that all datasets provide a list of the files in the $FUEL_DATA_PATH root directory which they need as their data source, along with checksums for these files. Optionally, the datasets could also point to a method that can be used to retrieve missing files.
We would then write a module that checks whether all required files for a given dataset are available, and do one of the following:
- If all files are present and their checksums check out, everything's fine.
- If some existing files' checksums do not check out, rename them (e.g. by appending ".old") and attempt to re-download them.
- If some files are not present, attempt to download them.
- If there's no script provided to download missing or damaged files, raise an error.
Some of the module's behaviour could be configured. For instance, files that do not check out could raise a warning without being re-downloaded, and that warning could also completely be disabled.
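The checks above could be sketched roughly as follows. This is only an illustration of the proposal, not existing Fuel code: the names `ensure_files`, `sha256_of`, `downloader` and `on_mismatch` are hypothetical, SHA-256 is an assumed choice of checksum, and the re-verification after downloading is an extra step that the list above does not mention.

```python
import hashlib
import os
import warnings


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_files(data_path, required, downloader=None,
                 on_mismatch="redownload"):
    """Check that every required file is present with the expected checksum.

    required:    dict mapping file name -> expected SHA-256 hex digest.
    downloader:  optional callable(file_name, destination_path)
                 (hypothetical interface).
    on_mismatch: "redownload", "warn" or "ignore" (hypothetical options).
    """
    for name, expected in required.items():
        path = os.path.join(data_path, name)
        if os.path.exists(path):
            if sha256_of(path) == expected:
                continue  # present and valid, nothing to do
            if on_mismatch == "ignore":
                continue
            if on_mismatch == "warn":
                warnings.warn("checksum mismatch for %s" % name)
                continue
            # Move the damaged file aside before fetching a fresh copy.
            os.replace(path, path + ".old")
        if downloader is None:
            raise IOError("%s is missing or damaged and no download "
                          "method was provided" % name)
        downloader(name, path)
        # Extra safety step (an assumption, not part of the list above).
        if sha256_of(path) != expected:
            raise IOError("downloaded %s does not match its checksum" % name)
```

A dataset would then only need to supply its `required` mapping and, optionally, a `downloader`; passing `on_mismatch="warn"` or `"ignore"` corresponds to the configurable behaviour described above.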
from fuel.
I think that pretty much covers everything. I agree that it's important that everything should be configurable: Whether datasets should be automatically downloaded, whether or not to accept files when the checksum doesn't match, etc.
Downloading should be pretty straightforward with e.g. the requests library. For some datasets we might want to have checksums without being able to provide a download source though (e.g. for datasets that aren't public, like Penn Treebank).
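As a rough sketch of what such a download step might look like, including the case where a dataset has a checksum but no public source: the example below uses the standard library's `urllib.request` instead of `requests` purely to stay dependency-free, and `FileSpec` and `obtain` are hypothetical names, not part of Fuel.

```python
import hashlib
import os
import shutil
import urllib.request
from collections import namedtuple

# Hypothetical description of one required file. `url` is None when the
# dataset cannot be downloaded publicly.
FileSpec = namedtuple("FileSpec", ["name", "sha256", "url"])


def obtain(spec, directory):
    """Download the file described by `spec` into `directory` and verify it."""
    if spec.url is None:
        raise IOError("%s has no download source; place it in %s manually"
                      % (spec.name, directory))
    destination = os.path.join(directory, spec.name)
    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(spec.url) as response, \
            open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    # Verify the downloaded file against the expected checksum.
    digest = hashlib.sha256()
    with open(destination, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != spec.sha256:
        raise IOError("checksum mismatch for %s" % spec.name)
    return destination
```

With a spec whose `url` is `None`, the checksum can still be used to validate a manually placed file, while `obtain` simply refuses to fetch it.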
One question to raise is whether we should also automatically process the data. For MNIST we can just read the image files directly, because there is little overhead, but for larger image datasets it might make sense to load them into an HDF5 file once, and from then on read that file instead. Should we do that automatically? It might be harder to checksum these files, because a different h5py version might just result in a different file.
Are fuel-download and fuel-convert considered sufficient to solve this issue, or do we still want fully-automated downloading?
Nope, I think this is good enough. Doing everything automagically just makes things needlessly complicated (e.g. what if you launch multiple jobs and they all start downloading simultaneously, or it might just end up downloading things whenever you set the data path incorrectly).