neuro-ml / connectome

A library for datasets containing heterogeneous data

Home Page: https://neuro-ml.github.io/connectome/

License: Apache License 2.0

Python 100.00%
python data-processing pipelines

connectome's Introduction

Connectome is a framework for dataset management with a strong emphasis on simplicity, composability and reusability.

Features

  • Self-consistency: connectome encourages data transformations that keep entries' fields consistent
  • Caching: caching of transformations works out of the box, both to RAM and to disk
  • Automatic cache invalidation: connectome tracks all the changes made to a dataset and automatically invalidates the cache when something changes, making sure that your cache is always consistent with the data
  • Invertible transformations: write consistent pre- and post-processing to build production-ready pipelines
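
To make the automatic cache invalidation idea more concrete, here is a minimal sketch in plain Python. It is not connectome's implementation: `cache_key` and `InvalidatingCache` are hypothetical names, and the key is simply a hash of the function's bytecode, constants and arguments, so editing the function produces a new key and the old entry is no longer used.

```python
import hashlib


def cache_key(func, *args):
    """Hash the function's code together with its arguments (toy version)."""
    code = func.__code__
    # assumption: simple functions only; nested functions would need recursion
    payload = repr((code.co_code, code.co_consts, code.co_names, args))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class InvalidatingCache:
    """Toy RAM cache whose entries are keyed by code *and* arguments."""

    def __init__(self):
        self._storage = {}

    def call(self, func, *args):
        key = cache_key(func, *args)
        if key not in self._storage:
            self._storage[key] = func(*args)
        return self._storage[key]


def scale(x):
    return x * 2


def scale_changed(x):
    # stands in for an edited version of `scale`
    return x * 3


cache = InvalidatingCache()
first = cache.call(scale, 10)               # computed and stored
again = cache.call(scale, 10)               # same key, served from the cache
after_edit = cache.call(scale_changed, 10)  # different code, so a different key
```

A real implementation also has to follow globals, closures and nested functions, which this sketch deliberately leaves out.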

Install

The simplest way is to get it from PyPI:

pip install connectome

Or if you want to try the latest version from GitHub:

git clone https://github.com/neuro-ml/connectome.git
cd connectome
pip install -e .

# or let pip handle the cloning:
pip install git+https://github.com/neuro-ml/connectome.git

Getting started

The docs are located at https://neuro-ml.github.io/connectome/

Also, you can check out our Intro to connectome series of tutorials here

Acknowledgements

Some parts of our automatic cache invalidation machinery were heavily inspired by the cloudpickle project.

connectome's People

Contributors

alimbfromlimb, ganddalf, maxme1, mishgon, samokhinv, stanislaushimovolos, stnld2

Forkers

samokhinv

connectome's Issues

Need a FrozenChain

This would allow slicing such as dataset[1:] even if the chain starts with, e.g., a caching block.

Add __init__ support

Users should be able to define __init__ with custom argument validation logic.
Inside it, self will only support setting and getting attributes from a limited set of names, i.e. node names.
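
One way to picture the restricted self described here is a small proxy object. This is only a sketch of the idea, not the library's design: `RestrictedSelf`, the `init` function and the node names `spacing` and `image` are all made up for illustration.

```python
class RestrictedSelf:
    """Hypothetical proxy that only accepts attributes from a fixed set of names."""

    def __init__(self, allowed):
        # bypass our own __setattr__ for the two internal fields
        object.__setattr__(self, "_allowed", frozenset(allowed))
        object.__setattr__(self, "_values", {})

    def __setattr__(self, name, value):
        if name not in self._allowed:
            raise AttributeError(f"'{name}' is not a known node name")
        self._values[name] = value

    def __getattr__(self, name):
        # called only when the normal attribute lookup fails
        values = object.__getattribute__(self, "_values")
        if name in values:
            return values[name]
        raise AttributeError(name)


def init(self, spacing):
    """User-defined initializer with custom validation (illustrative)."""
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    self.spacing = spacing


proxy = RestrictedSelf({"spacing", "image"})
init(proxy, 2.0)
```

Assigning to any name outside the allowed set, for example `proxy.unknown = 1`, raises `AttributeError` in this sketch.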

Add `_drop_ram_cache`

Drop only the RAM cache layers from the dataset. Might be useful for switching between machines with large and small amounts of RAM.
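
As a rough illustration of the request, assuming a pipeline can be viewed as a flat list of layers, dropping the RAM caches amounts to filtering by layer kind. The `Layer` class and the `kind` labels below are hypothetical and do not mirror connectome's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    kind: str  # assumed labels: "transform", "ram_cache" or "disk_cache"


def drop_ram_cache(layers):
    """Return a new pipeline without RAM cache layers; the input is untouched."""
    return [layer for layer in layers if layer.kind != "ram_cache"]


pipeline = [
    Layer("load", "transform"),
    Layer("cache_ram", "ram_cache"),
    Layer("normalize", "transform"),
    Layer("cache_disk", "disk_cache"),
]
light = drop_ram_cache(pipeline)
light_names = [layer.name for layer in light]
```

Disk caches are kept, so results computed on the large-memory machine could still be reused on the small one.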

Caching is slower than it should be

This is because each time we access the cache, we need to calculate the hashes for each node.
Moreover, if there is caching to disk, it will check each time whether the cache entry is present.
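
The overhead can be sketched with two toy caches: one recomputes the key on every access, the other remembers the key per argument. Both classes are hypothetical, and the second assumes hashable arguments and code that does not change while the process runs; it is one possible mitigation, not the project's chosen fix.

```python
import hashlib


class SlowKeyCache:
    """Recomputes the key on every access, as described in the issue."""

    def __init__(self, func):
        self.func = func
        self.storage = {}
        self.hash_calls = 0  # counts how often the hashing actually runs

    def _key(self, arg):
        self.hash_calls += 1
        payload = repr((self.func.__code__.co_code, arg)).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, arg):
        key = self._key(arg)
        if key not in self.storage:
            self.storage[key] = self.func(arg)
        return self.storage[key]


class MemoizedKeyCache(SlowKeyCache):
    """Remembers the key per argument, so repeated hits skip the hashing."""

    def __init__(self, func):
        super().__init__(func)
        self._keys = {}

    def get(self, arg):
        if arg not in self._keys:
            self._keys[arg] = self._key(arg)
        key = self._keys[arg]
        if key not in self.storage:
            self.storage[key] = self.func(arg)
        return self.storage[key]


def square(x):
    return x * x


slow, fast = SlowKeyCache(square), MemoizedKeyCache(square)
slow_results = [slow.get(4) for _ in range(3)]
fast_results = [fast.get(4) for _ in range(3)]
```

After three reads of the same entry, `slow.hash_calls` is 3 while `fast.hash_calls` is 1.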

More serializers

  • numpy with gzip
  • pickle
  • json
  • compound stuff like dicts of anything from above
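
A possible shape for such serializers, using only the standard library, is sketched below. The class names and the `dumps`/`loads` interface are assumptions made for this example, not connectome's API. The gzip wrapper is shown around pickle; a numpy variant would plug in the same way but is omitted to keep the sketch dependency-free.

```python
import gzip
import json
import pickle


class JsonSerializer:
    def dumps(self, value):
        return json.dumps(value).encode("utf-8")

    def loads(self, data):
        return json.loads(data.decode("utf-8"))


class PickleSerializer:
    def dumps(self, value):
        return pickle.dumps(value)

    def loads(self, data):
        return pickle.loads(data)


class GzipSerializer:
    """Wraps another serializer and compresses its output."""

    def __init__(self, inner):
        self.inner = inner

    def dumps(self, value):
        return gzip.compress(self.inner.dumps(value))

    def loads(self, data):
        return self.inner.loads(gzip.decompress(data))


class DictSerializer:
    """Compound serializer: each dict field gets its own serializer."""

    def __init__(self, fields):
        self.fields = fields  # mapping: field name -> serializer

    def dumps(self, value):
        parts = {name: ser.dumps(value[name]) for name, ser in self.fields.items()}
        return pickle.dumps(parts)

    def loads(self, data):
        parts = pickle.loads(data)
        return {name: self.fields[name].loads(blob) for name, blob in parts.items()}


compound = DictSerializer({
    "meta": JsonSerializer(),
    "mask": GzipSerializer(PickleSerializer()),
})
entry = {"meta": {"id": "case-1"}, "mask": [0, 1, 1, 0] * 100}
restored = compound.loads(compound.dumps(entry))
```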

Pickling functions

Currently, while pickling code objects, co_filename is also saved. Should we remove it?
This is a problem for functions defined inside configs, mostly lambdas.
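
If the answer is yes, one option on Python 3.8+ is `CodeType.replace`, which returns a copy of a code object with selected fields changed. The sketch below is an illustration rather than the project's decision: `strip_filename`, the placeholder value and the two file paths are invented for the example.

```python
import types


def strip_filename(code, placeholder="<unknown>"):
    """Return a copy of `code` with co_filename replaced, including nested code."""
    consts = tuple(
        strip_filename(const, placeholder) if isinstance(const, types.CodeType) else const
        for const in code.co_consts
    )
    return code.replace(co_filename=placeholder, co_consts=consts)


source = "lambda x: x + 1"
# the same lambda "defined" in two different config files (paths are made up)
first = eval(compile(source, "/home/alice/config.py", "eval"))
second = eval(compile(source, "/home/bob/config.py", "eval"))

clean_first = strip_filename(first.__code__)
clean_second = strip_filename(second.__code__)

# the stripped code object is still usable
rebuilt = types.FunctionType(clean_first, {})
```

Before stripping, the two code objects differ only in `co_filename`; afterwards that field is identical, so a hash that includes it no longer depends on where the config file lives. The trade-off is that tracebacks from the rebuilt function lose the original file name.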

Clean ram cache when it becomes useless

Example: after training I don't need to keep the training data in RAM anymore, but I may still need some RAM for testing purposes. If I could free that RAM, the experiment would require less memory.

Improve caching to disk

  • Multiple storage folders
  • Track disk usage
  • Save additional metadata: timestamp, current user, project name, other user-defined json data
  • Keep folders permissions and ownership
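
Two of these items, saving metadata and tracking disk usage, can be sketched with the standard library. The on-disk layout (`data.bin` next to `meta.json`), the function names and the metadata fields are assumptions for illustration and do not describe the existing cache format.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def save_with_metadata(folder, key, payload, project, extra=None):
    """Store a cache entry plus a JSON file describing who wrote it and when."""
    entry = Path(folder) / key
    entry.mkdir(parents=True, exist_ok=True)
    (entry / "data.bin").write_bytes(payload)
    meta = {
        "timestamp": time.time(),
        # assumption: the USER variable is good enough to identify the user
        "user": os.environ.get("USER", "unknown"),
        "project": project,
        "extra": extra or {},  # user-defined, must be JSON-serializable
    }
    (entry / "meta.json").write_text(json.dumps(meta))
    return entry


def folder_size(folder):
    """Total size in bytes of all files under `folder`."""
    return sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as root:
    entry = save_with_metadata(root, "abc123", b"cached bytes", "demo", {"note": "example"})
    saved_meta = json.loads((entry / "meta.json").read_text())
    used_bytes = folder_size(root)
```

With several storage folders, `folder_size` could be evaluated per folder to decide where the next entry goes.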
