neuro-ml / connectome

A library for datasets containing heterogeneous data

Home Page: https://neuro-ml.github.io/connectome/

License: Apache License 2.0

Python 100.00%

python data-processing pipelines

connectome's Introduction

Connectome is a framework for dataset management with a strong emphasis on simplicity, composability and reusability.

Features

Self-consistency: connectome encourages data transformations that keep entries' fields consistent
Caching: caching of transformations works out of the box and supports both RAM and disk
Automatic cache invalidation: connectome tracks all the changes made to a dataset and automatically invalidates the cache when something changes, making sure that your cache is always consistent with the data
Invertible transformations: write consistent pre- and post- processing to build production-ready pipelines

Install

The simplest way is to get it from PyPI:

pip install connectome

Or if you want to try the latest version from GitHub:

git clone https://github.com/neuro-ml/connectome.git
cd connectome
pip install -e .

# or let pip handle the cloning:
pip install git+https://github.com/neuro-ml/connectome.git

Getting started

The docs are located at https://neuro-ml.github.io/connectome/

Also, you can check out our Intro to connectome series of tutorials here

Acknowledgements

Some parts of our automatic cache invalidation machinery were heavily inspired by the cloudpickle project.

connectome's Issues

Make graph calculations non-recursive

The order of computation can be determined right before the caches are calculated. This will hide a huge, uninformative traceback.
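
As a rough illustration of the idea, the sketch below evaluates a dependency graph with an explicit stack instead of recursive calls, so a failure deep in the graph does not produce one traceback frame per node. The graph format (`name -> (function, input names)`) and the `evaluate` helper are hypothetical and are not connectome's internal representation; the graph is assumed to be acyclic.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical graph format: node name -> (function, names of its input nodes).
Graph = Dict[str, Tuple[Callable, List[str]]]


def evaluate(graph: Graph, target: str) -> object:
    """Compute `target` using an explicit stack (assumes an acyclic graph)."""
    results: Dict[str, object] = {}
    stack = [target]
    while stack:
        name = stack[-1]
        if name in results:
            # Already computed (e.g. a shared dependency pushed twice).
            stack.pop()
            continue
        func, inputs = graph[name]
        missing = [i for i in inputs if i not in results]
        if missing:
            # Visit the dependencies first; `name` stays on the stack.
            stack.extend(missing)
        else:
            results[name] = func(*(results[i] for i in inputs))
            stack.pop()
    return results[target]


graph: Graph = {
    "raw": (lambda: [1, 2, 3], []),
    "doubled": (lambda raw: [2 * x for x in raw], ["raw"]),
    "total": (lambda doubled: sum(doubled), ["doubled"]),
}
print(evaluate(graph, "total"))  # 12
```

Because the traversal depth lives in a list rather than in the call stack, very long chains also avoid Python's recursion limit.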

Cache layers don't respect persistent nodes

Weird cache invalidation

The cache gets invalidated if source receives an np.str_ input instead of str.
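
One plausible way such a mismatch can arise (not confirmed by this issue) is a digest that takes the value's type into account: `np.str_` is a subclass of `str`, so the two values compare equal but are different types. The sketch below uses a stand-in subclass `StrLike` to stay stdlib-only; `digest` and `normalize` are hypothetical helpers, not connectome functions.

```python
import hashlib


class StrLike(str):
    """Stand-in for a str subclass such as np.str_ (keeps the example stdlib-only)."""


def digest(value: object) -> str:
    # A type-sensitive digest: the type's name is part of the hashed payload.
    payload = f"{type(value).__qualname__}:{value!r}"
    return hashlib.sha256(payload.encode()).hexdigest()


def normalize(value: object) -> object:
    # Collapse str subclasses to a plain str before hashing.
    return str(value) if isinstance(value, str) else value


plain, wrapped = "subject-01", StrLike("subject-01")
print(plain == wrapped)                                        # True
print(digest(plain) == digest(wrapped))                        # False
print(digest(normalize(plain)) == digest(normalize(wrapped)))  # True
```

Normalizing keys at the boundary is only one option; whether it is the right fix here depends on how the library actually hashes its inputs.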

Rewrite `backward` support

Add support for arguments that don't invalidate the cache

Need a FrozenChain

This will allow stuff like dataset[1:] even if it starts with e.g. a caching block.

Wrap socket.gaierror

Add remote storage support

Inheritance might impede wrapping

If a layer A inherits a value V and a layer B expects V as input, A >> B will raise an error.

Remove the call method from optional

Add init support

Users should be able to define __init__ with custom arguments validation logic.
self will only support setting and getting attributes from a limited set of names, i.e. node names.

Add more contextual information to errors

Need a better way to mark dynamic modules

Thoughts:

modules should be allowed to mark themselves as dynamic
user code should be allowed to mark other modules as dynamic

Fix compatibility with newer cloudpickle versions

Add `exclude`

Which is the opposite of __inherit__

Move node digest to edge's payload

Add mixins support

As a simple way of code reuse without complicated inheritance logic.

Implement `getitem` for Instance objects

Make Chained impossible to inherit from?

Add a plain transform class that can be inherited from

Add `_drop_ram_cache`

Drop only the RAM cache layers from the dataset. Might be useful when switching between machines with large and small amounts of RAM.

Caching is slower than it should be

This is because each time we access the cache we need to calculate the hashes for each node.
Moreover if there is caching to disk, it will check each time whether the cache is present.
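
A minimal sketch of one possible mitigation, assuming the hash of a node depends only on its name and the requested key: memoize the digest so that repeated accesses skip the hashing step. `node_digest` and `cache_lookup` are hypothetical names used for illustration only.

```python
import hashlib
from functools import lru_cache


@lru_cache(maxsize=None)
def node_digest(node_name: str, key: str) -> str:
    """Hypothetical stand-in for an expensive per-node hash computation."""
    return hashlib.sha256(f"{node_name}/{key}".encode()).hexdigest()


def cache_lookup(cache: dict, node_name: str, key: str, compute):
    digest = node_digest(node_name, key)  # Reused after the first access.
    if digest not in cache:
        cache[digest] = compute(key)
    return cache[digest]


cache: dict = {}
for _ in range(3):
    value = cache_lookup(cache, "image", "id-1", str.upper)

info = node_digest.cache_info()
print(value, info.misses, info.hits)  # ID-1 1 2
```

The memoized digests must themselves be dropped whenever the underlying code or data changes, otherwise the shortcut would defeat automatic cache invalidation.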

Replace assertions by exceptions

This will deliver more relevant information

Need something similar to `__getstate__` to let objects manage their versioning.

Some objects already have some sort of versioning support, which might help with false-positive cache invalidation due to insignificant changes.

Hide pickler's traceback

More serializers

numpy with gzip
pickle
json
compound stuff like dicts of anything from above
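
The list above could be sketched as a small registry of (dump, load) pairs, with a compound serializer that handles dicts by dispatching per field. Everything here is hypothetical and stdlib-only: `SERIALIZERS`, `dump_dict` and `load_dict` are illustrative names, and `pickle+gzip` stands in for "numpy with gzip" to avoid a numpy dependency.

```python
import gzip
import json
import pickle
from typing import Any, Callable, Dict, Tuple

# Hypothetical registry: format name -> (dump to bytes, load from bytes).
SERIALIZERS: Dict[str, Tuple[Callable[[Any], bytes], Callable[[bytes], Any]]] = {
    "json": (lambda v: json.dumps(v).encode(), lambda b: json.loads(b.decode())),
    "pickle": (pickle.dumps, pickle.loads),
    "pickle+gzip": (
        lambda v: gzip.compress(pickle.dumps(v)),
        lambda b: pickle.loads(gzip.decompress(b)),
    ),
}


def dump_dict(values: Dict[str, Any], formats: Dict[str, str]) -> Dict[str, bytes]:
    """Serialize each field with the format chosen for it."""
    return {k: SERIALIZERS[formats[k]][0](v) for k, v in values.items()}


def load_dict(blobs: Dict[str, bytes], formats: Dict[str, str]) -> Dict[str, Any]:
    """Inverse of `dump_dict`."""
    return {k: SERIALIZERS[formats[k]][1](b) for k, b in blobs.items()}


values = {"meta": {"id": 1, "tags": ["a", "b"]}, "mask": [0, 1, 1] * 100}
formats = {"meta": "json", "mask": "pickle+gzip"}
blobs = dump_dict(values, formats)
print(load_dict(blobs, formats) == values)  # True
```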

Add str and repr methods to blocks

Add sharding to CacheColumns

Better error message for node duplicates

Add callable default arguments

Merge with impure edges still leads to inconsistent outputs

Make storage entry read-only right after being added
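
A minimal sketch of what this could look like on a POSIX file system: write the entry, then immediately strip all write permissions. `add_read_only` is a hypothetical helper, not part of connectome's storage API.

```python
import os
import stat
import tempfile
from pathlib import Path


def add_read_only(path: Path, data: bytes) -> None:
    """Write an entry and make it read-only for user, group and others."""
    path.write_bytes(data)
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


with tempfile.TemporaryDirectory() as root:
    entry = Path(root) / "entry.bin"
    add_read_only(entry, b"payload")
    mode = stat.S_IMODE(os.stat(entry).st_mode)
    print(oct(mode))  # 0o444 on POSIX
```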

Pickling functions

Currently, when pickling code objects, co_filename is also saved. Should we remove it?
This is a problem for functions defined inside configs, mostly lambdas.
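
To make the question concrete, here is a simplified sketch of a code-object digest that ignores `co_filename`, so the same lambda defined in two different config files hashes identically. It is not connectome's or cloudpickle's implementation: `code_digest` is a hypothetical helper, and it deliberately ignores closures, default arguments and referenced globals.

```python
import hashlib
import types


def code_digest(code: types.CodeType) -> str:
    """Hash a code object without its file name (simplified sketch)."""
    h = hashlib.sha256()
    h.update(code.co_code)
    h.update(repr(code.co_names).encode())
    h.update(repr(code.co_varnames).encode())
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            # Nested functions and lambdas are hashed recursively.
            h.update(code_digest(const).encode())
        else:
            h.update(repr(const).encode())
    return h.hexdigest()


def make_lambda(filename: str):
    # Compile the same source as if it lived in different config files.
    return eval(compile("lambda x: x * 2 + 1", filename, "eval"))


a = make_lambda("/configs/exp1.config")
b = make_lambda("/configs/exp2.config")
print(a.__code__.co_filename == b.__code__.co_filename)     # False
print(code_digest(a.__code__) == code_digest(b.__code__))   # True
```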

Caching needs synchronization

If the cache is being filled in different threads / processes this often leads to failure.
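
One common way to avoid readers seeing half-written entries is to write to a temporary file in the same directory and then rename it into place. The sketch below assumes a POSIX file system, where `os.replace` is atomic; `atomic_write` is a hypothetical helper and says nothing about how connectome's storage is actually organized.

```python
import os
import tempfile
import threading
from pathlib import Path


def atomic_write(path: Path, data: bytes) -> None:
    """Write `data` so that `path` is either absent or complete."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_name, path)  # Atomic rename on POSIX.
    finally:
        # Clean up the temporary file if the rename did not happen.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)


with tempfile.TemporaryDirectory() as root:
    entry = Path(root) / "entry.bin"
    payload = b"x" * 100_000
    threads = [
        threading.Thread(target=atomic_write, args=(entry, payload))
        for _ in range(8)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(entry.read_bytes() == payload)        # True
    print(len(list(Path(root).iterdir())))      # 1
```

This only protects the integrity of a single entry; avoiding duplicate computation across processes would additionally need some form of locking.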

The first argument in a source method can't be Local

But it should be able to be Local

Need a metaclass for mixins

Add placeholder fields

Automatically generate stubs for class-based transforms

Persistent fields may randomly disappear from Chain

Add another hash check after the file is written to the storage
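
A minimal sketch of such a check, assuming a content-addressed layout where the file name is the SHA-256 of the data (an assumption made for this example, not a statement about connectome's storage format). The hypothetical `store` helper re-reads the file after writing and rejects it if the digest does not match.

```python
import hashlib
import tempfile
from pathlib import Path


def store(root: Path, data: bytes) -> Path:
    """Write `data` under its SHA-256 name and verify what landed on disk."""
    expected = hashlib.sha256(data).hexdigest()
    path = root / expected
    path.write_bytes(data)
    # Second check: hash the bytes that were actually written.
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    if actual != expected:
        path.unlink()
        raise OSError(f"Corrupted write: expected {expected}, got {actual}")
    return path


with tempfile.TemporaryDirectory() as root:
    saved = store(Path(root), b"some cached value")
    print(saved.name == hashlib.sha256(b"some cached value").hexdigest())  # True
```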

Unify the Layer and Container interface

Add possibility to access methods from the same layer

This will avoid boilerplate such as

def _shape(...):
    ...

def shape(_shape):
    return _shape

def image(image, _shape):
    ...

Multiple storage folders
Track disk usage
Save additional metadata: timestamp, current user, project name, other user-defined json data
Keep folders permissions and ownership