activeloopai / deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

Home Page: https://activeloop.ai

License: Mozilla Public License 2.0

Languages: Python 99.98%, Dockerfile 0.01%, Shell 0.01%
Topics: datasets, deep-learning, machine-learning, data-science, pytorch, tensorflow, data-version-control, python, ai, ml

deeplake's Introduction


Deep Lake: Database for AI


Read this in other languages: 简体中文 (Simplified Chinese)

What is Deep Lake?

Deep Lake is a Database for AI powered by a storage format optimized for deep-learning applications. Deep Lake can be used for:

  1. Storing data and vectors while building LLM applications
  2. Managing datasets while training deep learning models

Deep Lake simplifies the deployment of enterprise-grade LLM-based products by offering storage for all data types (embeddings, audio, text, videos, images, pdfs, annotations, etc.), querying and vector search, data streaming while training models at scale, data versioning and lineage, and integrations with popular tools such as LangChain, LlamaIndex, Weights & Biases, and many more. Deep Lake works with data of any size, it is serverless, and it enables you to store all of your data in your own cloud and in one place. Deep Lake is used by Intel, Bayer Radiology, Matterport, ZERO Systems, Red Cross, Yale, & Oxford.

Deep Lake includes the following features:

Multi-Cloud Support (S3, GCP, Azure) Use one API to upload, download, and stream datasets to/from S3, Azure, GCP, Activeloop cloud, local storage, or in-memory storage. Compatible with any S3-compatible storage such as MinIO.
Native Compression with Lazy NumPy-like Indexing Store images, audio, and videos in their native compression. Slice, index, iterate, and interact with your data like a collection of NumPy arrays in your system's memory. Deep Lake lazily loads data only when needed, e.g., when training a model or running queries.
Dataset Version Control Commits, branches, checkout - concepts you are already familiar with from your code repositories can now be applied to your datasets as well (see the sketch after this list)!
Dataloaders for Popular Deep Learning Frameworks Deep Lake comes with built-in dataloaders for PyTorch and TensorFlow. Train your model with a few lines of code - we even take care of dataset shuffling. :)
Integrations with Powerful Tools Deep Lake has integrations with LangChain and LlamaIndex as a vector store for LLM apps, Weights & Biases for data lineage during model training, and MMDetection for training object detection models.
100+ most-popular image, video, and audio datasets available in seconds The Deep Lake community has uploaded 100+ image, video, and audio datasets such as MNIST, COCO, ImageNet, CIFAR, GTZAN, and others.
Instant Visualization Support in the Deep Lake App Deep Lake datasets are instantly visualized with bounding boxes, masks, annotations, etc. in Deep Lake Visualizer (see below).
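
As a rough illustration of the version-control workflow referenced above, here is a minimal sketch (assuming the current deeplake Python API, where datasets expose commit, checkout, and log; the dataset path is hypothetical):

import deeplake

ds = deeplake.empty("./version_control_demo")   # create a new local dataset
ds.create_tensor("labels")
ds.labels.append(1)
first_commit = ds.commit("add first label")     # snapshot the current state

ds.checkout("experiment", create=True)          # create and switch to a new branch
ds.labels.append(2)
ds.commit("add second label on the experiment branch")

ds.checkout("main")                             # back to main; the second label is not visible here
ds.log()                                        # print the commit history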

Visualizer

🚀 Performance

Deep Lake's performant dataloader, built in C++, speeds up data streaming by more than 2x compared to Hub 2.x (Ofeidis et al. 2022, Hambardzumyan et al. 2023).

🚀 How to install Deep Lake

Deep Lake can be installed using pip:

pip3 install deeplake

By default, Deep Lake does not install dependencies for audio, video, google-cloud, and other features. Details on all installation options are available here.
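
As a quick sanity check after installation, you can import the package and print its version (a minimal sketch; it only assumes the package exposes a version string):

import deeplake

print(deeplake.__version__)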

To access all of Deep Lake's features, please register in the Deep Lake App.

🧠 Deep Lake Code Examples by Application

Vector Store Applications

Using Deep Lake as a Vector Store for building LLM applications:
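
For example, a minimal sketch using the LangChain integration (import paths and parameter names vary across LangChain versions, and an OpenAI API key is assumed to be configured; treat this as illustrative rather than authoritative):

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

embeddings = OpenAIEmbeddings()

# Create (or open) a local Deep Lake vector store and add some text chunks
db = DeepLake(dataset_path="./my_deeplake_vectorstore", embedding=embeddings)
db.add_texts(["Deep Lake is a database for AI.", "It integrates with LangChain and LlamaIndex."])

# Retrieve the chunks most similar to a query
docs = db.similarity_search("What is Deep Lake?", k=2)
print(docs[0].page_content)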

Deep Learning Applications

Using Deep Lake for managing data while training Deep Learning models:
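
For example, a minimal sketch that streams a public dataset straight into a PyTorch dataloader (assumes the hub://activeloop/mnist-train dataset is still publicly available and exposes "images" and "labels" tensors):

import deeplake

# Stream the public MNIST dataset directly from Activeloop storage
ds = deeplake.load("hub://activeloop/mnist-train")

# Built-in PyTorch dataloader; batching and shuffling are handled by Deep Lake
dataloader = ds.pytorch(batch_size=32, shuffle=True, num_workers=2)

for batch in dataloader:
    images, labels = batch["images"], batch["labels"]
    # ... run a training step here ...
    break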

⚙️ Integrations

Deep Lake offers integrations with other tools in order to streamline your deep learning workflows. Current integrations include:

  • Model Training

    • Stream data while training thousands of pre-built models using MMDetection, a popular open-source object detection toolbox based on PyTorch. Learn more in this tutorial.
  • Experiment Tracking

    • Track experiments and achieve full model reproducibility using Deep Lake and Weights & Biases. Our integration automatically pushes dataset-related information (uri, commit hash, view id) to your W&B runs. Further details are available in our model-reproducibility playbook.
  • LLM Apps

    • Use Deep Lake as a vector store for LLM apps via the LangChain and LlamaIndex integrations (see the Vector Store example above).

📚 Documentation

Getting started guides, examples, tutorials, API reference, and other useful information can be found on our documentation page.

🎓 For Students and Educators

Deep Lake users can access and visualize a variety of popular datasets through a free integration with the Deep Lake App. Universities can get up to 1TB of data storage and 100,000 queries on the Tensor Database for free every month. Chat with us on our website to claim access!

👩‍💻 Comparisons to Familiar Tools

Deep Lake vs Chroma

Both Deep Lake & ChromaDB enable users to store and search vectors (embeddings) and offer integrations with LangChain and LlamaIndex. However, they are architecturally very different. ChromaDB is a Vector Database that can be deployed locally or on a server using Docker and will offer a hosted solution shortly. Deep Lake is a serverless Vector Store deployed on the user’s own cloud, locally, or in-memory. All computations run client-side, which enables users to support lightweight production apps in seconds. Unlike ChromaDB, Deep Lake’s data format can store raw data such as images, videos, and text, in addition to embeddings. ChromaDB is limited to light metadata on top of the embeddings and has no visualization. Deep Lake datasets can be visualized and version controlled. Deep Lake also has a performant dataloader for fine-tuning your Large Language Models.

Deep Lake vs Pinecone

Both Deep Lake and Pinecone enable users to store and search vectors (embeddings) and offer integrations with LangChain and LlamaIndex. However, they are architecturally very different. Pinecone is a fully-managed Vector Database that is optimized for highly demanding applications requiring a search for billions of vectors. Deep Lake is serverless. All computations run client-side, which enables users to get started in seconds. Unlike Pinecone, Deep Lake’s data format can store raw data such as images, videos, and text, in addition to embeddings. Deep Lake datasets can be visualized and version controlled. Pinecone is limited to light metadata on top of the embeddings and has no visualization. Deep Lake also has a performant dataloader for fine-tuning your Large Language Models.

Deep Lake vs Weaviate

Both Deep Lake and Weaviate enable users to store and search vectors (embeddings) and offer integrations with LangChain and LlamaIndex. However, they are architecturally very different. Weaviate is a Vector Database that can be deployed in a managed service or by the user via Kubernetes or Docker. Deep Lake is serverless. All computations run client-side, which enables users to support lightweight production apps in seconds. Unlike Weaviate, Deep Lake’s data format can store raw data such as images, videos, and text, in addition to embeddings. Deep Lake datasets can be visualized and version controlled. Weaviate is limited to light metadata on top of the embeddings and has no visualization. Deep Lake also has a performant dataloader for fine-tuning your Large Language Models.

Deep Lake vs DVC

Deep Lake and DVC offer dataset version control similar to git for data, but their methods for storing data differ significantly. Deep Lake converts and stores data as chunked compressed arrays, which enables rapid streaming to ML models, whereas DVC operates on top of data stored in less efficient traditional file structures. The Deep Lake format makes dataset versioning significantly easier than the traditional file structures used by DVC when datasets are composed of many files (i.e., many images). An additional distinction is that DVC primarily uses a command-line interface, whereas Deep Lake is a Python package. Lastly, Deep Lake offers an API to easily connect datasets to ML frameworks and other common ML tools and enables instant dataset visualization through Activeloop's visualization tool.

Deep Lake vs MosaicML MDS format
  • Data Storage Format: Deep Lake operates on a columnar storage format, whereas MDS utilizes a row-wise storage approach. This fundamentally impacts how data is read, written, and organized in each system.
  • Compression: Deep Lake offers a more flexible compression scheme, allowing control over both chunk-level and sample-level compression for each column or tensor. This feature eliminates the need for additional compressions like zstd, which would otherwise demand more CPU cycles for decompressing on top of formats like jpeg.
  • Shuffling: MDS currently offers more advanced shuffling strategies.
  • Version Control & Visualization Support: A notable feature of Deep Lake is its native version control and in-browser data visualization, which are not present in the MosaicML data format. This can provide significant advantages in managing, understanding, and tracking different versions of the data.
Deep Lake vs TensorFlow Datasets (TFDS)

Deep Lake and TFDS seamlessly connect popular datasets to ML frameworks. Deep Lake datasets are compatible with both PyTorch and TensorFlow, whereas TFDS datasets are only compatible with TensorFlow. A key difference between Deep Lake and TFDS is that Deep Lake datasets are designed for streaming from the cloud, whereas TFDS must be downloaded locally prior to use. As a result, with Deep Lake, one can import datasets directly from TensorFlow Datasets and stream them either to PyTorch or TensorFlow. In addition to providing access to popular publicly available datasets, Deep Lake also offers powerful tools for creating custom datasets, storing them on a variety of cloud storage providers, and collaborating with others via a simple API. TFDS is primarily focused on giving the public easy access to commonly available datasets, and management of custom datasets is not its primary focus. A full comparison article can be found here.

Deep Lake vs HuggingFace

Deep Lake and HuggingFace offer access to popular datasets, but Deep Lake primarily focuses on computer vision, whereas HuggingFace focuses on natural language processing. Hugging Face Transformers and other computational tools for NLP are not analogous to features offered by Deep Lake.

Deep Lake vs WebDatasets

Deep Lake and WebDatasets both offer rapid data streaming across networks. They have nearly identical streaming speeds because the underlying network requests and data structures are very similar. However, Deep Lake offers superior random access and shuffling, its simple API is in Python rather than on the command line, and Deep Lake enables simple indexing and modification of the dataset without having to recreate it.

Deep Lake vs Zarr

Deep Lake and Zarr both offer storage of data as chunked arrays. However, Deep Lake is primarily designed for returning data as arrays using a simple API, rather than actually storing raw arrays (even though that's also possible). Deep Lake stores data in use-case-optimized formats, such as jpeg or png for images, or mp4 for video, and the user treats the data as if it's an array, because Deep Lake handles all the data processing in between. Deep Lake offers more flexibility for storing arrays with dynamic shape (ragged tensors), and it provides several features that are not natively available in Zarr, such as version control, data streaming, and connecting data to ML frameworks.

Community

Join our Slack community to learn more about unstructured dataset management using Deep Lake and to get help from the Activeloop team and other users.

We'd love your feedback; please complete our 3-minute survey.

As always, thanks to our amazing contributors!

Made with contributors-img.

Please read CONTRIBUTING.md to get started with making contributions to Deep Lake.

README Badge

Using Deep Lake? Add a README badge to let everyone know:

deeplake

[![deeplake](https://img.shields.io/badge/powered%20by-Deep%20Lake%20-ff5a1f.svg)](https://github.com/activeloopai/deeplake)

Disclaimers

Dataset Licenses

Deep Lake users may have access to a variety of publicly available datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have a license to use the datasets. It is your responsibility to determine whether you have permission to use the datasets under their license.

If you're a dataset owner and do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thank you for your contribution to the ML community!

Usage Tracking

By default, we collect usage data using Bugout (here's the code that does it). It does not collect user data other than anonymized IP address data, and it only logs the Deep Lake library's own actions. This helps our team understand how the tool is used and how to build features that matter to you! After you register with Activeloop, data is no longer anonymous. You can always opt out of reporting by setting the environment variable BUGGER_OFF to True:
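
A minimal sketch of opting out (assuming the variable is read from the environment when deeplake runs; you can equally set it in your shell before launching the process):

import os

os.environ["BUGGER_OFF"] = "True"   # disable usage reporting; set before using deeplake

import deeplake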

Citation

If you use Deep Lake in your research, please cite Activeloop using:

@inproceedings{deeplake,
  title = {Deep Lake: a Lakehouse for Deep Learning},
  author = {Hambardzumyan, Sasun and Tuli, Abhinav and Ghukasyan, Levon and Rahman, Fariz and Topchyan, Hrant and Isayan, David and Harutyunyan, Mikayel and Hakobyan, Tatevik and Stranic, Ivo and Buniatyan, Davit},
  url = {https://www.cidrdb.org/cidr2023/papers/p69-buniatyan.pdf},
  booktitle = {Proceedings of CIDR},
  year = {2023},
}

Acknowledgment

This technology was inspired by our research work at Princeton University. We would like to thank William Silversmith @SeungLab for his awesome cloud-volume tool.

deeplake's People

Contributors

abhinavtuli, activesoull, adolkhan, artgish, as-engineer, benchislett, davidbuniat, dependabot-preview[bot], dhiganthrao, diveafall, edogrigqv2, farizrahman4u, fayazrahman, haiyangdeperci, imshashank, istranic, khustup, khustup2, kristinagrig06, levongh, mikayelh, mynameisvinn, nvoxland, nvoxland-al, progerdav, sounakr, sparkingdark, tatevikh, thisiseshan, verbose-void


deeplake's Issues

Fixes in to_tensorflow method

Observed a couple of problems while converting stored datasets to TensorFlow format that need some small fixes.

to_tensorflow fails when the meta information for a tensor includes dtype="object" (the "object" dtype has been used for images, area, id, and bbox in the Coco dataset - https://github.com/activeloopai/Hub/blob/master/examples/coco/upload_coco2017.py#L24).
A fix for this is to keep the dtype as "uint8" or something similar while uploading. The Coco example needs to be updated to reflect this.
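
For example, a minimal sketch of the suggested workaround, reusing the from_tensors/store pattern shown elsewhere in these issues (shapes are illustrative; this assumes tensor.from_array preserves the NumPy dtype in the meta information):

import numpy as np
from hub import tensor, dataset

# Cast to uint8 up front so the stored dtype is not "object"
images = tensor.from_array(np.zeros((4, 512, 512), dtype="uint8"))
labels = tensor.from_array(np.zeros((4,), dtype="uint8"))

ds = dataset.from_tensors({"images": images, "labels": labels})
ds.store("username/dataset")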

to_tensorflow also fails when it gets shape=(1,) in the meta while the actual object has multiple dimensions, for example an image.
This can be fixed by commenting out this line https://github.com/activeloopai/Hub/blob/master/hub/collections/dataset/core.py#L633, which will set output_shapes to None by default.

to_pytorch works fine in both the above cases.

Dataset Caching

Describe the feature

Zarr caching is per-array, not shared. Please come up with shared caching. Once the dataset is uploaded, we need our shared storage to have options to write to the array. The Dataset will have a .commit() function; once the cache signals that the dataset is ready, call .commit().

Additional notes

PR to release/v1.0

Merging all together

Describe the feature

  • Combine all PRs together
  • Finalize the user API
  • Test backend
  • Add documentation

Notes

PR to release/v1.0

s3 access via IAM-role doesn't seem to work

Trying to connect to s3 without explicit credentials by running from EC2 with an IAM role that allows access:

datahub = hub.s3('my-bucket').connect()

(boto seems to support this mode: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#iam-role)

but I get access denied:

Traceback (most recent call last):
  File "hub_load_s3.py", line 13, in <module>
    imagenet = datahub.open('imagenet/test:latest')
  File "/usr/local/lib/python3.6/dist-packages/hub/bucket.py", line 41, in open
    jsontext = self._storage.get_or_none(os.path.join(name, "info.json"))
  File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/local/lib/python3.6/dist-packages/hub/storage/retry_wrapper.py", line 30, in get_or_none
    return self._internal.get_or_none(path)
  File "/usr/local/lib/python3.6/dist-packages/hub/storage/s3.py", line 55, in get_or_none
    raise err
  File "/usr/local/lib/python3.6/dist-packages/hub/storage/s3.py", line 48, in get_or_none
    Key=path,
  File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 661, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied

I do manage to access s3 from this machine just fine with aws-cli without configured credentials.

Is this mode of s3 authentication supported?

Add VIRAT Video Dataset

Describe the dataset

Add the VIRAT dataset to Hub, so that the following works:

import hub
ds = hub.load("username/VIRAT")

Here's a tutorial for uploading datasets using Hub.

Concerns related to provided guidelines.

Cannot find the Uploading MNIST, Uploading CIFAR, and Uploading COCO URLs provided in the Guidelines. It seems that they do not exist.

Creation of the dataset seems to be successful:

from hub import tensor, dataset
import numpy as np

images = tensor.from_array(np.zeros((4, 512, 512)))
labels = tensor.from_array(np.zeros((4, 512, 512)))

ds = dataset.from_tensors({"images": images, "labels": labels})
ds.store("username/dataset")  # Upload

but I cannot see the uploaded dataset at https://app.activeloop.ai/datasets.

The subsequent issues stem from the lack of information mentioned above.

Add code test coverage

Describe Task

Add code test coverage for the repository.

Notes

Feel free to use any online resource and connect it with CircleCI.

Add Barcelona Dataset

Describe the dataset

Add the Barcelona dataset to Hub, so that the following works:

import hub
ds = hub.load("username/barcelona")

Here's a tutorial for uploading datasets using Hub.

Add Pascal Dataset

Describe the dataset

Add the Pascal dataset to Hub, so that the following works:

import hub
ds = hub.load("username/pascal")

Here's a tutorial for uploading datasets using Hub.

Same values in Dataset

I have a Dataset of logs which is defined like this:

logs = Dataset(dtype={"train_acc": float, "train_loss": float, "val_acc": float, "val_loss": float},
                         shape=(epochs,), url='./logs', mode='w')

I also have some average metrics stored in a dict, i.e.

metrics =  {'val_loss': AverageValue,   'val_acc': AverageValue, 'train_loss': .......}
metrics['val_loss'].avg   # tensor(1.2748, device='cuda:0')
metrics['val_acc'].avg    # tensor(0.5000, device='cuda:0')

To store those metrics in logs, I run:

for key, value in self.meters.items():
    self.logs[key][value.count - 1] = value.avg  #value.count is an index of a value starting from 1

But when I run

logs['val_acc'][0].numpy()
logs['val_loss'][0].numpy()
logs['train_acc'][0].numpy()
logs['train_loss'][0].numpy()

all these values are equal.

optional dependency torch failing on import hub

trying to use the package fails:

>>> import hub
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ram/.pyenv/versions/3.8.2/lib/python3.8/site-packages/hub/__init__.py", line 2, in <module>
    from .creds import Base as Creds
  File "/Users/ram/.pyenv/versions/3.8.2/lib/python3.8/site-packages/hub/creds.py", line 4, in <module>
    from .bucket import Bucket
  File "/Users/ram/.pyenv/versions/3.8.2/lib/python3.8/site-packages/hub/bucket.py", line 7, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'

This is a bad first experience, really. Is there a way to use hub without the torch import failing on me?
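
One common way to handle this kind of optional dependency is a guarded import. A hypothetical sketch of what hub/bucket.py could do (this is not the project's actual fix, and to_pytorch here is only a stand-in for whatever code genuinely needs torch):

# Treat torch as an optional dependency instead of importing it at module load time
try:
    import torch
except ImportError:
    torch = None

def to_pytorch(dataset):
    if torch is None:
        raise ImportError(
            "PyTorch is required for to_pytorch(); install it with `pip install torch`."
        )
    ...  # build and return the torch-backed view of the dataset here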

Sliced views of datasets

Describe Feature

Implement virtual datasets

  1. Get and set a dataset from the subview of the dataset.

Additional Notes

PR to release/v1.0

Support TFDS datasets

Describe your feature request

Create a converter that takes any TensorFlow Datasets (TFDS) dataset and converts it into the hub format.

from hub import datasets
import tensorflow_datasets as tfds

# use a distinct variable name so the tfds module is not shadowed
tf_ds = tfds.load('mnist', split='train', shuffle_files=True)
ds = datasets.from_tensorflow(tf_ds)
ds = ds.store("/tmp/mnist")
hub_tf_ds = ds.to_tensorflow()

# assert hub_tf_ds == tf_ds <- sample from both datasets and check if they are the same
hub_py_ds = ds.to_pytorch()

# assert hub_tf_ds == hub_py_ds <- sample from both datasets and check if they are the same

A more advanced test would require running the conversion for each dataset in parallel:

for name in tfds.list_builders():
    tf_ds = tfds.load(name, ...)
    ds = datasets.from_tensorflow(tf_ds)
    ds = ds.store(f"/tmp/{name}")

Notes

  • General advice: start with simple, small datasets (low-hanging fruit), commit often, maybe with mid-PRs to master, then steadily generalize your converter.
  • TFDS has FeatureDict for describing data archetypes; we need to rely on them.
  • This task would require significantly extending how Hub handles different data and data types (dtags).
  • At every step, think about how uploading a dataset could be simplified from the user's perspective.
  • While converting datasets, take compression into consideration.

Please follow Contribution guidelines and look into uploading dataset guidelines for more details about meta information.

Create a tutorial on Colab

Create a tutorial on Colab

Users should be able to load a dataset, train a model, and upload the dataset. Feel free to start from a small example and then make the example comprehensive.

Add CI/CD

Describe your feature request

Please add CI/CD for open source development

  • Will automatically run tests with
    • Embedded AWS S3 credentials connection tests
    • Embedded GCS credentials tests
    • PyTorch/Tensorflow tests
  • Move docs here from dataflow and automatically deploy docs from here
  • On tag, build a package and deploy it to PyPI
  • Add a CI/CD badge to the README
  • Add a test coverage badge to the README

Kinetics-700

Describe the dataset

Add the Kinetics-700 dataset to Hub, so that the following works:

import hub
ds = hub.load("username/kinetics-700")

Here's a tutorial for uploading datasets using Hub.

Create a Colab demo

Create a tutorial on Colab

Users should be able to load a dataset, train a model, and upload the dataset. Feel free to start from a small example and then make the example comprehensive.

Serialization of data types

Describe your feature

Implement the Serialization of Dataset Structure. Then implement Tensor derivatives such as Image, ClassLabel, Mask, Segmentation, Polygon, Bounding Box, Tabular, and other TFDS Features.

There are two subtasks

  1. Serialize and Deserialize the metadata into meta.json (see the sketch after this list)
  2. Implement Tensor Derivatives
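
A hypothetical sketch of subtask 1 (the field names and layout below are illustrative only, not the project's actual meta.json schema):

import json

# Serialize a hypothetical dataset structure to meta.json ...
meta = {
    "tensors": {
        "image": {"shape": [None, 512, 512, 3], "dtype": "uint8", "dtag": "image"},
        "label": {"shape": [None], "dtype": "int64", "dtag": "class_label"},
    }
}
with open("meta.json", "w") as f:
    json.dump(meta, f, indent=2)

# ... and deserialize it back
with open("meta.json") as f:
    restored = json.load(f)
assert restored["tensors"]["image"]["dtype"] == "uint8"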

Additional Notes

Have a call with @edogrigqv2 to get started.
Please open a pull request to release/v1.0 and ask for a review.

From PyTorch dataset to Hub format

Describe your feature request

Create a converter that takes any PyTorch dataset and converts it into a hub format.

A simple test would be

from hub import datasets
import torch
import torchvision

# 'path' is a placeholder for the local ImageNet root; torchvision's ImageNet takes a 'transform' keyword
imagenet = torchvision.datasets.ImageNet(path, split='train', transform=...)
ds = datasets.from_pytorch(imagenet)
ds = ds.store("/tmp/imagenet")
ds = ds.to_pytorch()

Please follow Contribution guidelines and look into uploading dataset guidelines for more details about meta information.

Errors in Dataset docs

I came upon the following errors while referencing the docs:

  1. Under the Guidelines sub-heading, there is an enumeration error. Proposed solution: using an indentation of 4, this issue could be solved. To make the page uniform, the same change is applied to every code block.
  2. Under the How to Upload a Dataset sub-heading, the links for the MNIST, CIFAR, and COCO examples are updated.

PermissionException on AWS

Facing issues with ds.store() on AWS, while the same code works properly locally.
Error: hub.exceptions.PermissionException: No permision to store the dataset at s3://snark-hub/public/abhinav/ds

For now, I got it working using sudo rm -rf /tmp/dask-worker-space/.
A proper fix is needed.

class labels access and labels shape

Hi,
If I understand it correctly, the shape of 'labels' which is (frame_count, 11, 400, 7) corresponds to the 7 box values (center coordinates, box dimensions and heading) for a maximum of 400 obstacles in each of the images, lasers_camera_projection and lasers_range_image.
I have two questions here,

  1. Where can I access the class labels? Eg. Pedestrian, Vehicle etc.
  2. Considering the number of images, lasers_camera_projection, and lasers_range_image to be 5 each, should it be 15 instead of 11?

Store and Load Models

Describe Feature

We want to be able to store and load models, similar to datasets. The model has a computational graph and weights. Look into how PyTorch saves and loads models. Check ONNX or TF later.

from hub import model

resnet = model.load("username/resnet")
model.store("username/resnet2", resnet)

Notes

  1. Start from Pytorch
  2. Implement TensorFlow
  3. Include ONNX and other types.

Dynamic Shape Handling

Describe Feature details

Implement dynamic shapes for tensors with corresponding chunking:

  1. Automatic size extension/expansion.
  2. Boundary checking

Additional notes

PR to release/v1.0

Datatypes to Zarr files

Describe the feature

Zarr needs max_shape, url, and other parameters. We need a component that parses the hierarchy of datatypes and lets the dataset create the Zarr arrays, similar to StorageTensor and DynamicTensor (it's an API). This is implemented in _flatten; we need to make it more robust.

Solution

  • Decide on API
  • Add max_shape
  • Decide on how to chunk automatically

Available Datasets

I want to thank you first for the great work.
My question is: are you planning to provide pre-loaded datasets like the ImageNet example?
I tried to load the nuScenes dataset, but it is not working.

hub.array assignment not as robust as np.array

hub_array[0, :,:,:] = image
raises a MemoryError: unable to allocate 112GiB with shape ...

while
hub_array[0] = image
works fine as expected.

With np.array, both versions work the same way.

It took us several hours to debug the issue, and in the end it was just a slight incompatibility with np.array. It would be nice to support the first assignment in case someone else tries it expecting np.array behavior.
