activeloopai / deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

Home Page: https://activeloop.ai

License: Mozilla Public License 2.0

Languages: Python 99.98%, Dockerfile 0.01%, Shell 0.01%
Topics: datasets, deep-learning, machine-learning, data-science, pytorch, tensorflow, data-version-control, python, ai, ml

deeplake's Introduction


Deep Lake: Database for AI


Read this in other languages: 简体中文 (Simplified Chinese)

What is Deep Lake?

Deep Lake is a Database for AI powered by a storage format optimized for deep-learning applications. Deep Lake can be used for:

  1. Storing data and vectors while building LLM applications
  2. Managing datasets while training deep learning models

Deep Lake simplifies the deployment of enterprise-grade LLM-based products by offering storage for all data types (embeddings, audio, text, videos, images, pdfs, annotations, etc.), querying and vector search, data streaming while training models at scale, data versioning and lineage, and integrations with popular tools such as LangChain, LlamaIndex, Weights & Biases, and many more. Deep Lake works with data of any size, it is serverless, and it enables you to store all of your data in your own cloud and in one place. Deep Lake is used by Intel, Bayer Radiology, Matterport, ZERO Systems, Red Cross, Yale, & Oxford.

Deep Lake includes the following features:

Multi-Cloud Support (S3, GCP, Azure) Use one API to upload, download, and stream datasets to/from S3, Azure, GCP, Activeloop cloud, local storage, or in-memory storage. Compatible with any S3-compatible storage such as MinIO.
Native Compression with Lazy NumPy-like Indexing Store images, audio, and videos in their native compression. Slice, index, iterate, and interact with your data like a collection of NumPy arrays in your system's memory. Deep Lake lazily loads data only when needed, e.g., when training a model or running queries.
Dataset Version Control Commits, branches, checkout - concepts you are already familiar with from your code repositories can now be applied to your datasets as well (see the sketch after this list)!
Dataloaders for Popular Deep Learning Frameworks Deep Lake comes with built-in dataloaders for PyTorch and TensorFlow. Train your model with a few lines of code - we even take care of dataset shuffling. :)
Integrations with Powerful Tools Deep Lake has integrations with LangChain and LlamaIndex as a vector store for LLM apps, Weights & Biases for data lineage during model training, and MMDetection for training object detection models.
100+ most-popular image, video, and audio datasets available in seconds The Deep Lake community has uploaded 100+ image, video, and audio datasets such as MNIST, COCO, ImageNet, CIFAR, GTZAN, and others.
Instant Visualization Support in the Deep Lake App Deep Lake datasets are instantly visualized with bounding boxes, masks, annotations, etc. in Deep Lake Visualizer (see below).
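
As a rough illustration of the version-control workflow referenced above, here is a minimal sketch (assuming the current deeplake Python API, where datasets expose commit, checkout, and log; the dataset path is hypothetical):

import deeplake

ds = deeplake.empty("./version_control_demo")   # create a new local dataset
ds.create_tensor("labels")
ds.labels.append(1)
first_commit = ds.commit("add first label")     # snapshot the current state

ds.checkout("experiment", create=True)          # create and switch to a new branch
ds.labels.append(2)
ds.commit("add second label on the experiment branch")

ds.checkout("main")                             # back to main; the second label is not visible here
ds.log()                                        # print the commit history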

Visualizer

🚀 Performance

Deep Lake's performant dataloader, built in C++, speeds up data streaming by more than 2x compared to Hub 2.x (Ofeidis et al. 2022, Hambardzumyan et al. 2023).

🚀 How to install Deep Lake

Deep Lake can be installed using pip:

pip3 install deeplake

By default, Deep Lake does not install dependencies for audio, video, google-cloud, and other features. Details on all installation options are available here.
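
As a quick sanity check after installation, you can import the package and print its version (a minimal sketch; it only assumes the package exposes a version string):

import deeplake

print(deeplake.__version__)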

To access all of Deep Lake's features, please register in the Deep Lake App.

🧠 Deep Lake Code Examples by Application

Vector Store Applications

Using Deep Lake as a Vector Store for building LLM applications:
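
For example, a minimal sketch using the LangChain integration (import paths and parameter names vary across LangChain versions, and an OpenAI API key is assumed to be configured; treat this as illustrative rather than authoritative):

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

embeddings = OpenAIEmbeddings()

# Create (or open) a local Deep Lake vector store and add some text chunks
db = DeepLake(dataset_path="./my_deeplake_vectorstore", embedding=embeddings)
db.add_texts(["Deep Lake is a database for AI.", "It integrates with LangChain and LlamaIndex."])

# Retrieve the chunks most similar to a query
docs = db.similarity_search("What is Deep Lake?", k=2)
print(docs[0].page_content)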

Deep Learning Applications

Using Deep Lake for managing data while training Deep Learning models:
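
For example, a minimal sketch that streams a public dataset straight into a PyTorch dataloader (assumes the hub://activeloop/mnist-train dataset is still publicly available and exposes "images" and "labels" tensors):

import deeplake

# Stream the public MNIST dataset directly from Activeloop storage
ds = deeplake.load("hub://activeloop/mnist-train")

# Built-in PyTorch dataloader; batching and shuffling are handled by Deep Lake
dataloader = ds.pytorch(batch_size=32, shuffle=True, num_workers=2)

for batch in dataloader:
    images, labels = batch["images"], batch["labels"]
    # ... run a training step here ...
    break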

⚙️ Integrations

Deep Lake offers integrations with other tools in order to streamline your deep learning workflows. Current integrations include:

  • Model Training

    • Stream data while training thousands of pre-built models using MMDetection, a popular open-source object detection toolbox based on PyTorch. Learn more in this tutorial.
  • Experiment Tracking

    • Track experiments and achieve full model reproducibility using Deep Lake and Weights & Biases. Our integration automatically pushes dataset-related information (uri, commit hash, view id) to your W&B runs. Further details are available in our model-reproducibility playbook.
  • LLM Apps

    • Use Deep Lake as a vector store for LLM apps via the LangChain and LlamaIndex integrations (see the Vector Store example above).

📚 Documentation

Getting started guides, examples, tutorials, API reference, and other useful information can be found on our documentation page.

🎓 For Students and Educators

Deep Lake users can access and visualize a variety of popular datasets through a free integration with the Deep Lake App. Universities can get up to 1TB of data storage and 100,000 queries on the Tensor Database for free every month. Chat with us on our website to claim access!

👩‍💻 Comparisons to Familiar Tools

Deep Lake vs Chroma

Both Deep Lake & ChromaDB enable users to store and search vectors (embeddings) and offer integrations with LangChain and LlamaIndex. However, they are architecturally very different. ChromaDB is a Vector Database that can be deployed locally or on a server using Docker and will offer a hosted solution shortly. Deep Lake is a serverless Vector Store deployed on the user’s own cloud, locally, or in-memory. All computations run client-side, which enables users to support lightweight production apps in seconds. Unlike ChromaDB, Deep Lake’s data format can store raw data such as images, videos, and text, in addition to embeddings. ChromaDB is limited to light metadata on top of the embeddings and has no visualization. Deep Lake datasets can be visualized and version controlled. Deep Lake also has a performant dataloader for fine-tuning your Large Language Models.

Deep Lake vs Pinecone

Both Deep Lake and Pinecone enable users to store and search vectors (embeddings) and offer integrations with LangChain and LlamaIndex. However, they are architecturally very different. Pinecone is a fully-managed Vector Database that is optimized for highly demanding applications requiring a search for billions of vectors. Deep Lake is serverless. All computations run client-side, which enables users to get started in seconds. Unlike Pinecone, Deep Lake’s data format can store raw data such as images, videos, and text, in addition to embeddings. Deep Lake datasets can be visualized and version controlled. Pinecone is limited to light metadata on top of the embeddings and has no visualization. Deep Lake also has a performant dataloader for fine-tuning your Large Language Models.

Deep Lake vs Weaviate

Both Deep Lake and Weaviate enable users to store and search vectors (embeddings) and offer integrations with LangChain and LlamaIndex. However, they are architecturally very different. Weaviate is a Vector Database that can be deployed in a managed service or by the user via Kubernetes or Docker. Deep Lake is serverless. All computations run client-side, which enables users to support lightweight production apps in seconds. Unlike Weaviate, Deep Lake’s data format can store raw data such as images, videos, and text, in addition to embeddings. Deep Lake datasets can be visualized and version controlled. Weaviate is limited to light metadata on top of the embeddings and has no visualization. Deep Lake also has a performant dataloader for fine-tuning your Large Language Models.

Deep Lake vs DVC

Deep Lake and DVC offer dataset version control similar to git for data, but their methods for storing data differ significantly. Deep Lake converts and stores data as chunked compressed arrays, which enables rapid streaming to ML models, whereas DVC operates on top of data stored in less efficient traditional file structures. The Deep Lake format makes dataset versioning significantly easier than the traditional file structures used by DVC when datasets are composed of many files (i.e., many images). An additional distinction is that DVC primarily uses a command-line interface, whereas Deep Lake is a Python package. Lastly, Deep Lake offers an API to easily connect datasets to ML frameworks and other common ML tools and enables instant dataset visualization through Activeloop's visualization tool.

Deep Lake vs MosaicML MDS format
  • Data Storage Format: Deep Lake operates on a columnar storage format, whereas MDS utilizes a row-wise storage approach. This fundamentally impacts how data is read, written, and organized in each system.
  • Compression: Deep Lake offers a more flexible compression scheme, allowing control over both chunk-level and sample-level compression for each column or tensor. This feature eliminates the need for additional compressions like zstd, which would otherwise demand more CPU cycles for decompressing on top of formats like jpeg.
  • Shuffling: MDS currently offers more advanced shuffling strategies.
  • Version Control & Visualization Support: A notable feature of Deep Lake is its native version control and in-browser data visualization, which are not present in the MosaicML data format. This can provide significant advantages in managing, understanding, and tracking different versions of the data.
Deep Lake vs TensorFlow Datasets (TFDS)

Deep Lake and TFDS seamlessly connect popular datasets to ML frameworks. Deep Lake datasets are compatible with both PyTorch and TensorFlow, whereas TFDS datasets are only compatible with TensorFlow. A key difference between Deep Lake and TFDS is that Deep Lake datasets are designed for streaming from the cloud, whereas TFDS must be downloaded locally prior to use. As a result, with Deep Lake, one can import datasets directly from TensorFlow Datasets and stream them either to PyTorch or TensorFlow. In addition to providing access to popular publicly available datasets, Deep Lake also offers powerful tools for creating custom datasets, storing them on a variety of cloud storage providers, and collaborating with others via a simple API. TFDS is primarily focused on giving the public easy access to commonly available datasets, and management of custom datasets is not its primary focus. A full comparison article can be found here.

Deep Lake vs HuggingFace

Deep Lake and HuggingFace offer access to popular datasets, but Deep Lake primarily focuses on computer vision, whereas HuggingFace focuses on natural language processing. Hugging Face Transformers and other computational tools for NLP are not analogous to features offered by Deep Lake.

Deep Lake vs WebDatasets

Deep Lake and WebDatasets both offer rapid data streaming across networks. They have nearly identical streaming speeds because the underlying network requests and data structures are very similar. However, Deep Lake offers superior random access and shuffling, its simple API is in Python rather than on the command line, and Deep Lake enables simple indexing and modification of the dataset without having to recreate it.

Deep Lake vs Zarr

Deep Lake and Zarr both offer storage of data as chunked arrays. However, Deep Lake is primarily designed for returning data as arrays using a simple API, rather than actually storing raw arrays (even though that's also possible). Deep Lake stores data in use-case-optimized formats, such as jpeg or png for images, or mp4 for video, and the user treats the data as if it's an array, because Deep Lake handles all the data processing in between. Deep Lake offers more flexibility for storing arrays with dynamic shape (ragged tensors), and it provides several features that are not natively available in Zarr, such as version control, data streaming, and connecting data to ML frameworks.

Community

Join our Slack community to learn more about unstructured dataset management using Deep Lake and to get help from the Activeloop team and other users.

We'd love your feedback; please complete our 3-minute survey.

As always, thanks to our amazing contributors!

Made with contributors-img.

Please read CONTRIBUTING.md to get started with making contributions to Deep Lake.

README Badge

Using Deep Lake? Add a README badge to let everyone know:

deeplake

[![deeplake](https://img.shields.io/badge/powered%20by-Deep%20Lake%20-ff5a1f.svg)](https://github.com/activeloopai/deeplake)

Disclaimers

Dataset Licenses

Deep Lake users may have access to a variety of publicly available datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have a license to use the datasets. It is your responsibility to determine whether you have permission to use the datasets under their license.

If you're a dataset owner and do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thank you for your contribution to the ML community!

Usage Tracking

By default, we collect usage data using Bugout (here's the code that does it). It does not collect user data other than anonymized IP address data, and it only logs the Deep Lake library's own actions. This helps our team understand how the tool is used and how to build features that matter to you! After you register with Activeloop, data is no longer anonymous. You can always opt out of reporting by setting the environment variable BUGGER_OFF to True:
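
A minimal sketch of opting out (assuming the variable is read from the environment when deeplake runs; you can equally set it in your shell before launching the process):

import os

os.environ["BUGGER_OFF"] = "True"   # disable usage reporting; set before using deeplake

import deeplake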

Citation

If you use Deep Lake in your research, please cite Activeloop using:

@inproceedings{deeplake,
  title = {Deep Lake: a Lakehouse for Deep Learning},
  author = {Hambardzumyan, Sasun and Tuli, Abhinav and Ghukasyan, Levon and Rahman, Fariz and Topchyan, Hrant and Isayan, David and Harutyunyan, Mikayel and Hakobyan, Tatevik and Stranic, Ivo and Buniatyan, Davit},
  url = {https://www.cidrdb.org/cidr2023/papers/p69-buniatyan.pdf},
  booktitle = {Proceedings of CIDR},
  year = {2023},
}

Acknowledgment

This technology was inspired by our research work at Princeton University. We would like to thank William Silversmith @SeungLab for his awesome cloud-volume tool.

deeplake's People

Contributors

abhinavtuli, activesoull, adolkhan, artgish, as-engineer, benchislett, davidbuniat, dependabot-preview[bot], dhiganthrao, diveafall, edogrigqv2, farizrahman4u, fayazrahman, haiyangdeperci, imshashank, istranic, khustup, khustup2, kristinagrig06, levongh, mikayelh, mynameisvinn, nvoxland, nvoxland-al, progerdav, sounakr, sparkingdark, tatevikh, thisiseshan, verbose-void


deeplake's Issues

Fixes in to_tensorflow method

Observed a couple of problems while converting stored datasets to TensorFlow format that need some small fixes.

to_tensorflow fails when the meta information for a tensor includes dtype="object" (the "object" dtype has been used for images, area, id, and bbox in the Coco dataset - https://github.com/activeloopai/Hub/blob/master/examples/coco/upload_coco2017.py#L24).
A fix for this is to keep the dtype as "uint8" or something similar while uploading. The Coco example needs to be updated to reflect this.
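
For example, a minimal sketch of the suggested workaround, reusing the from_tensors/store pattern shown elsewhere in these issues (shapes are illustrative; this assumes tensor.from_array preserves the NumPy dtype in the meta information):

import numpy as np
from hub import tensor, dataset

# Cast to uint8 up front so the stored dtype is not "object"
images = tensor.from_array(np.zeros((4, 512, 512), dtype="uint8"))
labels = tensor.from_array(np.zeros((4,), dtype="uint8"))

ds = dataset.from_tensors({"images": images, "labels": labels})
ds.store("username/dataset")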

to_tensorflow also fails when it gets shape=(1,) in the meta while the actual object has multiple dimensions, for example an image.
This can be fixed by commenting out this line https://github.com/activeloopai/Hub/blob/master/hub/collections/dataset/core.py#L633, which will set output_shapes to None by default.

to_pytorch works fine in both the above cases.

Dataset Caching

Describe the feature

Zarr caching is per-array, not shared. Please come up with shared caching. Once the dataset is uploaded, we need our shared storage to have options to write to the array. The Dataset will have a .commit() function; once the cache signals that the dataset is ready, call .commit().

Additional notes

PR to release/v1.0

Merging all together

Describe the feature

  • Combine all PRs together
  • Finalize the user API
  • Test backend
  • Add documentation

Notes

PR to release/v1.0

s3 access via IAM-role doesn't seem to work

Trying to connect to s3 without explicit credentials by running from EC2 with an IAM role that allows access:

datahub = hub.s3('my-bucket').connect()

(boto seems to support this mode: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#iam-role)

but I get access denied:

Traceback (most recent call last):
  File "hub_load_s3.py", line 13, in <module>
    imagenet = datahub.open('imagenet/test:latest')
  File "/usr/local/lib/python3.6/dist-packages/hub/bucket.py", line 41, in open
    jsontext = self._storage.get_or_none(os.path.join(name, "info.json"))
  File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/local/lib/python3.6/dist-packages/hub/storage/retry_wrapper.py", line 30, in get_or_none
    return self._internal.get_or_none(path)
  File "/usr/local/lib/python3.6/dist-packages/hub/storage/s3.py", line 55, in get_or_none
    raise err
  File "/usr/local/lib/python3.6/dist-packages/hub/storage/s3.py", line 48, in get_or_none
    Key=path,
  File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 661, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied

I do manage to access s3 from this machine just fine with aws-cli without configured credentials.

Is this mode of s3 authentication supported?

Add VIRAT Video Dataset

Describe the dataset

Add the VIRAT dataset to Hub, so that the following works:

import hub
ds = hub.load("username/VIRAT")

Here's a tutorial for uploading datasets using Hub.

Concerns related to provided guidelines.

Cannot find the Uploading MNIST, Uploading CIFAR, and Uploading COCO URLs provided in the Guidelines. It seems that they do not exist.

Creation of the dataset seems to be successful:

from hub import tensor, dataset
import numpy as np

images = tensor.from_array(np.zeros((4, 512, 512)))
labels = tensor.from_array(np.zeros((4, 512, 512)))

ds = dataset.from_tensors({"images": images, "labels": labels})
ds.store("username/dataset")  # Upload

but I cannot see the uploaded dataset at https://app.activeloop.ai/datasets.

The subsequent issues stem from the lack of information mentioned above.

Add code test coverage

Describe Task

Add code test coverage for the repository.

Notes

Feel free to use any online resource and connect it with CircleCI.

Add Barcelona Dataset

Describe the dataset

Add the Barcelona dataset to Hub, so that the following works:

import hub
ds = hub.load("username/barcelona")

Here's a tutorial for uploading datasets using Hub.

Add Pascal Dataset

Describe the dataset

Add the Pascal dataset to Hub, so that the following works:

import hub
ds = hub.load("username/pascal")

Here's a tutorial for uploading datasets using Hub.

Same values in Dataset

I have a Dataset of logs which is defined like this:

logs = Dataset(dtype={"train_acc": float, "train_loss": float, "val_acc": float, "val_loss": float},
                         shape=(epochs,), url='./logs', mode='w')

I also have some average metrics stored in a dict, i.e.

metrics =  {'val_loss': AverageValue,   'val_acc': AverageValue, 'train_loss': .......}
metrics['val_loss'].avg   # tensor(1.2748, device='cuda:0')
metrics['val_acc'].avg    # tensor(0.5000, device='cuda:0')

To store those metrics in logs, I run:

for key, value in self.meters.items():
    self.logs[key][value.count - 1] = value.avg  #value.count is an index of a value starting from 1

But when I run

logs['val_acc'][0].numpy()
logs['val_loss'][0].numpy()
logs['train_acc'][0].numpy()
logs['train_loss'][0].numpy()

all these values are equal.

optional dependency torch failing on import hub

trying to use the package fails:

>>> import hub
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ram/.pyenv/versions/3.8.2/lib/python3.8/site-packages/hub/__init__.py", line 2, in <module>
    from .creds import Base as Creds
  File "/Users/ram/.pyenv/versions/3.8.2/lib/python3.8/site-packages/hub/creds.py", line 4, in <module>
    from .bucket import Bucket
  File "/Users/ram/.pyenv/versions/3.8.2/lib/python3.8/site-packages/hub/bucket.py", line 7, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'

This is a bad first experience, really. Is there a way to use hub without the torch import failing on me?
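
One common way to handle this kind of optional dependency is a guarded import. A hypothetical sketch of what hub/bucket.py could do (this is not the project's actual fix, and to_pytorch here is only a stand-in for whatever code genuinely needs torch):

# Treat torch as an optional dependency instead of importing it at module load time
try:
    import torch
except ImportError:
    torch = None

def to_pytorch(dataset):
    if torch is None:
        raise ImportError(
            "PyTorch is required for to_pytorch(); install it with `pip install torch`."
        )
    ...  # build and return the torch-backed view of the dataset here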

Sliced views of datasets

Describe Feature

Implement virtual datasets

  1. Get and set a dataset from the subview of the dataset.

Additional Notes

PR to release/v1.0

Support TFDS datasets

Describe your feature request

Create a converter that takes any TensorFlow Datasets (TFDS) dataset and converts it into the hub format.

from hub import datasets
import tensorflow_datasets as tfds

# use a distinct variable name so the tfds module is not shadowed
tf_ds = tfds.load('mnist', split='train', shuffle_files=True)
ds = datasets.from_tensorflow(tf_ds)
ds = ds.store("/tmp/mnist")
hub_tf_ds = ds.to_tensorflow()

# assert hub_tf_ds == tf_ds <- sample from both datasets and check if they are the same
hub_py_ds = ds.to_pytorch()

# assert hub_tf_ds == hub_py_ds <- sample from both datasets and check if they are the same

A more advanced test would require running the conversion for each dataset in parallel:

for name in tfds.list_builders():
    tf_ds = tfds.load(name, ...)
    ds = datasets.from_tensorflow(tf_ds)
    ds = ds.store(f"/tmp/{name}")

Notes

  • General advice: start with simple, small datasets (low-hanging fruit), commit often, maybe with mid-PRs to master, then steadily generalize your converter.
  • TFDS has FeatureDict for describing data archetypes; we need to rely on them.
  • This task would require significantly extending how Hub handles different data and data types (dtags).
  • At every step, think about how uploading a dataset could be simplified from the user's perspective.
  • While converting datasets, take compression into consideration.

Please follow Contribution guidelines and look into uploading dataset guidelines for more details about meta information.

Create a tutorial on Colab

Create a tutorial on Colab

Users should be able to load a dataset, train a model, and upload the dataset. Feel free to start from a small example and then make the example comprehensive.

Add CI/CD

Describe your feature request

Please add CI/CD for open source development

  • Will automatically run tests with
    • Embedded AWS S3 credentials connection tests
    • Embedded GCS credentials tests
    • PyTorch/Tensorflow tests
  • Move docs here from dataflow and automatically deploy docs from here
  • On tag, build a package and deploy it to PyPI
  • Add a CI/CD badge to the README
  • Add a test coverage badge to the README

Kinetics-700

Describe the dataset

Add the Kinetics-700 dataset to Hub, so that the following works:

import hub
ds = hub.load("username/kinetics-700")

Here's a tutorial for uploading datasets using Hub.

Create a Colab demo

Create a tutorial on Colab

Users should be able to load a dataset, train a model, and upload the dataset. Feel free to start from a small example and then make the example comprehensive.

Serialization of data types

Describe your feature

Implement the Serialization of Dataset Structure. Then implement Tensor derivatives such as Image, ClassLabel, Mask, Segmentation, Polygon, Bounding Box, Tabular, and other TFDS Features.

There are two subtasks

  1. Serialize and Deserialize the metadata into meta.json (see the sketch after this list)
  2. Implement Tensor Derivatives
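
A hypothetical sketch of subtask 1 (the field names and layout below are illustrative only, not the project's actual meta.json schema):

import json

# Serialize a hypothetical dataset structure to meta.json ...
meta = {
    "tensors": {
        "image": {"shape": [None, 512, 512, 3], "dtype": "uint8", "dtag": "image"},
        "label": {"shape": [None], "dtype": "int64", "dtag": "class_label"},
    }
}
with open("meta.json", "w") as f:
    json.dump(meta, f, indent=2)

# ... and deserialize it back
with open("meta.json") as f:
    restored = json.load(f)
assert restored["tensors"]["image"]["dtype"] == "uint8"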

Additional Notes

Have a call with @edogrigqv2 to get started.
Please open a pull request to release/v1.0 and ask for a review.

From PyTorch dataset to Hub format

Describe your feature request

Create a converter that takes any PyTorch dataset and converts it into a hub format.

A simple test would be

from hub import datasets
import torch
import torchvision

# 'path' is a placeholder for the local ImageNet root; torchvision's ImageNet takes a 'transform' keyword
imagenet = torchvision.datasets.ImageNet(path, split='train', transform=...)
ds = datasets.from_pytorch(imagenet)
ds = ds.store("/tmp/imagenet")
ds = ds.to_pytorch()

Please follow Contribution guidelines and look into uploading dataset guidelines for more details about meta information.

Errors in Dataset docs

I came upon the following errors while referencing the docs:

  1. Under the Guidelines sub-heading, there is an enumeration error. Proposed solution: using an indentation of 4, this issue could be solved. To make the page uniform, the same change is applied to every code block.
  2. Under the How to Upload a Dataset sub-heading, the links for the MNIST, CIFAR, and COCO examples are updated.

PermissionException on AWS

Facing issues with ds.store() on AWS, while the same code works properly locally.
Error: hub.exceptions.PermissionException: No permision to store the dataset at s3://snark-hub/public/abhinav/ds

For now, I got it working using sudo rm -rf /tmp/dask-worker-space/.
A proper fix is needed.

class labels access and labels shape

Hi,
If I understand it correctly, the shape of 'labels' which is (frame_count, 11, 400, 7) corresponds to the 7 box values (center coordinates, box dimensions and heading) for a maximum of 400 obstacles in each of the images, lasers_camera_projection and lasers_range_image.
I have two questions here,

  1. Where can I access the class labels? Eg. Pedestrian, Vehicle etc.
  2. Considering the number of images, lasers_camera_projection, and lasers_range_image to be 5 each, should it be 15 instead of 11?

Store and Load Models

Describe Feature

We want to be able to store and load models, similar to datasets. The model has a computational graph and weights. Look into how PyTorch saves and loads models. Check ONNX or TF later.

from hub import model

resnet = model.load("username/resnet")
model.store("username/resnet2", resnet)

Notes

  1. Start from Pytorch
  2. Implement TensorFlow
  3. Include ONNX and other types.

Dynamic Shape Handling

Describe Feature details

Implement dynamic shapes for tensors with corresponding chunking:

  1. Automatic size extension/expansion.
  2. Boundary checking

Additional notes

PR to release/v1.0

Datatypes to Zarr files

Describe the feature

Zarr needs max_shape, url, and other parameters. We need a component that parses the hierarchy of datatypes and lets the dataset create the Zarr arrays, similar to StorageTensor and DynamicTensor (it's an API). This is implemented in _flatten; we need to make it more robust.

Solution

  • Decide on API
  • Add max_shape
  • Decide on how to chunk automatically

Available Datasets

I want to thank you first for the great work.
My question is: are you planning to provide pre-loaded datasets like the ImageNet example?
I tried to load the nuScenes dataset, but it is not working.

hub.array assignment not as robust as np.array

hub_array[0, :,:,:] = image
raises a MemoryError: unable to allocate 112GiB with shape ...

while
hub_array[0] = image
works fine as expected.

With np.array, both versions work the same way.

It took us several hours to debug the issue, and in the end it was just a slight incompatibility with np.array. It would be nice to support the first assignment in case someone else tries it expecting np.array behavior.
