GithubHelp home page GithubHelp logo

bytehub-ai / bytehub Goto Github PK

View Code? Open in Web Editor NEW
58.0 3.0 4.0 372 KB

ByteHub: making feature stores simple

License: GNU General Public License v3.0

Python 100.00%
data-science data-engineering machine-learning feature-engineering dask pandas timeseries bytehub-cloud forecasting feature-store

bytehub's Introduction

ByteHub PyPI Latest Release Issues Issues Code style: black

ByteHub logo

An easy-to-use feature store.

๐Ÿ’พ What is a feature store?

A feature store is a data storage system for data science and machine-learning. It can store raw data and also transformed features, which can be fed straight into an ML model or training script.

Feature stores allow data scientists and engineers to be more productive by organising the flow of data into models.

The Bytehub Feature Store is designed to:

  • Be simple to use, with a Pandas-like API;
  • Require no complicated infrastructure, running on a local Python installation or in a cloud environment;
  • Be optimised towards timeseries operations, making it highly suited to applications such as those in finance, energy, forecasting; and
  • Support simple time/value data as well as complex structures, e.g. dictionaries.

It is built on Dask to support large datasets and cluster compute environments.

๐Ÿฆ‰ Features

  • Searchable feature information and metadata can be stored locally using SQLite or in a remote database.
  • Timeseries data is saved in Parquet format using Dask, making it readable from a wide range of other tools. Data can reside either on a local filesystem or in a cloud storage service, e.g. AWS S3.
  • Supports timeseries joins, along with filtering and resampling operations to make it easy to load and prepare datasets for ML training.
  • Feature engineering steps can be implemented as transforms. These are saved within the feature store, and allows for simple, resusable preparation of raw data.
  • Time travel can retrieve feature values based on when they were created, which can be useful for forecasting applications.
  • Simple APIs to retrieve timeseries dataframes for training, or a dictionary of the most recent feature values, which can be used for inference.

Also available as โ˜๏ธ ByteHub Cloud: a ready-to-use, cloud-hosted feature store.

๐Ÿ“– Documentation and tutorials

See the ByteHub documentation and notebook tutorials to learn more and get started.

๐Ÿš€ Quick-start

Install using pip:

pip install bytehub

Create a local SQLite feature store by running:

import bytehub as bh
import pandas as pd

fs = bh.FeatureStore()

Data lives inside namespaces within each feature store. They can be used to separate projects or environments. Create a namespace as follows:

fs.create_namespace(
    'tutorial', url='/tmp/featurestore/tutorial', description='Tutorial datasets'
)

Create a feature inside this namespace which will be used to store a timeseries of pre-prepared data:

fs.create_feature('tutorial/numbers', description='Timeseries of numbers')

Now save some data into the feature store:

dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'value': list(range(len(dts)))})

fs.save_dataframe(df, 'tutorial/numbers')

The data is now stored, ready to be transformed, resampled, merged with other data, and fed to machine-learning models.

We can engineer new features from existing ones using the transform decorator. Suppose we want to define a new feature that contains the squared values of tutorial/numbers:

@fs.transform('tutorial/squared', from_features=['tutorial/numbers'])
def squared_numbers(df):
    # This transform function receives dataframe input, and defines a transform operation
    return df ** 2 # Square the input

Now both features are saved in the feature store, and can be queried using:

df_query = fs.load_dataframe(
    ['tutorial/numbers', 'tutorial/squared'],
    from_date='2021-01-01', to_date='2021-01-31'
)

To connect to ByteHub Cloud, first register for an account, then use:

fs = bh.FeatureStore("https://api.bytehub.ai")

This will allow you to store features in your own private namespace on ByteHub Cloud, and save datasets to an AWS S3 storage bucket.

๐Ÿพ Roadmap

  • Tasks to automate updates to features using orchestration tools like Airflow

bytehub's People

Contributors

github-actions[bot] avatar toby-coleman avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

bytehub's Issues

Frequently append data or row to feature dataframes indexed on time

Hi,

I have a use case where time series dataframe needs to be appended with the latest values every 5 minutes, is there a method or an elegant way of appending time series value/row to a feature dataframe very few minutes. One way would be to load the entire dataframe from the featurestore everytime in memory and then append a row and then rewrite the entire dataframe. But I was wondering if there is a method that just appends it or sort of updates it ?

Error when using google cloud storage as a backend

Hi, I am trying to get my setup working on google cloud, when i try saving the dataframe to cloud storage, using fs.save_datframe I run into an error like the one shown below,

_call non-retriable exception: Disallowed unicode characters present in object name ''tutorial/feature/append-dataframe/partition=npartitions=1

I confirmed my google cloud storage saving functionality for writing dataframe using . to_parquest and supplying the google cloud storage path gs://{bucketname}/{foldername}, which seems to work as expected.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.