An easy-to-use feature store

Last update: Dec 09, 2022

Overview

ByteHub

An easy-to-use feature store.

💾 What is a feature store?

A feature store is a data storage system for data science and machine learning. It can store both raw data and transformed features, which can be fed straight into an ML model or training script.

Feature stores allow data scientists and engineers to be more productive by organising the flow of data into models.

The ByteHub Feature Store is designed to:

Be simple to use, with a Pandas-like API;
Require no complicated infrastructure, running on a local Python installation or in a cloud environment;
Be optimised for timeseries operations, making it well suited to applications in areas such as finance, energy, and forecasting; and
Support simple time/value data as well as complex structures, e.g. dictionaries.

It is built on Dask to support large datasets and cluster compute environments.

🦉 Features

Searchable feature information and metadata can be stored locally using SQLite or in a remote database.
Timeseries data is saved in Parquet format using Dask, making it readable from a wide range of other tools. Data can reside either on a local filesystem or in a cloud storage service, e.g. AWS S3.
Supports timeseries joins, along with filtering and resampling operations to make it easy to load and prepare datasets for ML training.
Feature engineering steps can be implemented as transforms. These are saved within the feature store, and allow for simple, reusable preparation of raw data.
Time travel can retrieve feature values based on when they were created, which can be useful for forecasting applications.
Simple APIs retrieve timeseries dataframes for training, or a dictionary of the most recent feature values for inference.
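The time travel idea in the list above can be illustrated without the feature store itself. The following is a minimal, pure-Python sketch of the concept only: the record layout, the `as_known_at` helper, and the sample values are all hypothetical and are not ByteHub's storage schema or API.

```python
from datetime import datetime

# Each record is (time the value refers to, time the value was created, value).
# This layout is an illustrative assumption, not ByteHub's internal format.
records = [
    (datetime(2021, 1, 3), datetime(2021, 1, 1), 10.0),  # first estimate
    (datetime(2021, 1, 3), datetime(2021, 1, 2), 12.0),  # revised a day later
    (datetime(2021, 1, 4), datetime(2021, 1, 2), 15.0),
]


def as_known_at(records, cutoff):
    """Return {time: value} using only records created on or before cutoff."""
    latest = {}
    # Process in creation order so later revisions overwrite earlier ones.
    for time, created, value in sorted(records, key=lambda r: r[1]):
        if created <= cutoff:
            latest[time] = value
    return latest


snapshot_jan1 = as_known_at(records, datetime(2021, 1, 1))
snapshot_jan2 = as_known_at(records, datetime(2021, 1, 2))
```

Here `snapshot_jan1` only sees the first estimate, while `snapshot_jan2` sees the revision. This is why time travel matters for forecasting: a model can be trained on the values that were actually available at prediction time, rather than on later corrections.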

Also available as ☁️ ByteHub Cloud: a ready-to-use, cloud-hosted feature store.

📖 Documentation and tutorials

See the ByteHub documentation and notebook tutorials to learn more and get started.

🚀 Quick-start

Install using pip:

pip install bytehub

Create a local SQLite feature store by running:

import bytehub as bh
import pandas as pd

fs = bh.FeatureStore()

Data lives inside namespaces within each feature store. They can be used to separate projects or environments. Create a namespace as follows:

fs.create_namespace(
    'tutorial', url='/tmp/featurestore/tutorial', description='Tutorial datasets'
)

Create a feature inside this namespace which will be used to store a timeseries of pre-prepared data:

fs.create_feature('tutorial/numbers', description='Timeseries of numbers')

Now save some data into the feature store:

dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'value': list(range(len(dts)))})

fs.save_dataframe(df, 'tutorial/numbers')

The data is now stored, ready to be transformed, resampled, merged with other data, and fed to machine-learning models.
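To make "resampled" concrete, here is a sketch in plain pandas of what resampling the same daily series to weekly means would look like. It does not use the feature store at all, and ByteHub's own resampling options are not shown in this quick-start, so treat it as an illustration of the operation rather than of ByteHub's interface.

```python
import pandas as pd

# Rebuild the same daily series that was saved above.
dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'value': list(range(len(dts)))})

# Index by timestamp, then average the values within each calendar week
# (pandas' 'W' frequency uses weeks ending on Sunday).
weekly = df.set_index('time').resample('W').mean()
```

Each row of `weekly` is labelled with the Sunday that closes the week and holds the mean of the daily values falling in that week.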

We can engineer new features from existing ones using the transform decorator. Suppose we want to define a new feature that contains the squared values of tutorial/numbers:

@fs.transform('tutorial/squared', from_features=['tutorial/numbers'])
def squared_numbers(df):
    # This transform function receives dataframe input, and defines a transform operation
    return df ** 2 # Square the input
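Because the body of the transform is ordinary pandas code, it can be checked in isolation before registering it. The sketch below applies an undecorated copy of the function to a small hand-made frame. The shape of the input (timestamps as the index, a single `value` column) is an assumption made for illustration; the exact frame that ByteHub passes to a transform is not specified here.

```python
import pandas as pd


def squared_numbers(df):
    # Same body as the transform above, without the feature store decorator.
    return df ** 2  # Element-wise square of every column


# Hypothetical input: four daily values indexed by time.
sample = pd.DataFrame(
    {'value': [0, 1, 2, 3]},
    index=pd.date_range('2020-01-01', periods=4, name='time'),
)

result = squared_numbers(sample)
```

The result keeps the original timestamps and replaces each value with its square, which is the behaviour the `tutorial/squared` feature is meant to capture.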

Now both features are saved in the feature store, and can be queried using:

df_query = fs.load_dataframe(
    ['tutorial/numbers', 'tutorial/squared'],
    from_date='2021-01-01', to_date='2021-01-31'
)

To connect to ByteHub Cloud, first register for an account, then use:

fs = bh.FeatureStore("https://api.bytehub.ai")

This will allow you to store features in your own private namespace on ByteHub Cloud, and save datasets to an AWS S3 storage bucket.

🐾 Roadmap

Tasks to automate updates to features using orchestration tools like Airflow

Owner

ByteHub AI
