PyEmits, a python package for easy manipulation in time-series data.

Last update: Sep 23, 2022

Related tags

Data Analysis PyEmits

Overview

PyEmits, a python package for easy manipulation in time-series data. Time-series data is very common in real life.

Engineering
FSI industry (Financial Services Industry)
FMCG (Fast Moving Consumer Good)

Data scientist's work consists of:

forecasting
prediction/simulation
data prepration
cleansing
anomaly detection
descriptive data analysis/exploratory data analysis

each new business unit shall build the following wheels again and again

data pipeline
1. extraction
2. transformation
  1. cleansing
  2. feature engineering
  3. remove outliers
  4. AI landing for prediction, forecasting
3. write it back to database
ml framework
1. multiple model training
2. multiple model prediction
3. kfold validation
4. anomaly detection
5. forecasting
6. deep learning model in easy way
7. ensemble modelling
exploratory data analysis
1. descriptive data analysis
2. ...

That's why I create this project, also for fun. haha

This project is under active development, free to use (Apache 2.0) I am happy to see anyone can contribute for more advancement on features

Install

pip install pyemits

Features highlight

Easy training

import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel

X = np.random.randint(1, 100, size=(1000, 10))
y = np.random.randint(1, 100, size=(1000, 1))

raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer(['XGBoost'], [None], raw_data_model)
trainer.fit()

Accept neural network as model

import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper

X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))

keras_lstm_model = KerasWrapper.from_simple_lstm_model((10, 10), 4)
raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model], [None], raw_data_model)
trainer.fit()

also keep flexibility on customized model

import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper

X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))

from keras.layers import Dense, Dropout, LSTM
from keras import Sequential

model = Sequential()
model.add(LSTM(128,
               activation='softmax',
               input_shape=(10, 10),
               ))
model.add(Dropout(0.1))
model.add(Dense(4))
model.compile(loss='mse', optimizer='adam', metrics=['mse'])

keras_lstm_model = KerasWrapper(model, nickname='LSTM')
raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model], [None], raw_data_model)
trainer.fit()

or attach it in algo config

import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper
from pyemits.common.config_model import KerasSequentialConfig

X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))

from keras.layers import Dense, Dropout, LSTM
from keras import Sequential

keras_lstm_model = KerasWrapper(nickname='LSTM')
config = KerasSequentialConfig(layer=[LSTM(128,
                                           activation='softmax',
                                           input_shape=(10, 10),
                                           ),
                                      Dropout(0.1),
                                      Dense(4)],
                               compile=dict(loss='mse', optimizer='adam', metrics=['mse']))

raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model],
                     [config],
                     raw_data_model, 
                     {'fit_config' : [dict(epochs=10, batch_size=32)]})
trainer.fit()

PyTorch, MXNet under development you can leave me a message if you want to contribute

MultiOutput training

import numpy as np 

from pyemits.core.ml.regression.trainer import RegressionDataModel, MultiOutputRegTrainer
from pyemits.core.preprocessing.splitting import SlidingWindowSplitter

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

# when use auto-regressive like MultiOutput, pls set ravel = True
# ravel = False, when you are using LSTM which support multiple dimension
splitter = SlidingWindowSplitter(24,24,ravel=True)
X, y = splitter.split(X, y)

raw_data_model = RegressionDataModel(X,y)
trainer = MultiOutputRegTrainer(['XGBoost'], [None], raw_data_model)
trainer.fit()

Parallel training
- provide fast training using parallel job
- use RegTrainer as base, but add Parallel running

import numpy as np 

from pyemits.core.ml.regression.trainer import RegressionDataModel, ParallelRegTrainer

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X,y)
trainer = ParallelRegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()

or you can use RegTrainer for multiple model, but it is not in Parallel job

import numpy as np 

from pyemits.core.ml.regression.trainer import RegressionDataModel,  RegTrainer

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X,y)
trainer = RegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()

KFold training
- KFoldConfig is global config, will apply to all

import numpy as np 

from pyemits.core.ml.regression.trainer import RegressionDataModel,  KFoldCVTrainer
from pyemits.common.config_model import KFoldConfig

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X,y)
trainer = KFoldCVTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model, {'kfold_config':KFoldConfig(n_splits=10)})
trainer.fit()

Easy prediction

import numpy as np 
from pyemits.core.ml.regression.trainer import RegressionDataModel,  RegTrainer
from pyemits.core.ml.regression.predictor import RegPredictor

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X,y)
trainer = RegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()

predictor = RegPredictor(trainer.clf_models, 'RegTrainer')
predictor.predict(RegressionDataModel(X))

Forecast at scale
- see examples: forecast at scale.ipynb
Data Model

from pyemits.common.data_model import RegressionDataModel
import numpy as np
X = np.random.randint(1, 100, size=(1000,10,10))
y = np.random.randint(1, 100, size=(1000, 1))

data_model = RegressionDataModel(X, y)

data_model._update_variable('X_shape', (1000,10,10))
data_model.X_shape

data_model.add_meta_data('X_shape', (1000,10,10))
data_model.meta_data

Anomaly detection (under development)
- see module: anomaly detection
- Kalman filter
Evaluation (under development)
- see module: evaluation
- backtesting
- model evaluation
Ensemble (under development)
- blending
- stacking
- voting
- by combo package
  - moa
  - aom
  - average
  - median
  - maximization
IO
- db connection
- local
dashboard ???
other miscellaneous feature
- continuous evaluation
- aggregation
- dimensional reduction
- data profile (intensive data overview)
to be confirmed

References

the following libraries gave me some idea/insight

greykit
1. changepoint detection
2. model summary
3. seaonality
pytorch-forecasting
darts
pyaf
orbit
kats/prophets by facebook
sktime
gluon ts
tslearn
pyts
luminaries
tods
autots
pyodds
scikit-hts

Python package to transfer data in a fast, reliable, and packetized form.

pySerialTransfer Python package to transfer data in a fast, reliable, and packetized form.

101 Dec 7, 2022

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

3.7k Jan 3, 2023

Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

Data lineage made simple, reliable, and automated. Effortlessly track the flow of data, understand dependencies and analyze impact. Features Visualiza

898 Jan 9, 2023

A powerful data analysis package based on mathematical step functions. Strongly aligned with pandas.

The leading use-case for the staircase package is for the creation and analysis of step functions. Pretty exciting huh. But don't hit the close button

48 Dec 21, 2022

small package with utility functions for analyzing (fly) calcium imaging data

fly2p Tools for analyzing two-photon (2p) imaging data collected with Vidrio Scanimage software and micromanger. Loading scanimage data relies on scan

3 Dec 14, 2022

Integrate bus data from a variety of sources (batch processing and real time processing).

Purpose: This is integrate bus data from a variety of sources such as: csv, json api, sensor data ... into Relational Database (batch processing and r

1 Nov 25, 2021

A real-time financial data streaming pipeline and visualization platform using Apache Kafka, Cassandra, and Bokeh.

Realtime Financial Market Data Visualization and Analysis Introduction This repo shows my project about real-time stock data pipeline. All the code is

6 Sep 7, 2022

Fast, flexible and easy to use probabilistic modelling in Python.

Please consider citing the JMLR-MLOSS Manuscript if you've used pomegranate in your academic work! pomegranate is a package for building probabilistic

3k Jan 2, 2023

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

AWS Data Wrangler Pandas on AWS Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretMana

3.3k Jan 4, 2023

PyEmits, a python package for easy manipulation in time-series data.

Related tags

Overview

Install

Features highlight

References

You might also like...

Python package to transfer data in a fast, reliable, and packetized form.

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

A powerful data analysis package based on mathematical step functions. Strongly aligned with pandas.

small package with utility functions for analyzing (fly) calcium imaging data

Integrate bus data from a variety of sources (batch processing and real time processing).

A real-time financial data streaming pipeline and visualization platform using Apache Kafka, Cassandra, and Bokeh.

Fast, flexible and easy to use probabilistic modelling in Python.

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

Releases(v0.1.2)

v0.1.2(Dec 26, 2021)

Owner

Thompson

Calculate multilateral price indices in Python (with Pandas and PySpark).

Predictive Modeling & Analytics on Home Equity Line of Credit

Very basic but functional Kakuro solver written in Python.

Reading streams of Twitter data, save them to Kafka, then process with Kafka Stream API and Spark Streaming

Example Of Splunk Search Query With Python And Splunk Python SDK

Visions provides an extensible suite of tools to support common data analysis operations

MapReader: A computer vision pipeline for the semantic exploration of maps at scale

pandas: powerful Python data analysis toolkit

Python script to automate the plotting and analysis of percentage depth dose and dose profile simulations in TOPAS.

Detailed analysis on fraud claims in insurance companies, gives you information as to why huge loss take place in insurance companies

Retentioneering: product analytics, data-driven customer journey map optimization, marketing analytics, web analytics, transaction analytics, graph visualization, and behavioral segmentation with customer segments in Python.

Dbt-core - dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

Import, connect and transform data into Excel

OpenARB is an open source program aiming to emulate a free market while encouraging players to participate in arbitrage in order to increase working capital.

ETL flow framework based on Yaml configs in Python

Galvanalyser is a system for automatically storing data generated by battery cycling machines in a database

Data pipelines built with polars

Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks

Techdegree Data Analysis Project 2

Stream-Kafka-ELK-Stack - Weather data streaming using Apache Kafka and Elastic Stack.