PipeChain is a utility library for creating functional pipelines.

Last update: Aug 07, 2022

PipeChain

Motivation

PipeChain is a utility library for creating functional pipelines. Let's start with a motivating example. We have a list of Australian phone numbers from our users. We need to clean this data before we insert it into the database. With PipeChain, you can do this whole process in one neat pipeline:

from pipechain import PipeChain, PLACEHOLDER as _

nums = [
    "493225813",
    "0491 570 156",
    "55505488",
    "Barry",
    "02 5550 7491",
    "491570156",
    "",
    "1800 975 707"
]

PipeChain(
    nums
).pipe(
    # Remove spaces
    map, lambda x: x.replace(" ", ""), _
).pipe(
    # Remove non-numeric entries
    filter, lambda x: x.isnumeric(), _
).pipe(
    # Add the mobile code to the start of 8-digit numbers
    map, lambda x: "04" + x if len(x) == 8 else x, _
).pipe(
    # Add the 0 to the start of 9-digit numbers
    map, lambda x: "0" + x if len(x) == 9 else x, _
).pipe(
    # Convert to a set to remove duplicates
    set
).eval()

{'0255507491', '0455505488', '0491570156', '0493225813', '1800975707'}

Without PipeChain, we would have to horrifically nest our code, or else use a lot of temporary variables:

set(
    map(
        lambda x: "0" + x if len(x) == 9 else x,
        map(
            lambda x: "04" + x if len(x) == 8 else x,
            filter(
                lambda x: x.isnumeric(),
                map(
                    lambda x: x.replace(" ", ""),
                    nums
                )
            )
        )
    )
)

{'0255507491', '0455505488', '0491570156', '0493225813', '1800975707'}
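For comparison, the temporary-variable alternative mentioned above might look like the following sketch. It uses only built-ins, and the intermediate variable names are illustrative rather than taken from the project:

```python
nums = [
    "493225813",
    "0491 570 156",
    "55505488",
    "Barry",
    "02 5550 7491",
    "491570156",
    "",
    "1800 975 707",
]

# Each step gets its own name, which avoids nesting but adds clutter.
no_spaces = map(lambda x: x.replace(" ", ""), nums)
numeric_only = filter(lambda x: x.isnumeric(), no_spaces)
with_mobile_code = map(lambda x: "04" + x if len(x) == 8 else x, numeric_only)
with_leading_zero = map(lambda x: "0" + x if len(x) == 9 else x, with_mobile_code)
cleaned = set(with_leading_zero)

print(sorted(cleaned))
```

The result is the same set as before, but every intermediate value needs a name that is used exactly once.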

Installation

pip install pipechain

Usage

Basic Usage

PipeChain has only two exports: PipeChain, and PLACEHOLDER.

PipeChain is a class that defines a pipeline. You create an instance of the class, and then call .pipe() to add another function onto the pipeline:

from pipechain import PipeChain, PLACEHOLDER
PipeChain(1).pipe(str)

PipeChain(arg=1, pipes=[functools.partial(<class 'str'>)])

Finally, you call .eval() to run the pipeline and return the result:

PipeChain(1).pipe(str).eval()

'1'

You can "feed" the pipe at either end: during construction (PipeChain("foo")) or during evaluation (.eval("foo")):

PipeChain().pipe(str).eval(1)

'1'

Each call to .pipe() takes a function, and any additional arguments you provide, both positional and keyword, will be forwarded to the function:

PipeChain(["b", "a", "c"]).pipe(sorted, reverse=True).eval()

['c', 'b', 'a']

Argument Position

By default, the previous value is passed as the first positional argument to the function:

PipeChain(2).pipe(pow, 3).eval()

8

The only magic here is that if you use the PLACEHOLDER variable as an argument to .pipe(), then the pipeline will replace it with the output of the previous pipe at runtime:

PipeChain(2).pipe(pow, 3, PLACEHOLDER).eval()

9

Note that you can give PLACEHOLDER a shorter name with an import alias, for example:

from pipechain import PLACEHOLDER as _
PipeChain(2).pipe(pow, 3, _).eval()

9
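To make the argument-position rule concrete, here is a minimal sketch of how a pipeline with placeholder substitution could work. This is not PipeChain's actual implementation: the names MiniPipe and HOLE are hypothetical stand-ins for PipeChain and PLACEHOLDER, and replacing placeholders in keyword arguments is an assumption of this sketch.

```python
class _Placeholder:
    """Sentinel type marking where the previous value should be inserted."""


HOLE = _Placeholder()   # hypothetical stand-in for PLACEHOLDER
_UNSET = object()       # distinguishes "no input given" from None


class MiniPipe:
    """Hypothetical, simplified pipeline; not the library's real class."""

    def __init__(self, arg=_UNSET):
        self.arg = arg
        self.steps = []

    def pipe(self, func, *args, **kwargs):
        # Record the step; nothing runs until eval() is called.
        self.steps.append((func, args, kwargs))
        return self

    def eval(self, arg=_UNSET):
        value = self.arg if arg is _UNSET else arg
        if value is _UNSET:
            raise ValueError("no input was given to the pipeline")
        for func, args, kwargs in self.steps:
            has_hole = any(a is HOLE for a in args) or any(
                v is HOLE for v in kwargs.values()
            )
            if has_hole:
                # Substitute the previous value wherever the placeholder appears.
                args = tuple(value if a is HOLE else a for a in args)
                kwargs = {k: value if v is HOLE else v for k, v in kwargs.items()}
                value = func(*args, **kwargs)
            else:
                # Default: the previous value is the first positional argument.
                value = func(value, *args, **kwargs)
        return value


print(MiniPipe(2).pipe(pow, 3).eval())          # pow(2, 3) -> 8
print(MiniPipe(2).pipe(pow, 3, HOLE).eval())    # pow(3, 2) -> 9
print(MiniPipe().pipe(str).eval(1))             # fed at evaluation time -> '1'
print(MiniPipe(["b", "a", "c"]).pipe(sorted, reverse=True).eval())
```

Using a dedicated sentinel object, rather than a value such as None, means that None can still be passed through the pipeline as ordinary data.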

Methods

It might not seem like methods will play well with this pipe convention, but they are just functions. You can call any object's method as a plain function by looking it up on the object's class and passing the object as the first argument. In the example below, str is the class of "":

"".join(["a", "b", "c"])

'abc'

PipeChain(["a", "b", "c"]).pipe(str.join, "", _).eval()

'abc'
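The equivalence this relies on can be checked with the standard library alone. The following sketch shows that a method looked up on the class is an ordinary function whose first argument is the instance:

```python
letters = ["a", "b", "c"]

# Bound call: the instance "" is supplied implicitly as self.
bound_result = "".join(letters)

# Unbound call: the same method looked up on the class, with "" passed explicitly.
unbound_result = str.join("", letters)

assert bound_result == unbound_result == "abc"

# Because str.upper is a plain function, it can be handed to map() directly.
shouted = list(map(str.upper, letters))
print(shouted)  # ['A', 'B', 'C']
```

Methods that take only self, such as str.upper or str.strip, receive the piped value as their first argument, so they need no placeholder.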

Operators

The same goes for operators, such as +, *, [] etc. We just have to use the operator module in the standard library:

from operator import add, mul, getitem

PipeChain(5).pipe(mul, 3).eval()

15

PipeChain(5).pipe(add, 3).eval()

8

PipeChain(["a", "b", "c"]).pipe(getitem, 1).eval()

'b'
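The operator module also provides itemgetter and methodcaller, which build single-argument callables. Since .pipe() accepts any function, these should fit the default first-argument convention without extra arguments, although that pairing is an assumption here and is not shown in the project's own examples. The sketch below exercises the callables on their own:

```python
from operator import itemgetter, methodcaller

# itemgetter(1) returns a callable equivalent to lambda seq: seq[1]
second = itemgetter(1)

# methodcaller("replace", " ", "") is equivalent to lambda s: s.replace(" ", "")
strip_spaces = methodcaller("replace", " ", "")

print(second(["a", "b", "c"]))        # 'b'
print(strip_spaces("0491 570 156"))   # '0491570156'
```

With these, a step such as removing spaces could be written without a lambda, for example by passing strip_spaces to map.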

Test Suite

Note: you will need Poetry installed.

To run the test suite, use:

git clone https://github.com/multimeric/PipeChain.git
cd PipeChain
poetry install
poetry run pytest test/test.py


Owner

Michael Milton
