Tools for exploratory data analysis in Python

Overview

Dora

Exploratory data analysis toolkit for Python.

Contents

Summary

Dora is a Python library designed to automate the painful parts of exploratory data analysis.

The library contains convenience functions for data cleaning, feature selection & extraction, visualization, partitioning data for model validation, and versioning transformations of data.

The library uses and is intended to be a helpful addition to common Python data analysis tools such as pandas, scikit-learn, and matplotlib.

Setup

$ pip3 install Dora
$ python3
>>> from Dora import Dora

Usage

Reading Data & Configuration

# without initial config
>>> dora = Dora()
>>> dora.configure(output = 'A', data = 'path/to/data.csv')

# is the same as
>>> import pandas as pd
>>> dataframe = pd.read_csv('path/to/data.csv')
>>> dora = Dora(output = 'A', data = dataframe)

>>> dora.data
   A   B  C      D  useless_feature
0  1   2  0   left                1
1  4 NaN  1  right                1
2  7   8  2   left                1

Cleaning

# read data with missing and poorly scaled values
>>> import pandas as pd
>>> df = pd.DataFrame([
...   [1, 2, 100],
...   [2, None, 200],
...   [1, 6, None]
... ])
>>> dora = Dora(output = 0, data = df)
>>> dora.data
   0   1    2
0  1   2  100
1  2 NaN  200
2  1   6  NaN

# impute the missing values (using the average of each column)
>>> dora.impute_missing_values()
>>> dora.data
   0  1    2
0  1  2  100
1  2  4  200
2  1  6  150

# scale the values of the input variables (center to mean and scale to unit variance)
>>> dora.scale_input_values()
>>> dora.data
   0         1         2
0  1 -1.224745 -1.224745
1  2  0.000000  1.224745
2  1  1.224745  0.000000

Feature Selection & Extraction

# feature selection / removing a feature
>>> dora.data
   A   B  C      D  useless_feature
0  1   2  0   left                1
1  4 NaN  1  right                1
2  7   8  2   left                1

>>> dora.remove_feature('useless_feature')
>>> dora.data
   A   B  C      D
0  1   2  0   left
1  4 NaN  1  right
2  7   8  2   left

# extract an ordinal feature through one-hot encoding
>>> dora.extract_ordinal_feature('D')
>>> dora.data
   A   B  C  D=left  D=right
0  1   2  0       1        0
1  4 NaN  1       0        1
2  7   8  2       1        0

# extract a transformation of another feature
>>> dora.extract_feature('C', 'twoC', lambda x: x * 2)
>>> dora.data
   A   B  C  D=left  D=right  twoC
0  1   2  0       1        0     0
1  4 NaN  1       0        1     2
2  7   8  2       1        0     4

Visualization

# plot a single feature against the output variable
dora.plot_feature('column-name')

# render plots of each feature against the output variable
dora.explore()

Model Validation

# create random partition of training / validation data (~ 80/20 split)
dora.set_training_and_validation()

# train a model on the data
X = dora.training_data[dora.input_columns()]
y = dora.training_data[dora.output]

some_model.fit(X, y)

# validate the model
X = dora.validation_data[dora.input_columns()]
y = dora.validation_data[dora.output]

some_model.score(X, y)

Data Versioning

# save a version of your data
>>> dora.data
   A   B  C      D  useless_feature
0  1   2  0   left                1
1  4 NaN  1  right                1
2  7   8  2   left                1
>>> dora.snapshot('initial_data')

# keep track of changes to data
>>> dora.remove_feature('useless_feature')
>>> dora.extract_ordinal_feature('D')
>>> dora.impute_missing_values()
>>> dora.scale_input_values()
>>> dora.data
   A         B         C    D=left   D=right
0  1 -1.224745 -1.224745  0.707107 -0.707107
1  4  0.000000  0.000000 -1.414214  1.414214
2  7  1.224745  1.224745  0.707107 -0.707107

>>> dora.logs
["self.remove_feature('useless_feature')", "self.extract_ordinal_feature('D')", 'self.impute_missing_values()', 'self.scale_input_values()']

# use a previous version of the data
>>> dora.snapshot('transform1')
>>> dora.use_snapshot('initial_data')
>>> dora.data
   A   B  C      D  useless_feature
0  1   2  0   left                1
1  4 NaN  1  right                1
2  7   8  2   left                1
>>> dora.logs
[]

# switch back to your transformation
>>> dora.use_snapshot('transform1')
>>> dora.data
   A         B         C    D=left   D=right
0  1 -1.224745 -1.224745  0.707107 -0.707107
1  4  0.000000  0.000000 -1.414214  1.414214
2  7  1.224745  1.224745  0.707107 -0.707107
>>> dora.logs
["self.remove_feature('useless_feature')", "self.extract_ordinal_feature('D')", 'self.impute_missing_values()', 'self.scale_input_values()']

Testing

To run the test suite, simply run python3 spec.py from the Dora directory.

Contribute

Pull requests welcome! Feature requests / bugs will be addressed through issues on this repository. While not every feature request will necessarily be handled by me, maintaining a record for interested contributors is useful.

Additionally, feel free to submit pull requests which add features or address bugs yourself.

License

The MIT License (MIT)

Copyright (c) 2016 Nathan Epstein

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Owner
Nathan Epstein
Nathan Epstein
Tools for calculating and visualizing Elo-like ratings of MLB teams using Retosheet data

Overview This project uses historical baseball games data to calculate an Elo-like rating for MLB teams based on regular season match ups. The Elo rat

Lukas Owens 0 Aug 25, 2021
Epagneul is a tool to visualize and investigate windows event logs

epagneul Epagneul is a tool to visualize and investigate windows event logs. Dep

jurelou 190 Dec 13, 2022
Python module for drawing and rendering beautiful atoms and molecules using Blender.

Batoms is a Python package for editing and rendering atoms and molecules objects using blender. A Python interface that allows for automating workflows.

Xing Wang 1 Jul 06, 2022
Flame Graphs visualize profiled code

Flame Graphs visualize profiled code

Brendan Gregg 14.1k Jan 03, 2023
Python Data. Leaflet.js Maps.

folium Python Data, Leaflet.js Maps folium builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the Leaflet.js

6k Jan 02, 2023
The interactive graphing library for Python (includes Plotly Express) :sparkles:

plotly.py Latest Release User forum PyPI Downloads License Data Science Workspaces Our recommended IDE for Plotly’s Python graphing library is Dash En

Plotly 12.7k Jan 05, 2023
Draw interactive NetworkX graphs with Altair

nx_altair Draw NetworkX graphs with Altair nx_altair offers a similar draw API to NetworkX but returns Altair Charts instead. If you'd like to contrib

Zachary Sailer 206 Dec 12, 2022
The implementation of the paper "HIST: A Graph-based Framework for Stock Trend Forecasting via Mining Concept-Oriented Shared Information".

The HIST framework for stock trend forecasting The implementation of the paper "HIST: A Graph-based Framework for Stock Trend Forecasting via Mining C

Wentao Xu 111 Jan 03, 2023
Certificate generating and sending system written in Python.

Certificate Generator & Sender How to use git clone https://github.com/saadhaxxan/Certificate-Generator-Sender.git cd Certificate-Generator-Sender Add

Saad Hassan 11 Dec 01, 2022
Pglive - Pglive package adds support for thread-safe live plotting to pyqtgraph

Live pyqtgraph plot Pglive package adds support for thread-safe live plotting to

Martin Domaracký 15 Dec 10, 2022
CPG represent!

CoolPandasGroup CPG represent! Arianna Brandon Enne Luan Tracie Project requirements: use Pandas to clean and format datasets use Jupyter Notebook to

Enne 3 Feb 07, 2022
The Metabolomics Integrator (MINT) is a post-processing tool for liquid chromatography-mass spectrometry (LCMS) based metabolomics.

MINT (Metabolomics Integrator) The Metabolomics Integrator (MINT) is a post-processing tool for liquid chromatography-mass spectrometry (LCMS) based m

Sören Wacker 0 May 04, 2022
Blender addon that creates a temporary window of any type from the 3D View.

CreateTempWindow2.8 Blender addon that creates a temporary window of any type from the 3D View. Features Can the following window types: 3D View Graph

3 Nov 27, 2022
Geospatial Data Visualization using PyGMT

Example script to visualize topographic data, earthquake data, and tomographic data on a map

Utpal Kumar 2 Jul 30, 2022
🎨 Python Echarts Plotting Library

pyecharts Python ❤️ ECharts = pyecharts English README 📣 简介 Apache ECharts (incubating) 是一个由百度开源的数据可视化,凭借着良好的交互性,精巧的图表设计,得到了众多开发者的认可。而 Python 是一门富有表达

pyecharts 13.1k Jan 03, 2023
Lime: Explaining the predictions of any machine learning classifier

lime This project is about explaining what machine learning classifiers (or models) are doing. At the moment, we support explaining individual predict

Marco Tulio Correia Ribeiro 10.3k Dec 29, 2022
Python Package for CanvasXpress JS Visualization Tools

CanvasXpress Python Library About CanvasXpress for Python CanvasXpress was developed as the core visualization component for bioinformatics and system

Dr. Todd C. Brett 5 Nov 07, 2022
Make visual music sheets for thatskygame (graphical representations of the Sky keyboard)

sky-python-music-sheet-maker This program lets you make visual music sheets for Sky: Children of the Light. It will ask you a few questions, and does

21 Aug 26, 2022
This project is an Algorithm Visualizer where a user can visualize algorithms like Bubble Sort, Merge Sort, Quick Sort, Selection Sort, Linear Search and Binary Search.

Algo_Visualizer This project is an Algorithm Visualizer where a user can visualize common algorithms like "Bubble Sort", "Merge Sort", "Quick Sort", "

Rahul 4 Feb 07, 2022
🧇 Make Waffle Charts in Python.

PyWaffle PyWaffle is an open source, MIT-licensed Python package for plotting waffle charts. It provides a Figure constructor class Waffle, which coul

Guangyang Li 528 Jan 02, 2023