Semi-Automated Data Processing

Overview

Semi-Automated Data Processing

Preparing data for model learning is one of the most important steps in any project—and traditionally, one of the most time consuming. Data Analysis plays a very important role in the entire Data Science Workflow. In fact, this takes most of the time of the Data science Workflow. There’s a nice quote (not sure who said it)According to Wikipedia, In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.**“In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.”*

This projects handles the task with minimal user interaction by analyzing your data and identifying fixes, screening out fields that are problematic or not likely to be useful, deriving new attributes when appropriate, and improving performance through intelligent screening techniques. You can use the project in semi-interactive fashion, previewing the changes before they are made and accept or reject them as you want.

This project cover the 3 steps in any project workflow, comes before the model training:
1) Exploratory data analysis
2) Feature engineering
3) Feature selection


All these steps has to be carried out by the user by calling the several functions as follows:

1) identify_feature(data)=
This function identifies the categorical, continuous numerical and discrete numerical features in the datset. It also identifies datetime feature and extracts the relevant info from it.

Input:
data=Dataset

Output:
df=Dataset
data_cont_num_feature= List of features names associated containing continuous numerical values
data_dis_num_feature=List of features names associated containing discrete numerical values
data_cat_feature=List of features names associated containing categorical values
dt_feature=List of features names associated containing datetime values

2) plot_nan_feature(data, continuous_features, discrete_features, categorical_features,dependent_var)=
It identifies the missing values in the dataset and visualize them their impact on dependent feature.

Input:
data=Dataset
continuous_features= List of features names associated containing continuous numerical values
discrete_features=List of features names associated containing discrete numerical values
categorical_features=List of features names associated containing categorical values
dependent_var= Dependent feature name in string format

Output:
df= Dataset
nan_features= List of feature names containing NaN values

3) visualize_imputation_impact(data,continuous_features, discrete_features, categorical_features,nan_features,dependent_var):
The function visualizes the impact of different NaN value impution on the distribution of values the feature.

Input:
data=Dataset
continuous_features= List of features names associated containing continuous numerical values
discrete_features=List of features names associated containing discrete numerical values
categorical_features=List of features names associated containing categorical values
nan_features= List of feature names containing NaN values
dependent_var= Dependent feature name in string format

Output:
None

4) nan_imputation(data,mean_feature,median_feature,mode_feature,random_feature,new_category):
The function imputes the NaN values in the feature as per the user input.

Input:
data=Dataset
mean_feature= List of feature names in which we have to carry out mean_imputation
median_feature=List of feature names in which we have to carry out median_imputation
mode_feature=List of feature names in which we have to carry out mode_imputation
random_feature=List of feature names in which we have to carry out random_imputation
new_category=List of feature names in which we we create a new category for the NaN values

Output:
None

5) cross_visualization(data,continuous_features,discrete_features, categorical_features,dt_features):
The function visualise the relationship between the different independent features.

Input:
df=Dataset
data_cont_num_feature= List of features names associated containing continuous numerical values
data_dis_num_feature=List of features names associated containing discrete numerical values
data_cat_feature=List of features names associated containing categorical values
dt_feature=List of features names associated containing datetime values

Output:
continuous_features2=List of features names associated containing continuous numerical values, except the dependent feature

6) dependent_independent_visualization(data,continuous_features,discrete_features, categorical_features,dt_features,dependent_feature):
The function visualise the relationship between the different independent features.

Input:
data_cont_num_feature= List of features names associated containing continuous numerical values
data_dis_num_feature=List of features names associated containing discrete numerical values
data_cat_feature=List of features names associated containing categorical values
dt_feature=List of features names associated containing datetime values
dependent_var= Dependent feature name in string format

Output:
None

7) outlier_removal(data,continuous_features,discrete_features,dependent_var,dependent_var_type,action):
The function visualizes the outlliers using the boxplot and removes them.

Input:
data=Dataset
continuous_features= List of features names associated containing continuous numerical values
discrete_features=List of features names associated containing discrete numerical values
dependent_var= Dependent feature name in string format
dependent_var_type= Contain string tells if the problem is regression (than use 'Regression') or else
action= Give input as 'remove' to delete the rows associated with the outliers

Output:
df=Dataset

8) transformation_visualization(data,continuous_features,discrete_features,dependent_feature):
The function visualize the feature after performing various transormation techniques.

Input:
data=Dataset
continuous_features= List of features names associated containing continuous numerical values
discrete_features=List of features names associated containing discrete numerical values
dependent_feature= Dependent feature name in string format

Output:
None

9) feature_transformation(train_data,continuous_features,discrete_features,transformation,dependent_feature):
The function performing the feature transormation technique as per the user input.

Input:
train_data=Training dataset
continuous_features= List of features names associated containing continuous numerical values
discrete_features=List of features names associated containing discrete numerical values
transformation=Type of transformation: none=No transformation, log=Log Transformation, sqrt= Square root Transformation, reciprocal= Reciprocal Transformation, exp= Exponential Transformation, boxcox=Boxcox Transformation
dependent_feature= Dependent feature name in string format

Output:
X_data=Training dataset

10) categorical_transformation(train_data,categorical_encoding):
This function transforms the categorical featres in the numerical ones using encoding techniques.

Input:
train_data=Training dataset
categorical_encoding={'one_hot_encoding':[],'frequency_encoding':[],'mean_encoding':[],'target_guided_ordinal_encoding':{}}

Output:
X_data=Training dataset

11a) feature_selection(Xtrain,ytrain, threshold, data_type, filter_type):
This function performs the feature selection based on the dependent and independent features in train dataset.

Input:
Xtrain=Training dataset
ytrain=dependent data in training dataset
threshold= Threshold for the correlation
{'in_num_out_num':{'linear':['pearson'],'non-linear':['spearman']},
'in_num_out_cat':{'linear':['ANOVA'],'non-linear':['kendall']},
'in_cat_out_num':{'linear':['ANOVA'],'non-linear':['kendall']},
'in_cat_out_cat':{'chi_square_test':True,'mutual_info':True},}
data_type= Data linear or non-linearly dependent on the output label
filter_type= If input data is numerical and output is numerical then --'in_num_out_num' as shown in the above dictionary

Output:
Xtrain= Training dataset
feature_df= Dataframe containig features with their pvalue

11b) feature_selection(Xtrain,ytrain,Xtest,ytest, threshold, data_type, filter_type):
This function performs the feature selection based on the dependent and independent features in train dataset.

Input:
Xtrain=Training dataset
ytrain=dependent data in training dataset
Xtest=Test dataset
ytest=dependent data in test dataset
threshold= Threshold for the correlation
{'in_num_out_num':{'linear':['pearson'],'non-linear':['spearman']},
'in_num_out_cat':{'linear':['ANOVA'],'non-linear':['kendall']},
'in_cat_out_num':{'linear':['ANOVA'],'non-linear':['kendall']},
'in_cat_out_cat':{'chi_square_test':True,'mutual_info':True},}
data_type= Data linear or non-linearly dependent on the output label
filter_type= If input data is numerical and output is numerical then --'in_num_out_num' as shown in the above dictionary

Output:
Xtrain= Training dataset
Xtest= Test dataset
feature_df= Dataframe containig features with their pvalue

12) convert_dtype(data,categorical_features):
This function converts the categorical fetaures containing the numeric values but presented as categorical into the int format.

Input:
data= Dataset
categorical_features=List of features names associated containing categorical values

Output:
df=Dataset

Note:
Use same paramters for both train and test dataset for better accuracy


We have implemented a bike sharing project to describe how the functions can be used for both the classification and regression problem statement.

Owner
Arun Singh Babal
Engineer | Data Science Enthusiasts | Machine Learning | Deep Learning | Advanced Computer Vision.
Arun Singh Babal
Create HTML profiling reports from pandas DataFrame objects

Pandas Profiling Documentation | Slack | Stack Overflow Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great

10k Jan 01, 2023
Making the DAEN information accessible.

The purpose of this repository is to make the information on Australian COVID-19 adverse events accessible. The Therapeutics Goods Administration (TGA) keeps a database of adverse reactions to medica

10 May 10, 2022
scikit-survival is a Python module for survival analysis built on top of scikit-learn.

scikit-survival scikit-survival is a Python module for survival analysis built on top of scikit-learn. It allows doing survival analysis while utilizi

Sebastian Pölsterl 876 Jan 04, 2023
Sensitivity Analysis Library in Python (Numpy). Contains Sobol, Morris, Fractional Factorial and FAST methods.

Sensitivity Analysis Library (SALib) Python implementations of commonly used sensitivity analysis methods. Useful in systems modeling to calculate the

SALib 663 Jan 05, 2023
Numerical Analysis toolkit centred around PDEs, for demonstration and understanding purposes not production

Numerics Numerical Analysis toolkit centred around PDEs, for demonstration and understanding purposes not production Use procedure: Initialise a new i

George Whittle 1 Nov 13, 2021
Predictive Modeling & Analytics on Home Equity Line of Credit

Predictive Modeling & Analytics on Home Equity Line of Credit Data (Python) HMEQ Data Set In this assignment we will use Python to examine a data set

Dhaval Patel 1 Jan 09, 2022
Ejercicios Panda usando Pandas

Readme Below we add configuration details to locally test your application To co

1 Jan 22, 2022
Open source platform for Data Science Management automation

Hydrosphere examples This repo contains demo scenarios and pre-trained models to show Hydrosphere capabilities. Data and artifacts management Some mod

hydrosphere.io 6 Aug 10, 2021
Pypeln is a simple yet powerful Python library for creating concurrent data pipelines.

Pypeln Pypeln (pronounced as "pypeline") is a simple yet powerful Python library for creating concurrent data pipelines. Main Features Simple: Pypeln

Cristian Garcia 1.4k Dec 31, 2022
BasstatPL is a package for performing different tabulations and calculations for descriptive statistics.

BasstatPL is a package for performing different tabulations and calculations for descriptive statistics. It provides: Frequency table constr

Angel Chavez 1 Oct 31, 2021
Intercepting proxy + analysis toolkit for Second Life compatible virtual worlds

Hippolyzer Hippolyzer is a revival of Linden Lab's PyOGP library targeting modern Python 3, with a focus on debugging issues in Second Life-compatible

Salad Dais 6 Sep 01, 2022
Monitor the stability of a pandas or spark dataframe ⚙︎

Population Shift Monitoring popmon is a package that allows one to check the stability of a dataset. popmon works with both pandas and spark datasets.

ING Bank 403 Dec 07, 2022
This creates a ohlc timeseries from downloaded CSV files from NSE India website and makes a SQLite database for your research.

NSE-timeseries-form-CSV-file-creator-and-SQL-appender- This creates a ohlc timeseries from downloaded CSV files from National Stock Exchange India (NS

PILLAI, Amal 1 Oct 02, 2022
Deep universal probabilistic programming with Python and PyTorch

Getting Started | Documentation | Community | Contributing Pyro is a flexible, scalable deep probabilistic programming library built on PyTorch. Notab

7.7k Dec 30, 2022
TheMachineScraper 🐱‍👤 is an Information Grabber built for Machine Analysis

TheMachineScraper 🐱‍👤 is a tool made purely for analysing machine data for any reason.

doop 5 Dec 01, 2022
HyperSpy is an open source Python library for the interactive analysis of multidimensional datasets

HyperSpy is an open source Python library for the interactive analysis of multidimensional datasets that can be described as multidimensional arrays o

HyperSpy 411 Dec 27, 2022
A DSL for data-driven computational pipelines

"Dataflow variables are spectacularly expressive in concurrent programming" Henri E. Bal , Jennifer G. Steiner , Andrew S. Tanenbaum Quick overview Ne

1.9k Jan 03, 2023
A data structure that extends pyspark.sql.DataFrame with metadata information.

MetaFrame A data structure that extends pyspark.sql.DataFrame with metadata info

Invent Analytics 8 Feb 15, 2022
ForecastGA is a Python tool to forecast Google Analytics data using several popular time series models.

ForecastGA is a tool that combines a couple of popular libraries, Atspy and googleanalytics, with a few enhancements.

JR Oakes 36 Jan 03, 2023
Methylation/modified base calling separated from basecalling.

Remora Methylation/modified base calling separated from basecalling. Remora primarily provides an API to call modified bases for basecaller programs s

Oxford Nanopore Technologies 72 Jan 05, 2023