Using Data Science with Machine Learning techniques (ETL pipeline and ML pipeline) to classify received messages after disasters.

Last update: Feb 11, 2022

Overview

Disaster Response Pipeline Project

Introduction

Project Description:

In this project, I analyzed the attached dataset files, which contain tweets and messages from real-life disaster responses. The aim of the project is to build a Natural Language Processing tool or API that classifies received messages, as in the following sample screenshot.

Preprocessing

The preprocessing stage is found in data/process_data.py. It contains an ETL pipeline that does the following:

Reads data from the CSV files disaster_messages.csv and disaster_categories.csv.
Merges the messages and categories datasets.
Cleans the merged dataframe.
Removes duplicated messages.
Stores the cleaned data in data/DisasterResponse.db.
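The ETL steps above can be sketched roughly as follows. This is a minimal illustration, not the code in data/process_data.py: it assumes the two CSV files share an `id` column, uses made-up in-memory rows instead of reading the real files, and writes to a hypothetical table named `messages` in an in-memory SQLite database.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def merge_and_clean(messages: pd.DataFrame, categories: pd.DataFrame) -> pd.DataFrame:
    """Merge the two tables on the assumed shared `id` column and drop duplicate rows."""
    merged = messages.merge(categories, on="id", how="inner")
    return merged.drop_duplicates().reset_index(drop=True)


def store(df: pd.DataFrame, conn: sqlite3.Connection, table: str = "messages") -> None:
    """Write the cleaned dataframe to a SQLite table, replacing any existing table."""
    df.to_sql(table, conn, index=False, if_exists="replace")


# Made-up stand-ins for pd.read_csv("data/disaster_messages.csv") and
# pd.read_csv("data/disaster_categories.csv"). The column names are assumptions.
messages = pd.DataFrame(
    {
        "id": [1, 2, 2],
        "message": ["We need water", "Roads are blocked", "Roads are blocked"],
        "genre": ["direct", "news", "news"],
    }
)
categories = pd.DataFrame({"id": [1, 2], "water": [1, 0]})

cleaned = merge_and_clean(messages, categories)

# An in-memory database keeps the sketch self-contained; the project itself
# stores its data in data/DisasterResponse.db.
with closing(sqlite3.connect(":memory:")) as conn:
    store(cleaned, conn)
    stored_rows = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]

print(len(cleaned), stored_rows)
```

The duplicated message row is dropped after the merge, so the number of rows written to the database matches the number of unique messages.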

Machine Learning Pipeline

The ML pipeline is implemented in models/train_classifier.py. It does the following:

Loads the data from data/DisasterResponse.db.
Splits the dataframe into training and testing sets.
Implements a tokenize() function that cleans the message data and tokenizes it for the TF-IDF calculations.
Builds pipelines for text processing and machine learning.
Selects parameters with GridSearchCV.
Stores the trained classifier in models/classifier.pkl.
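The steps above might look roughly like the sketch below. Several parts are assumptions rather than the contents of models/train_classifier.py: the README does not name the tokenizer library or the classifier, so this uses a plain regular-expression `tokenize()` and a `RandomForestClassifier`, and the messages, the two label columns, and the parameter grid are made up for illustration.

```python
import re

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline


def tokenize(text):
    """Lowercase the text, replace non-alphanumeric runs with spaces, and split."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


# Made-up messages with two hypothetical binary labels: [water, shelter].
X = [
    "We need water",
    "No drinking water here",
    "Our house collapsed, we need shelter",
    "Looking for a tent to sleep in",
    "Water and tents are needed",
    "People are thirsty",
    "Families are sleeping outside without shelter",
    "Please send bottled water",
]
y = np.array(
    [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 0], [0, 1], [1, 0]]
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Text processing (TF-IDF over the custom tokens) followed by the classifier.
pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer(tokenizer=tokenize, token_pattern=None)),
        ("clf", RandomForestClassifier(random_state=0)),
    ]
)

# A deliberately tiny grid; a real search would cover more parameters.
search = GridSearchCV(pipeline, param_grid={"clf__n_estimators": [5, 10]}, cv=2)
search.fit(X_train, y_train)

predictions = search.predict(X_test)
print(search.best_params_, predictions.shape)
```

The fitted `search.best_estimator_` is the object that would then be serialized, for example with `pickle`, to models/classifier.pkl.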

Flask App

The Flask app is implemented in the app folder. The main page gives a data overview, as shown in the attached images. The main purpose is to enter a message in the message box; the app then categorizes the message into its genre.
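As a rough sketch of how such a page could hand a message to a classifier, the example below defines a single Flask route. Everything named here is hypothetical: the `/classify` route, the `query` parameter, and the keyword-based `predict_labels()` function, which only stands in for the trained model stored in models/classifier.pkl.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical keyword lookup standing in for the trained classifier.
KEYWORDS = {
    "water": "water",
    "thirsty": "water",
    "shelter": "shelter",
    "tent": "shelter",
}


def predict_labels(message):
    """Return the sorted labels whose keywords appear in the message."""
    words = message.lower().split()
    return sorted({KEYWORDS[word] for word in words if word in KEYWORDS})


@app.route("/classify")
def classify():
    """Read the message from the query string and return its labels as JSON."""
    message = request.args.get("query", "")
    return jsonify({"message": message, "labels": predict_labels(message)})


# Exercise the route with Flask's test client, without starting a server.
with app.test_client() as client:
    response = client.get(
        "/classify", query_string={"query": "We need water and a tent"}
    )
    result = response.get_json()

print(result)
```

In the real app, the stand-in function would be replaced by a call to the loaded model's `predict()` method, and the server would be started with `app.run(host="0.0.0.0", port=3001)` to match the address given in the instructions.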

Data Overview:

Over 20,000 messages are related to a disaster.

News messages are the most common genre, while social media messages are the least common.
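A genre comparison like this one can be computed with a per-genre count. The sketch below uses a handful of made-up rows and assumes a column named `genre`; the values are illustrative and are not the real dataset counts.

```python
import pandas as pd

# Made-up rows; the real data would be read from data/DisasterResponse.db.
df = pd.DataFrame(
    {"genre": ["news", "news", "news", "direct", "direct", "social"]}
)

# Count the messages per genre, sorted from most to least frequent.
genre_counts = df["genre"].value_counts()
most_common = genre_counts.idxmax()
least_common = genre_counts.idxmin()

print(genre_counts.to_dict(), most_common, least_common)
```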

The target features of the messages are distributed as follows:

Instructions:

Run the following commands in the project's root directory to set up your database and model.
- To run the ETL pipeline that cleans the data and stores it in the database:
  python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db
- To run the ML pipeline that trains the classifier and saves it:
  python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl
Run the following command in the app's directory to run your web app:
  python run.py
Go to http://0.0.0.0:3001/

