A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime Data - Batch Processing:

RDBMS Data Extraction Implementation

This project implements a solution that extracts, transforms, and loads Chicago crime data from an RDS instance into other AWS services.

  • The project contains an Airflow DAG script, two PySpark application scripts, and a bootstrap actions script, each described below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance from the EC2 instance, and the data is then loaded into that table.
    • The create&Load.sql file contains the SQL for this table preparation step.
    • A secret is stored in AWS Secrets Manager so that the RDS instance can be accessed securely. Password rotation after 30 days is also configured for security.
  • The DAG described below loads the data prepared in these steps into the AWS environment.
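As a rough illustration of how a script could read the stored secret, here is a minimal sketch. It assumes the secret uses the JSON layout that Secrets Manager generates for RDS credentials (`username`, `password`, `host`, `port`). The function names, secret id, and region are hypothetical and do not come from the project.

```python
import json


def parse_rds_secret(secret_string: str) -> dict:
    """Extract connection fields from a Secrets Manager JSON secret string.

    Assumes the key names used by RDS-managed secrets; adjust if the
    project's secret is laid out differently.
    """
    secret = json.loads(secret_string)
    return {
        "host": secret["host"],
        "port": int(secret.get("port", 3306)),  # 3306 is the MySQL default
        "username": secret["username"],
        "password": secret["password"],
    }


def fetch_rds_secret(secret_id: str, region_name: str) -> dict:
    """Fetch and parse the secret; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so parsing works without AWS access

    client = boto3.client("secretsmanager", region_name=region_name)
    response = client.get_secret_value(SecretId=secret_id)
    return parse_rds_secret(response["SecretString"])
```

Because the password is fetched at run time, a rotated password is picked up on the next run without editing any script.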

Implementation:

  • The Airflow DAG is placed in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created in a specific VPC on the Amazon Managed Workflows for Apache Airflow (MWAA) console.
  • The DAG is scheduled to run daily, with SLA monitoring that triggers an alarm if the whole ETL process takes more than 36 minutes.
  • A run usually finishes in 32-34 minutes. If it takes longer, something has interrupted the DAG, and the logs can be checked to find the cause.
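The daily schedule and 36-minute SLA could be expressed roughly as follows. This is a sketch assuming Airflow 2.x, where `sla` and `sla_miss_callback` are available. The DAG id, owner, start date, and callback body are hypothetical, and the project's real alarm mechanism is not shown in the source.

```python
from datetime import datetime, timedelta

# 36 minutes is the alarm threshold stated above.
ETL_SLA = timedelta(minutes=36)


def build_default_args() -> dict:
    """Task-level defaults; every task in the DAG inherits the SLA."""
    return {
        "owner": "airflow",  # hypothetical owner name
        "depends_on_past": False,
        "retries": 0,
        "sla": ETL_SLA,
    }


def log_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Callback Airflow invokes when one or more tasks miss their SLA."""
    print(f"SLA missed in DAG {dag.dag_id}: {task_list}")


def build_dag():
    """Create the daily DAG; needs Airflow 2.x installed."""
    from airflow import DAG  # imported lazily so the helpers work without Airflow

    return DAG(
        dag_id="crime_etl_daily",  # hypothetical DAG id
        default_args=build_default_args(),
        start_date=datetime(2022, 1, 1),  # hypothetical start date
        schedule_interval="@daily",
        catchup=False,
        sla_miss_callback=log_sla_miss,
    )
```

Airflow evaluates `sla` per task, so applying the same 36-minute value to every task only approximates a limit on the whole run; the final task's SLA is the one that effectively tracks end-to-end duration.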

emr_job_flow_manual_steps_dag.py

This script defines the Airflow DAG.

Description

  • The script contains the tasks Airflow uses to create an EMR cluster on AWS for the processing described in the following sections.
  • It runs the steps that submit the Spark scripts on EMR. The cluster is created with the bootstrap actions in bootstrap_actions.sh, stored in an S3 bucket, which install required packages such as boto3 on the EMR instances.
  • A step sensor is added to watch this process. It periodically checks whether the last step has completed, been skipped, or been terminated.
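The cluster, bootstrap action, Spark step, and step sensor described above might be wired together as in the sketch below. It assumes the `apache-airflow-providers-amazon` package. The S3 paths are placeholders, and the cluster name, release label, instance type, and IAM role names are assumptions rather than values taken from the project.

```python
# Placeholder S3 locations; the source does not give the real paths.
BOOTSTRAP_SCRIPT = "s3://<bucket>/<prefix>/bootstrap_actions.sh"
INGEST_SCRIPT = "s3://<bucket>/<prefix>/spark_ingest_script.py"

# Templated reference to the cluster id returned by the create task.
JOB_FLOW_ID = "{{ task_instance.xcom_pull(task_ids='create_job_flow', key='return_value') }}"


def spark_step(name: str, script_uri: str) -> dict:
    """Build one EMR step that runs a PySpark script via spark-submit."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def job_flow_overrides() -> dict:
    """Cluster definition, including the bootstrap action (values assumed)."""
    return {
        "Name": "crime-etl-cluster",
        "ReleaseLabel": "emr-6.5.0",
        "Applications": [{"Name": "Spark"}],
        "BootstrapActions": [
            {
                "Name": "install-python-packages",
                "ScriptBootstrapAction": {"Path": BOOTSTRAP_SCRIPT},
            }
        ],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary node",
                    "Market": "ON_DEMAND",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 1,
                }
            ],
            # Keep the cluster up so steps can be added after creation;
            # a terminate task (not shown) would then be needed.
            "KeepJobFlowAliveWhenNoSteps": True,
            "TerminationProtected": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def add_emr_tasks(dag):
    """Attach create-cluster, add-step, and watch-step tasks to a DAG.

    Older provider releases expose these classes under different module paths.
    """
    from airflow.providers.amazon.aws.operators.emr import (
        EmrAddStepsOperator,
        EmrCreateJobFlowOperator,
    )
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

    create = EmrCreateJobFlowOperator(
        task_id="create_job_flow",
        job_flow_overrides=job_flow_overrides(),
        dag=dag,
    )
    add = EmrAddStepsOperator(
        task_id="add_steps",
        job_flow_id=JOB_FLOW_ID,
        steps=[spark_step("ingest", INGEST_SCRIPT)],
        dag=dag,
    )
    watch = EmrStepSensor(
        task_id="watch_step",
        job_flow_id=JOB_FLOW_ID,
        # add_steps returns a list of step ids; index 0 is the only step here.
        step_id="{{ task_instance.xcom_pull(task_ids='add_steps', key='return_value')[0] }}",
        dag=dag,
    )
    create >> add >> watch
    return create, add, watch
```

Keeping the cluster and step definitions as plain dictionaries makes them easy to inspect or unit test without a running Airflow environment.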

spark_ingest_script.py

This Spark script, which is uploaded to S3 manually, ingests the required data from a table on an RDS instance, stores it in a raw S3 bucket, and catalogs it in Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It reads the required crime data from the table into a Spark DataFrame, which is then written to S3 and registered in the Glue Data Catalog.
  • S3 file structure where the snapshot data is saved:
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition
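A minimal sketch of the ingest flow is shown below. It assumes the connection goes through Spark's JDBC data source with the MySQL Connector/J driver, that the output format is Parquet, and that the EMR cluster uses the Glue Data Catalog as its metastore. None of these details are confirmed by the source, and all function and parameter names are hypothetical.

```python
def jdbc_options(host, port, db_name, table_name, username, password) -> dict:
    """Build the option map for Spark's JDBC reader (MySQL assumed)."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{db_name}",
        "dbtable": table_name,
        "user": username,
        "password": password,
        "driver": "com.mysql.cj.jdbc.Driver",
    }


def snapshot_path(bucket, key, db_name, table_name) -> str:
    """Compose the S3 location following bucket/key/db-name/table-name."""
    return f"s3://{bucket}/{key}/{db_name}/{table_name}/"


def ingest(spark, options: dict, target_path: str, glue_table: str) -> None:
    """Read the RDS table and write it to S3 as a cataloged external table.

    `spark` is an existing SparkSession; `glue_table` is a
    "database.table" name in the catalog.
    """
    df = spark.read.format("jdbc").options(**options).load()
    (
        df.write.mode("overwrite")
        .format("parquet")
        .option("path", target_path)  # explicit path makes the table external
        .saveAsTable(glue_table)
    )
```

For example, `snapshot_path("raw-bucket", "snapshots", "crimes_db", "crimes")` yields `s3://raw-bucket/snapshots/crimes_db/crimes/`, which mirrors the structure listed above.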

spark_process_script.py

This Spark script, which is uploaded to S3 manually, queries the latest target table, filters the required crime details, stores the query results in a new final table, and saves them to a latest partition.

Description

  • The script runs query processing on the ingested crime data.
  • It queries the required crime data from the table, processes it into a Spark DataFrame, and writes the result to S3 and the Glue Data Catalog.
  • S3 file structure where the snapshot data is saved:
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition
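The processing step could look roughly like the sketch below, which counts crimes by type where no arrest has been made. The column names `primary_type` and `arrest` are assumptions about the table schema, and the function names, Parquet format, and table names are hypothetical.

```python
def unarrested_by_type_query(source_table: str) -> str:
    """Build a Spark SQL query counting crimes without an arrest, by type."""
    return (
        "SELECT primary_type, COUNT(*) AS crime_count "
        f"FROM {source_table} "
        "WHERE arrest = false "
        "GROUP BY primary_type "
        "ORDER BY crime_count DESC"
    )


def process(spark, source_table: str, target_table: str, target_path: str) -> None:
    """Run the query and store the result as a new cataloged table.

    `spark` is an existing SparkSession with access to the catalog
    that holds `source_table`.
    """
    result = spark.sql(unarrested_by_type_query(source_table))
    (
        result.write.mode("overwrite")
        .format("parquet")
        .option("path", target_path)  # explicit path makes the table external
        .saveAsTable(target_table)
    )
```

Separating the query text from the Spark call keeps the filter logic easy to review and to adjust if the real schema uses different column names.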

bootstrap_actions.sh

This script is required for the EMR bootstrap actions.

Description

It installs the packages and dependencies on the cluster that the Spark scripts need in order to run.

Deployment

  • This bootstrap script is uploaded manually to an S3 bucket.
  • The Airflow DAG references this S3 location in its bootstrap actions configuration, so the cluster runs the script from that location.

Business Analysis

The final processed table contains crime type details for all crimes for which no arrest has been made yet. The analysis can be queried from Athena and has also been imported into QuickSight SPICE to view and compare the different types of crimes.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph. Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud