Processing NYC Taxi Data using a PySpark ETL pipeline

Description

This project extracts, transforms, and loads large amounts of data from the NYC Taxi Rides dataset (hosted on AWS S3). It reads large CSV files (~2 GB per month) and applies transformations such as data type conversions and dropping unneeded rows and columns. Finally, the data is written back in Parquet format. This saves time in downstream tasks such as machine learning. It also saves a large amount of space (~97% reduction from CSV to Parquet), which makes the data easier to store for those tasks.
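The extract, transform, and load steps described above can be sketched roughly as follows. This is a minimal illustration and not the code in jobs/etl_tasks.py or jobs/transformations.py: the column names, the target types, the dropped column, the row filter, and the helper names parquet_path_for and run_etl are all assumptions chosen for the example. In the project itself such settings would presumably come from configs/cols_config.json.

```python
import posixpath

# Hypothetical column settings, used only for this sketch.
CASTS = {
    "tpep_pickup_datetime": "timestamp",
    "passenger_count": "int",
    "trip_distance": "double",
    "fare_amount": "double",
}
DROP_COLS = ["store_and_fwd_flag"]


def parquet_path_for(csv_path: str, out_dir: str) -> str:
    """Map a raw CSV path to a Parquet output path with the same base name."""
    stem = posixpath.splitext(posixpath.basename(csv_path))[0]
    return posixpath.join(out_dir, stem + ".parquet")


def run_etl(csv_path: str, out_dir: str) -> None:
    # Imported inside the function so the path helper above can be
    # used (and tested) on a machine without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("nyc-taxi-etl-sketch").getOrCreate()

    # Extract: read every column as a string, then cast explicitly.
    df = spark.read.csv(csv_path, header=True, inferSchema=False)

    # Transform: convert data types, drop unneeded columns, drop unusable rows.
    for name, dtype in CASTS.items():
        df = df.withColumn(name, F.col(name).cast(dtype))
    df = df.drop(*DROP_COLS)
    df = df.dropna(subset=list(CASTS)).filter(F.col("trip_distance") > 0)

    # Load: write the cleaned data as Parquet.
    df.write.mode("overwrite").parquet(parquet_path_for(csv_path, out_dir))
    spark.stop()
```

Reading the CSV without schema inference and casting explicitly avoids a second pass over a ~2 GB file, which is one common reason to keep the column types in a config file.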

How to use it (Using GCP as the cloud service of choice)

Set up a bucket on Google Cloud Storage
Use get_raw_data.sh to download the raw data from S3 as CSV files into the GCS bucket
Set up a GCP Dataproc cluster
SSH into the master node and copy the entire project folder to the persistent disk
Edit the application's configuration file
Submit the job: spark-submit main.py --filename [raw_data_filename], or run submit_job.sh with the appropriate arguments
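To illustrate how the --filename argument and the configuration file might fit together, here is a small sketch of an entry point. It is not the project's main.py: the --config flag, the config keys bucket, raw_prefix, and clean_prefix, and the function names are hypothetical and only show the general shape.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse the command-line arguments passed after main.py."""
    parser = argparse.ArgumentParser(description="NYC taxi ETL job (sketch)")
    parser.add_argument("--filename", required=True,
                        help="name of the raw CSV file in the bucket")
    # Hypothetical flag: the README only documents --filename.
    parser.add_argument("--config", default="configs/app_config.json",
                        help="path to the application config file")
    return parser.parse_args(argv)


def load_config(path):
    """Read the JSON configuration file into a dict."""
    with open(path) as fh:
        return json.load(fh)


def resolve_paths(config, filename):
    """Build input and output locations from assumed config keys."""
    base = "gs://" + config["bucket"]
    return {
        "input": f"{base}/{config['raw_prefix']}/{filename}",
        "output": f"{base}/{config['clean_prefix']}",
    }
```

With a config such as {"bucket": "my-bucket", "raw_prefix": "raw", "clean_prefix": "clean"}, calling resolve_paths(config, "trips.csv") would give the input gs://my-bucket/raw/trips.csv and the output gs://my-bucket/clean.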

Project structure

root/
|---bash/
    |---create_cluster.sh
    |---install.sh
|---configs/
    |---app_config.json
    |---cols_config.json
|---jobs/
    |---etl_tasks.py
    |---transformations.py
|---get_raw_data.sh
|---main.py
|---requirements.txt
|---submit_job.sh

A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Owner

Unnikrishnan
