This is a repo documenting the best practices in PySpark.

Last update: Dec 25, 2022

Spark-Syntax

This is a public repo documenting the "best practices" for writing PySpark code, based on what I have learnt from working with PySpark for 3 years. It focuses mainly on the Spark DataFrames and SQL library.

You can also visit ericxiao251.github.io/spark-syntax/ for an online book version.

Contributing/Topic Requests

If you notice any possible improvements (typos, spelling, grammar, etc.), feel free to create a PR and I'll review it 😁. You'll most likely be right.

If you have any topics that I could potentially go over, please create an issue and describe the topic. I'll try my best to address it 😁.

Acknowledgement

Huge thanks to Levon for turning everything into a GitBook. You can follow him on GitHub at https://github.com/tumregels.

Table of Contents:

Chapter 1 - Getting Started with Spark:

1.1 - Useful Material
1.2 - Creating your First DataFrame
1.3 - Reading your First Dataset
1.4 - More Comfortable with SQL?

Chapter 2 - Exploring the Spark APIs:

2.1 - Non-Trivial Data Structures in Spark
- 2.1.1 - Struct Types (`StructType`)
- 2.1.2 - Arrays and Lists (`ArrayType`)
- 2.1.3 - Maps and Dictionaries (`MapType`)
- 2.1.4 - Decimals and Why did my Decimals overflow :( (`DecimalType`)
2.2 - Performing your First Transformations
- 2.2.1 - Looking at Your Data (`collect`/`head`/`take`/`first`/`toPandas`/`show`)
- 2.2.2 - Selecting a Subset of Columns (`drop`/`select`)
- 2.2.3 - Creating New Columns and Transforming Data (`withColumn`/`withColumnRenamed`)
- 2.2.4 - Constant Values and Column Expressions (`lit`/`col`)
- 2.2.5 - Casting Columns to a Different Type (`cast`)
- 2.2.6 - Filtering Data (`where`/`filter`/`isin`)
- 2.2.7 - Equality Statements in Spark and Comparisons with Nulls (`isNotNull()`/`isNull()`)
- 2.2.8 - Case Statements (`when`/`otherwise`)
- 2.2.9 - Filling in Null Values (`fillna`/`coalesce`)
- 2.2.10 - Spark Functions aren't Enough, I Need my Own! (`udf`/`pandas_udf`)
- 2.2.11 - Unionizing Multiple Dataframes (`union`)
- 2.2.12 - Performing Joins (clean one) (`join`)
2.3 - More Complex Transformations
- 2.3.1 - One to Many Rows (`explode`)
- 2.3.2 - Range Join Conditions (WIP) (`join`)
2.4 - Potential Performance Boosting Functions
- 2.4.1 - (`repartition`)
- 2.4.2 - (`coalesce`)
- 2.4.3 - (`cache`)
- 2.4.4 - (`broadcast`)

Chapter 3 - Aggregates:

3.1 - Clean Aggregations
3.2 - Non Deterministic Behaviours

Chapter 4 - Window Objects:

4.1 - Default Ordering on a Window Object
4.2 - Ordering High Frequency Data with a Window Object

Chapter 5 - Error Logs:

Chapter 6 - Understanding Spark Performance:

6.1 - Primer to Understanding Your Spark Application
- 6.1.1 - Understanding how Spark Works
- 6.1.2 - Understanding the SparkUI
- 6.1.3 - Understanding how the DAG is Created
- 6.1.4 - Understanding how Memory is Allocated
6.2 - Analyzing your Spark Application
- 6.2.1 - Looking for Skew in a Stage
- 6.2.2 - Looking for Skew in the DAG
- 6.2.3 - How to Determine the Number of Partitions to Use
6.3 - How to Analyze the Skew of Your Data

Chapter 7 - High Performance Code:

7.0 - The Types of Join Strategies in Spark
- 7.0.1 - You got a Small Table? (`Broadcast Join`)
- 7.0.2 - The Ideal Strategy (`BroadcastHashJoin`)
- 7.0.3 - The Default Strategy (`SortMergeJoin`)
7.1 - Improving Joins
- 7.1.1 - Filter Pushdown
- 7.1.2 - Joining on Skewed Data (Null Keys)
- 7.1.3 - Joining on Skewed Data (High Frequency Keys I)
- 7.1.4 - Joining on Skewed Data (High Frequency Keys II)
- 7.1.5 - Join Ordering
7.2 - Repeated Work on a Single Dataset (`caching`)
- 7.2.1 - caching layers
7.3 - Spark Parameters
- 7.3.1 - Running Multiple Spark Applications at Scale (`dynamic allocation`)
- 7.3.2 - The magical number `2001` (`partitions`)
- 7.3.3 - Using a lot of `UDF`s? (`python memory`)

7. - Bloom Filters :o?

Owner

Eric Xiao
