Narrow vs. Wide Transformations in PySpark
Introduction
One of the most fundamental concepts in Apache Spark is the distinction between narrow and wide transformations. Understanding this difference is critical for writing performant distributed data pipelines.
In short:
- Narrow transformations operate on data within a single partition — no data shuffling needed.
- Wide transformations require data to be redistributed across partitions — triggering a shuffle, which is expensive.
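The distinction can be illustrated without Spark at all. The sketch below is a plain-Python analogy (hypothetical partition lists, not the Spark API): a narrow operation transforms each partition independently, while a wide operation must regroup records by key, pulling data from every partition.

```python
from collections import defaultdict

# Hypothetical data: three "partitions" of (region, amount) records
partitions = [
    [("east", 120), ("west", 80)],
    [("east", 50), ("west", 200)],
    [("west", 300)],
]

# Narrow: each output partition depends on exactly one input partition
doubled = [[(region, amount * 2) for region, amount in part] for part in partitions]

# Wide: regrouping by key requires records from every input partition
shuffled = defaultdict(list)
for part in partitions:
    for region, amount in part:
        shuffled[region].append(amount)

print(doubled[0])      # stays local: [('east', 240), ('west', 160)]
print(dict(shuffled))  # crosses partitions: {'east': [120, 50], 'west': [80, 200, 300]}
```

In Spark, the second pattern is what forces data over the network between executors; in this toy version it is just a dictionary, but the data-movement shape is the same.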
Narrow Transformations
Narrow transformations are operations where each input partition contributes to at most one output partition. These are fast because Spark can pipeline them within a single stage.
Common examples: `map()`, `filter()`, `select()`, `withColumn()`
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

spark = SparkSession.builder.appName("NarrowExample").getOrCreate()

# Read a dataset of sales transactions
df = spark.read.parquet("s3://data-lake/raw/sales/")

# These are all narrow transformations — no shuffle!
filtered = (
    df
    .filter(col("amount") > 100)
    .select("transaction_id", "customer_id", "amount", "region")
    .withColumn("region_upper", upper(col("region")))
)

filtered.show(5)
```

Because no data needs to move between executors, narrow transformations are extremely efficient and can be chained together in a single stage.
Wide Transformations
Wide transformations are operations where data from multiple input partitions must be combined into output partitions. This triggers a shuffle — Spark writes intermediate data to disk, redistributes it across the network, and then reads it back.
Common examples: `groupBy()`, `join()`, `distinct()`, `repartition()`
GroupBy Aggregation
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as spark_sum, count

spark = SparkSession.builder.appName("WideExample").getOrCreate()

df = spark.read.parquet("s3://data-lake/raw/sales/")

# This is a WIDE transformation — triggers a shuffle
sales_by_region = (
    df
    .groupBy("region")
    .agg(
        spark_sum("amount").alias("total_revenue"),
        count("transaction_id").alias("num_transactions"),
    )
    .orderBy(col("total_revenue").desc())
)

sales_by_region.show()
```

Join Operation
Joins are arguably the most common wide transformation in real-world data pipelines. A sort-merge join (Spark's default for large datasets) requires both DataFrames to be shuffled and sorted by the join key.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JoinExample").getOrCreate()

# Two datasets
sales_df = spark.read.parquet("s3://data-lake/raw/sales/")
customers_df = spark.read.parquet("s3://data-lake/raw/customers/")

# Join is a WIDE transformation — both sides get shuffled by customer_id
enriched = (
    sales_df
    .join(customers_df, on="customer_id", how="inner")
    .select(
        "transaction_id",
        "customer_id",
        "customer_name",
        "region",
        "amount",
    )
)

enriched.show(10)
```

Performance Implications
| Aspect | Narrow | Wide |
|---|---|---|
| Shuffle | ❌ No | ✅ Yes |
| Network I/O | Minimal | Heavy |
| Stage boundaries | Same stage | New stage |
| Disk writes | None | Intermediate data |
| Scalability | Linear | Depends on data skew |
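Shuffle cost is also shaped by configuration. A commonly tuned setting is `spark.sql.shuffle.partitions`, which controls how many partitions Spark produces on the reduce side of a shuffle (default: 200). A minimal sketch, assuming a fresh session; the config names are real Spark settings, but the values are illustrative, not recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ShuffleTuning")
    # Number of reduce-side partitions created after a shuffle
    .config("spark.sql.shuffle.partitions", "64")
    # Adaptive Query Execution can coalesce small shuffle partitions at runtime
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)
```

Too many shuffle partitions means per-task overhead on tiny chunks; too few means large tasks and spilling. With AQE enabled, Spark adjusts the post-shuffle partition count for you based on observed data sizes.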
Tips for Minimizing Shuffles
- **Broadcast small tables** — Use `broadcast()` for joins where one side is small enough to fit in memory.
- **Pre-partition your data** — If you frequently join on `customer_id`, write your Parquet files partitioned by that key.
- **Use `coalesce()` instead of `repartition()`** — `coalesce()` is a narrow transformation that reduces partitions without a full shuffle.
- **Filter early** — Push filters as close to the data source as possible to reduce the volume of data entering a shuffle.
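The broadcast idea in the first tip can be pictured without Spark: the small table is copied whole to every executor, so the large side joins locally and never moves over the network. Below is a plain-Python sketch of that pattern with hypothetical data (in PySpark itself you would wrap the small DataFrame in `pyspark.sql.functions.broadcast`):

```python
# Large side: sales records, split across "partitions"
# Each record is (transaction_id, customer_id, amount)
sales_partitions = [
    [(1, "c1", 120.0), (2, "c2", 80.0)],
    [(3, "c1", 50.0)],
]

# Small side: customer lookup, small enough to copy ("broadcast") to every partition
customers = {"c1": "Alice", "c2": "Bob"}

# Each partition joins locally against its own copy, so the large side never moves
enriched = [
    [(txn, cust, customers[cust], amt) for txn, cust, amt in part]
    for part in sales_partitions
]

print(enriched[1])  # [(3, 'c1', 'Alice', 50.0)]
```

This only works when the small side genuinely fits in each executor's memory; past that point, Spark falls back to a shuffle-based join.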
Conclusion
Every time you see a `groupBy()`, `join()`, or `distinct()` in your Spark code, remember: that's a shuffle boundary. The Spark UI will show you exactly where these occur in the DAG visualization.
Understanding the narrow vs. wide distinction is the first step toward writing Spark jobs that run in minutes instead of hours.