By Pratik Mane · 6 min read

Narrow vs. Wide Transformations in PySpark

Introduction

One of the most fundamental concepts in Apache Spark is the distinction between narrow and wide transformations. Understanding this difference is critical for writing performant distributed data pipelines.

In short:

  • Narrow transformations operate on data within a single partition — no data shuffling needed.
  • Wide transformations require data to be redistributed across partitions — triggering a shuffle, which is expensive.

Narrow Transformations

Narrow transformations are operations where each input partition contributes to at most one output partition. These are fast because Spark can pipeline them within a single stage.

Common examples: map(), filter(), select(), withColumn()

narrow_transforms.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper
 
spark = SparkSession.builder.appName("NarrowExample").getOrCreate()
 
# Read a dataset of sales transactions
df = spark.read.parquet("s3://data-lake/raw/sales/")
 
# These are all narrow transformations — no shuffle!
filtered = (
    df
    .filter(col("amount") > 100)
    .select("transaction_id", "customer_id", "amount", "region")
    .withColumn("region_upper", upper(col("region")))
)
 
filtered.show(5)

Because no data needs to move between executors, narrow transformations are extremely efficient and can be chained together in a single stage.
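Pipelining can be pictured without Spark at all. The following is a minimal pure-Python sketch (the partition lists, row dicts, and `process_partition` helper are illustrative, not Spark APIs): each partition is processed independently, with filter, select, and the derived column fused into one pass.

```python
# Pure-Python sketch: narrow transformations run independently per partition.
# Each partition is just a list of rows; no data crosses partition boundaries.
partitions = [
    [{"transaction_id": 1, "amount": 250, "region": "east"}],
    [{"transaction_id": 2, "amount": 80, "region": "west"},
     {"transaction_id": 3, "amount": 120, "region": "west"}],
]

def process_partition(rows):
    """filter -> select -> withColumn, fused into a single pass."""
    out = []
    for row in rows:
        if row["amount"] > 100:                      # filter
            out.append({                             # select + withColumn
                "transaction_id": row["transaction_id"],
                "amount": row["amount"],
                "region_upper": row["region"].upper(),
            })
    return out

# Each partition is processed on its own; results stay in place.
result = [process_partition(p) for p in partitions]
```

Because nothing here depends on another partition, Spark can run all partitions in parallel within one stage.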

Wide Transformations

Wide transformations are operations where data from multiple input partitions must be combined into output partitions. This triggers a shuffle — Spark writes intermediate data to disk, redistributes it across the network, and then reads it back.

Common examples: groupBy(), join(), distinct(), repartition()
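The redistribution step itself is essentially hash partitioning: each row is routed to the partition that owns its key. Here is a simplified pure-Python sketch of that idea (the function name, row dicts, and partition count are illustrative; Spark's actual partitioner and serialization are far more involved):

```python
# Pure-Python sketch: how a shuffle routes rows to output partitions by key.
def shuffle_by_key(rows, key, num_partitions):
    """Each row goes to partition hash(key) % num_partitions, so all rows
    with the same key end up together and can be aggregated locally."""
    output = [[] for _ in range(num_partitions)]
    for row in rows:
        target = hash(row[key]) % num_partitions
        output[target].append(row)
    return output

rows = [
    {"region": "east", "amount": 100},
    {"region": "west", "amount": 50},
    {"region": "east", "amount": 75},
]
partitions = shuffle_by_key(rows, "region", num_partitions=4)

# All "east" rows land in the same output partition.
east = [p for p in partitions if any(r["region"] == "east" for r in p)]
```

This is why a shuffle is expensive: in a real cluster, every one of those row movements is a network transfer, preceded by a write to local disk on the sending executor.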

GroupBy Aggregation

wide_groupby.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as spark_sum, count
 
spark = SparkSession.builder.appName("WideExample").getOrCreate()
 
df = spark.read.parquet("s3://data-lake/raw/sales/")
 
# This is a WIDE transformation — triggers a shuffle
sales_by_region = (
    df
    .groupBy("region")
    .agg(
        spark_sum("amount").alias("total_revenue"),
        count("transaction_id").alias("num_transactions"),
    )
    .orderBy(col("total_revenue").desc())
)
 
sales_by_region.show()

Join Operation

Joins are arguably the most common wide transformation in real-world data pipelines. A sort-merge join (Spark's default for large datasets) requires both DataFrames to be shuffled and sorted by the join key.

wide_join.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
 
spark = SparkSession.builder.appName("JoinExample").getOrCreate()
 
# Two datasets
sales_df = spark.read.parquet("s3://data-lake/raw/sales/")
customers_df = spark.read.parquet("s3://data-lake/raw/customers/")
 
# Join is a WIDE transformation — both sides get shuffled by customer_id
enriched = (
    sales_df
    .join(customers_df, on="customer_id", how="inner")
    .select(
        "transaction_id",
        "customer_id",
        "customer_name",
        "region",
        "amount",
    )
)
 
enriched.show(10)
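After both sides have been shuffled and sorted by the join key, the merge phase is a single linear pass. The following pure-Python sketch shows that merge for the simple case of unique keys on the right side (function and variable names are illustrative, not Spark internals):

```python
# Pure-Python sketch of the merge phase of a sort-merge join, assuming
# unique keys on the right side. Both inputs are sorted by the join key,
# then merged with two cursors in one pass.
def merge_join(left, right, key):
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk == rk:
            out.append({**left[i], **right[j]})  # matched: emit joined row
            i += 1
        elif lk < rk:
            i += 1                               # left key too small: advance
        else:
            j += 1                               # right key too small: advance
    return out

sales = [{"customer_id": 2, "amount": 50}, {"customer_id": 1, "amount": 100}]
customers = [{"customer_id": 1, "customer_name": "Ada"},
             {"customer_id": 3, "customer_name": "Bo"}]
joined = merge_join(sales, customers, "customer_id")
```

The sort is what the shuffle pays for: once each partition holds a sorted, co-located range of keys from both sides, the merge itself is cheap.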

Performance Implications

Aspect              Narrow        Wide
Shuffle             ❌ No         ✅ Yes
Network I/O         Minimal       Heavy
Stage boundaries    Same stage    New stage
Disk writes         None          Intermediate data
Scalability         Linear        Depends on data skew

Tips for Minimizing Shuffles

  1. Broadcast small tables — Use broadcast() for joins where one side is small enough to fit in memory.
  2. Pre-partition your data — If you frequently join on customer_id, write your Parquet files partitioned by that key.
  3. Use coalesce() instead of repartition() — coalesce() is a narrow transformation that reduces partitions without a full shuffle.
  4. Filter early — Push filters as close to the data source as possible to reduce the volume of data entering a shuffle.
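Tip 1 works because a broadcast (map-side hash) join copies the small table to every executor instead of shuffling the large one; in PySpark you hint this with pyspark.sql.functions.broadcast() on the small DataFrame. A pure-Python sketch of the mechanism (names and data are illustrative):

```python
# Pure-Python sketch of a broadcast (map-side hash) join: the small table is
# copied everywhere as a dict, so the large table's partitions never move.
small_table = {1: {"customer_name": "Ada"}, 2: {"customer_name": "Bo"}}

# Partitions of the large table stay exactly where they are.
sales_partitions = [
    [{"customer_id": 1, "amount": 100}],
    [{"customer_id": 2, "amount": 50}, {"customer_id": 9, "amount": 75}],
]

def map_side_join(rows, lookup):
    """Inner join against the broadcast copy: a local dict lookup per row,
    with no shuffle of the large side."""
    return [{**row, **lookup[row["customer_id"]]}
            for row in rows if row["customer_id"] in lookup]

joined_partitions = [map_side_join(p, small_table) for p in sales_partitions]
```

The trade-off is memory: the small table must fit comfortably on every executor, which is why this only applies when one side is genuinely small.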

Conclusion

Every time you see a groupBy(), join(), or distinct() in your Spark code, remember: that's a shuffle boundary. The Spark UI will show you exactly where these occur in the DAG visualization.

Understanding the narrow vs. wide distinction is the first step toward writing Spark jobs that run in minutes instead of hours.

About the Author

Pratik Mane is a Senior Data Engineer specializing in Azure Data Platform and Databricks. He helps enterprises build scalable, high-performance ETL pipelines.