Data Warehousing Fundamentals & Kimball Dimensional Modeling

By Pratik ManeApril 18, 2026

With the rapid rise of modern Lakehouses and cloud-native object storage, it is easy to assume that traditional data warehousing is a relic of the past. But the truth is, while the storage layer has evolved, the logic of how we structure data for human understanding hasn't changed. If you want to build scalable, high-performance analytical systems, mastering Kimball’s dimensional modeling isn't just optional, it is the absolute foundation of data engineering.

This is a lengthy post, hoping to cover all the foundational data warehousing concepts.


Lets start by going over basic term and concepts used in data warehousing.

Basic terms and concepts

OLTP vs OLAP

FeatureOLTP (Online Transaction Processing)OLAP (Online Analytical Processing)
PurposeDay-to-day operationsData analysis and decision support
Data ModelHighly normalized (3NF)Dimensional (Star/Snowflake)
OperationsInsert, Update, DeleteSelect (Read-heavy)
UsersClerks, Cashiers, End-usersAnalysts, Executives
Data VolumeSmall, recent transactionsHistorical, large volumes
Response TimeMillisecondsSeconds to minutes
ExamplesBanking, eCommerce appsBI tools, dashboards

Diagram showing Database vs Datawarehouse bs Datamart

What is a Database?

A database is an organized collection of structured information, or data, typically stored electronically in a computer system. It is designed to store, manage, and retrieve data efficiently. It is usually designed to handle OLTP, i.e. for example, backend of application at counter of food store.

Note: We are not talking from hardware perspective, but how when these terms are used in data engineering domain.

What is a Data Warehouse?

A data warehouse is a centralized repository designed to store integrated data from multiple sources for analytical and reporting purposes. It supports historical analysis, business intelligence, and decision-making. Now, this can be a database, a lakehouse in databricks/spark etc. depending on requirements.

What is a Data Mart?

A data mart is a focused subset of data warehouse designed to meet the needs of a specific department or business function. It contains only the data relevant to that department.

Database vs Data Warehouse vs Data Mart

FeatureDatabaseData WarehouseData Mart
PurposeStores current operational dataCentral repository for historical, analytical dataFocused subset of data warehouse
ModelER model (normalized)Dimensional/Star schemaStar schema
UsersApplication users, developersAnalysts, BI professionalsDepartment users
ScopeEnterprise-wide transactionsEnterprise-wide analyticsDepartment-specific (e.g., Sales)
Update FrequencyReal-time or near real-timeBatch/ETL-drivenBatch

Now lets look at the core concepts in datawarehousing and dimensional modelling.

What is a Data Warehouse

Imagine sufficiently large company wants any insights regarding how their comapny is performing (KPIs) or want to dig into some data because something's gone wrong or want to implement some ML usecase. They cannot go and query their EPR/CRMs/APIs or whatever their data source is everytime, this practice won't sustain.

That's why a data warehouse is necessary. As we discussed earlier, its a centralized repository designed to store integrated data from multiple sources for analytical and reporting purposes. It supports historical analysis, business intelligence, and decision-making.

Key Features

Data warehouses are optimized for reading large volumes of data unlike OLTP, becuase once the datawarehouse is in place, its written as per the refresh frequency, but is read quite frequently - by BI reports that get refreshed, ad-hoc queries by business users, downstream jobs. Targeted use cases are like trend analysis, forecasting, and executive dashboards etc.

In a very simple language, data warehouse built by kimball methodology is collection of fact and dimension tables modelled by business on top of which BI reports are built. Modelling in simple terms, is deciding what facts and dimensions should be created and what columns should those tables have.

Data Warehouse vs Lakehouse

Traditional data warehouses were prominently built using RDBMS systems like SQL Server, Oracle DB etc. Modern data wareshousing is done using 'lakehouses', you can find detailed information about it here.

FeatureData WarehouseLakehouse
StorageExpensive DB (SQL Server, Teradata)Cheap object storage (S3, ADLS)
Data typeStructured onlyAll types
CostHighLow
ToolsSSIS, InformaticaSpark, Databricks, ADF
ScalabilityLimited (vertical scaling)Massive (horizontal scaling)

Data Modelling is done in similar way in both; the difference is in implementation

Two Approaches to Data Warehousing

There are two major approaches to data warehousing:

Inmon (Top-Down):

Kimball (Bottom-Up):

Kimball’s method is popular - faster delivery, easier for business users, and good for iterative development

Kimball’s Dimensional Modeling

Dimensional modeling is a technique to structure data to be intuitive and performant for querying.

Two key components:

Goal: Create fact and dimension tables to make the data easy to understand and efficient to analyze.

High-Level Kimball Process

  1. Identify Business Process: Example: Orders, Payments, Shipments, Inventory -> This defines the scope of your fact table

    Don't just build a generic Fct_Sales table. Ask the stakeholders if they care about the checkout process, the shipping process, or returns. Each distinct business event gets its own fact table. Why? Because the process dictates the physical reality you are modeling. Once you isolate the specific process (e.g., "Customer pays for cart"), the raw columns naturally reveal themselves (transaction ID, payment method, tax amount). If you mix processes like adding shipping dates to a checkout fact - your data model will inevitably become a nightmare of NULL values and tangled logic.

  2. Declare the Grain: What does one row in the fact table represent? Ex: one row per line item of transaction.

    Business users always ask for aggregated data upfront ("I want to see monthly sales"). Push back. The grain is the single most critical decision in your entire warehouse design. If you get it wrong, your data is fundamentally broken for future use cases. If you declare the grain as "daily sales per store," but tomorrow marketing asks for "morning vs. evening sales," you are completely stuck. Always store data at the lowest possible atomic level (like one row per scanned barcode at the register) to future-proof your analytics. You can always aggregate up, but you can never drill down.

  3. Identify Dimensions: Who, what, where, when, how; ex: Customer, Product

    Listen closely for the word "by" in meetings. "I want to see sales by store, by date, by customer". Those are your dims. A robust dimension table shouldn't just have an ID and a Name; it should be incredibly wide, packed with every descriptive attribute (category, sub-category, manager name, region) you can find. This empowers business users to slice and dice the metrics without writing complex SQL or submitting Jira tickets for new columns. Make sure these are conformed—meaning you use the exact same DimCustomer table across all your fact tables to prevent data silos.

  4. Identify Facts (Measures): Numeric, additive metrics

    Keep these purely additive (like Quantity, Sales_Amount). A fact must be true to the grain you just declared. If the grain is one line item on a receipt, the fact is the exact dollar amount for that specific item, not the total receipt amount. Furthermore, strictly separate additive facts from non-additive calculations (like Profit_Margin). Calculate percentages on the fly in Power BI or your semantic layer. This guarantees that when a CEO looks at a high-level dashboard and an analyst looks at a granular report, the underlying numbers always match perfectly.

  5. Design Star Schema (Logical Model): Central fact table + surrounding dimensions

    Grab a whiteboard. Put the fact table in the middle and draw lines out to your dims. Do not skip this step and jump straight into coding. This diagram is your contract with the business. When stakeholders see how the central process connects to the entities they care about, they will immediately spot missing requirements. ("Wait, where is the promotion code?") Fixing it on the whiteboard takes five minutes; fixing it in production takes three weeks.*

  6. Define Keys & SCD Strategy: Surrogate keys, SCD

    Never trust source system IDs (Natural Keys). Systems change and IDs get recycled, which will break your joins. Generate integer Surrogate Keys. Also, ask the hard questions: "If a user moves from Pune to Mumbai, do you want their past sales to show as Pune or Mumbai?" SCD Type 2 (history tracking) is what makes a warehouse vastly superior to a simple database replica, allowing true point-in-time reporting. But it is expensive to compute. Apply Type 2 only to critical attributes (like Region) and stick to Type 1 (overwrite) for trivial attributes (like a phone number).

  7. Source-to-Target Mapping: Column-level mapping, transformations, business rules

    This is the unglamorous but necessary reality of data engineering. Source systems are usually a mess of obscure column names (like usr_fld_12) and inconsistent data types. Your mapping document is the blueprint that translates that chaos into your pristine star schema. It is also where you define hard rules for data anomalies—what happens when a sale arrives without a valid customer ID? (Hint: you assign a default "-1" Unknown Customer SK).

  8. Physical Model Design: Data types, Partitioning, Indexing (if warehouse), file formats

    A logical star schema is useless if it's slow. The logical model tells you what the data means; the physical model dictates how fast you can read it. Are you on Synapse? Databricks? Fabric? If you partition a massive table by Order_Date but 90% of your queries filter by Customer_ID, performance will tank. Figure out your partitioning strategy, file formats (Parquet/Delta), and things like Z-Ordering or Liquid Clustering to minimize data scanning.

  9. ETL / ELT Design

    Build the automated pipelines (ADF, Databricks, Airflow). Modern ELT emphasizes doing the heavy lifting within the warehouse (using tools like dbt) rather than on external servers. Focus heavily on modularity and idempotency here. If the API pulling exchange rates goes down, it shouldn't crash the pipeline loading customer addresses. Build resilient DAGs that gracefully handle failures, retry automatically, and ensure your dependencies are set so Dims always load before Facts.

  10. Build & Validate

What is Normalization and Denormalization?

Diagram showing Normalization vs Denormalization

To understand why we design data warehouses the way we do, look at the visual above.

Before Normalization (Flat Table) Notice the redundancy: Student "Abhi" and Course "DBMS" are repeated. In an application database (OLTP), this is dangerous. If the instructor for DBMS changes, the application must update multiple rows. If one update fails, the database is corrupted with conflicting data.

After Normalization We fix this by splitting the data into distinct entities (dim_student and dim_course) connected by a central mapping table. Now, if an instructor changes, we update exactly one row in the dim_course table. Zero risk of data corruption.

Software engineers achieve this normalized state by following three core rules:

The Data Engineering Reality (Denormalization) Strict 3NF is the gold standard for writing data safely in OLTP systems. However, Data Warehouses (OLAP) are built for reading data. Joining multiple normalized tables across a billion rows requires heavy network shuffling, which is incredibly slow.

Therefore, Data Engineers intentionally denormalize the data. We trade storage space, reintroducing controlled redundancy to build Star Schemas or One Big Tables (OBT) that allow massive analytical queries to return in milliseconds.

Normalization is critical in transactional systems to avoid data redundancy & prevent update/insert/delete anomaly

Normalization

Denormalization

Why Denormalize in Data Warehousing?

Fact Tables

Store measurable, quantitative data Typically contain:

Types of Fact Tables:

You can read more about the different types here

Granularity

Dimension Tables

Contain descriptive attributes that provide context to facts Example: Customer, Product, Time, Georgraphy

Characteristics:

Star Schema

Diagram showing Star Schema

A simple, de-normalized structure

Pros:

Notice the direction of arrows, they flow from dimension to fact. This signifies that data from dimension table will filter the data in fact table in BI reports (ex: PowerBI), never the other way around. This also tells us dimension tables primarily serve the purpose of slicing and dicing data in fact tables in reporting.

Snowflake Schema

Diagram showing Snowflake Schema

A normalized variant of the star schema

Pros:

Cons:

It saves a tiny bit of storage space, but it forces your database to perform three separate JOINs just to find out the Category of a sold item. That extra "hop" is exactly why it looks like a snowflake on a diagram, and exactly why it runs slower in an analytical data warehouse.

Slowly Changing Dimensions (SCD)

Note: Type 2 is most common—when customer address changes, we add a new row to track history.

When to use SCD 1 & SCD 2?

SCD1 if business only cares about latest value

SCD2 if historical tracking is needed.

Real Scenario: If a customer moves from Pune to Mumbai, should old sales show Pune or Mumbai? Regulatory / Financial Impact? -> If YES → almost always Type 2

Types of Keys in Dimensional Modeling

Key TypePurposeExample
Primary KeyUniquely identifies a record in a tableCustomer_ID in Customer table
Foreign KeyReferences primary key in another tableCustomer_ID in Sales fact table
Surrogate KeyArtificial unique identifier (e.g., integer)Product_SK (auto-incremented)
Natural KeyReal-world identifier from source systemsSSN, Email
Unique KeyEnforces uniqueness on a column or set of columnsEmail in Customer table
Alternate KeyAny candidate key not selected as primary keyNational_ID if Customer_ID is primary

Note: Surrogate keys are preferred in dimensional modeling: avoid source system dependencies and simplify SCD handling. Natural keys can change (e.g., phone number), which makes them unstable.

Surrogate keys

Best Practices in Dimensional Modeling

Resources & Further Reading


Interview Questions on Data Warehousing

If you are preparing for a Data Engineering interview, expect to be tested heavily on these concepts. Here are a few advanced questions to practice:

1. When would you choose to build a Snowflake schema instead of a Star schema?

2. Explain the exact difference between SCD Type 1 and Type 2. When would a business mandate Type 2?

3. Why is it dangerous to use Natural Keys directly in your Fact Tables?


TL;DR / Summary

At its core, a Data Warehouse exists to turn chaotic operational data into structured analytical insights.

While OLTP systems use highly normalized models to handle fast day-to-day transactions safely, Data Warehouses use Dimensional Modeling (usually the Kimball Star Schema) to optimize for massive, read-heavy analytical queries. By establishing a clear grain, decoupling facts from dimensions, utilizing Surrogate Keys, and properly tracking historical changes via SCDs, we build data structures that are incredibly intuitive for analysts and highly performant for BI tools.

Pratik Mane

About the Author

Pratik Mane is a Data Architect and Engineer specializing in Azure Data Platform and Databricks. He helps enterprises build scalable, high-performance ETL pipelines.