Table of Contents
Written with a little help from AI, reviewed and edited by a human.
When designing a database, you need to decide how to organize your data into tables. Should everything go in one giant table? Should you split it into many small tables? Normalization is a set of rules that help you design databases that are efficient, consistent, and easy to maintain.
Real life example
Imagine you’re organizing a library. You could write all the information about each book on a single card:
Card: Harry Potter
Author: J.K. Rowling
Author Email: jk@example.com
Author Country: UK
Publisher: Bloomsbury
Publisher Address: 50 Bedford Square, London
Category: Fantasy
This works, but what if J.K. Rowling writes 50 books? You’d write her email and country 50 times. If her email changes, you’d have to update 50 cards.
Instead, you’d create separate index cards:
- Books card: Book ID, Title, Author ID, Publisher ID, Category
- Authors card: Author ID, Name, Email, Country
- Publishers card: Publisher ID, Name, Address
Now each piece of information is stored once. If an author’s email changes, you update one card, not 50.
That’s what normalization does — it organizes data so you don’t repeat information unnecessarily.
What is normalization?
Normalization is the process of organizing data to:
- Eliminate redundancy (don’t store the same data multiple times)
- Ensure data integrity (changes in one place don’t leave other places out-of-date)
- Make updates easier (change once instead of changing many times)
There are several “levels” of normalization called normal forms (1NF, 2NF, 3NF, etc.). Each level follows stricter rules.
For most applications, reaching 3NF (Third Normal Form) is enough.
The problem: a denormalized table
Let’s say you have a table storing orders:
| order_id | customer_name | customer_email | product_name | product_price | quantity |
|---|---|---|---|---|---|
| 1 | Alice | alice@example.com | Laptop | 1000 | 1 |
| 2 | Alice | alice@example.com | Mouse | 25 | 2 |
| 3 | Bob | bob@example.com | Laptop | 1000 | 1 |
| 4 | Alice | alice@example.com | Keyboard | 50 | 1 |
This works, but notice the problems:
- Alice’s name and email are repeated 3 times. If her email changes, you have to update 3 rows.
- Laptop’s price is stored twice. If the price changes, you have to update multiple rows.
- If you delete order 1, you don’t lose Alice’s information, but if you delete orders 2 and 4 as well, Alice disappears entirely from the database even though she’s still a customer.
These issues are called anomalies, and normalization fixes them.
First Normal Form (1NF)
Rule: Each column must contain atomic (indivisible) values, and each row must be unique.
Bad example:
| order_id | customer_name | products |
|---|---|---|
| 1 | Alice | Laptop, Mouse |
| 2 | Bob | Keyboard, Monitor |
The products column contains multiple values (Laptop and Mouse). This violates 1NF.
Fixed (1NF):
| order_id | customer_name | product |
|---|---|---|
| 1 | Alice | Laptop |
| 1 | Alice | Mouse |
| 2 | Bob | Keyboard |
| 2 | Bob | Monitor |
Now each column contains a single value, and each row is unique.
Second Normal Form (2NF)
Rule: Must be in 1NF, and every non-key column must depend on the entire primary key, not just part of it.
This only matters when you have a composite primary key (a primary key made of multiple columns).
Bad example:
| order_id | product_id | customer_name | product_name |
|---|---|---|---|
| 1 | 101 | Alice | Laptop |
| 2 | 102 | Bob | Mouse |
The primary key is (order_id, product_id). But:
customer_namedepends only onorder_id, notproduct_idproduct_namedepends only onproduct_id, notorder_id
This violates 2NF because non-key columns depend on only part of the primary key.
Fixed (2NF):
Split into separate tables:
orders
| order_id | customer_name |
|---|---|
| 1 | Alice |
| 2 | Bob |
order_items
| order_id | product_id |
|---|---|
| 1 | 101 |
| 2 | 102 |
products
| product_id | product_name |
|---|---|
| 101 | Laptop |
| 102 | Mouse |
Now each non-key column depends on the full primary key of its table.
Third Normal Form (3NF)
Rule: Must be in 2NF, and no non-key column should depend on another non-key column (called transitive dependency).
Bad example:
| order_id | customer_name | customer_email | customer_country |
|---|---|---|---|
| 1 | Alice | alice@example.com | USA |
| 2 | Bob | bob@example.com | UK |
customer_country depends on customer_name, not directly on order_id. This is a transitive dependency and violates 3NF.
Fixed (3NF):
Split into:
orders
| order_id | customer_id |
|---|---|
| 1 | 10 |
| 2 | 20 |
customers
| customer_id | name | country | |
|---|---|---|---|
| 10 | Alice | alice@example.com | USA |
| 20 | Bob | bob@example.com | UK |
Now customer details are stored once in the customers table, and orders reference the customer by customer_id.
Our original example, normalized
Let’s normalize the messy orders table from earlier:
Original:
| order_id | customer_name | customer_email | product_name | product_price | quantity |
|---|---|---|---|---|---|
| 1 | Alice | alice@example.com | Laptop | 1000 | 1 |
| 2 | Alice | alice@example.com | Mouse | 25 | 2 |
Normalized (3NF):
customers
| customer_id | name | |
|---|---|---|
| 1 | Alice | alice@example.com |
| 2 | Bob | bob@example.com |
products
| product_id | name | price |
|---|---|---|
| 101 | Laptop | 1000 |
| 102 | Mouse | 25 |
| 103 | Keyboard | 50 |
orders
| order_id | customer_id | order_date |
|---|---|---|
| 1 | 1 | 2026-02-20 |
| 2 | 1 | 2026-02-21 |
| 3 | 2 | 2026-02-22 |
order_items
| order_id | product_id | quantity |
|---|---|---|
| 1 | 101 | 1 |
| 2 | 102 | 2 |
| 3 | 101 | 1 |
Now:
- Customer info is stored once in
customers - Product info is stored once in
products - Each order references a customer
- Each order item references an order and a product
If Alice’s email changes, you update one row. If the laptop price changes, you update one row. No redundancy.
Benefits of normalization
Eliminates redundancy: Data is stored once, not repeated.
Easier updates: Change data in one place, not many places.
Data consistency: No risk of conflicting data (e.g., two different emails for the same customer).
Smaller database size: Less repeated data = less storage.
Denormalization — When to break the rules
Sometimes normalization goes too far. If you have to JOIN 10 tables to get simple data, queries can be slow.
Denormalization means intentionally adding redundancy to improve performance.
For example, you might store a customer’s name in the orders table even though it’s also in the customers table, so you don’t have to JOIN every time you want to show an order.
Use denormalization carefully, only when performance is a real problem.
Summary
- Normalization organizes data to eliminate redundancy and ensure consistency
- 1NF: No repeating groups, atomic values
- 2NF: No partial dependencies (non-key columns depend on the full primary key)
- 3NF: No transitive dependencies (non-key columns don’t depend on other non-key columns)
- Normalized databases are easier to maintain and update
- Sometimes denormalization is okay for performance, but use it sparingly
For most applications, aim for 3NF — it strikes a good balance between data integrity and simplicity.