Skip to content
6 min read·Lesson 2 of 10

Data Modeling and Normalization

Learn how to design tables: choosing keys, picking data types, expressing relationships, and the normalization rules that keep data consistent.

Before you write a single SQL query, you have to design the tables. Good schema design is the difference between a database that scales gracefully and one that fights you for every feature.

Entities and Relationships

Start with the real-world things your app cares about — these are entities. For a blog: users, posts, comments, tags. Each entity becomes a table; each instance of an entity becomes a row.

Then think about how they relate:

  • One-to-many: a user has many posts; a post belongs to one user. The "many" side carries a foreign key.
  • Many-to-many: a post has many tags; a tag is on many posts. Use a third join table.
  • One-to-one: rare; usually a sign two tables should be one.

A Worked Example

CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE posts (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    title VARCHAR(200) NOT NULL,
    body TEXT NOT NULL,
    published_at TIMESTAMPTZ
);

CREATE TABLE tags (
    id BIGSERIAL PRIMARY KEY,
    name VARCHAR(50) UNIQUE NOT NULL
);

CREATE TABLE post_tags (
    post_id BIGINT NOT NULL REFERENCES posts(id) ON DELETE CASCADE,
    tag_id BIGINT NOT NULL REFERENCES tags(id) ON DELETE CASCADE,
    PRIMARY KEY (post_id, tag_id)
);

Things to notice:

  • id BIGSERIAL PRIMARY KEY — auto-incrementing surrogate primary key.
  • UNIQUE constraint on email prevents duplicates.
  • REFERENCES users(id) creates a foreign key. ON DELETE CASCADE removes child rows when the parent is deleted.
  • NOT NULL on required columns — be intentional about what's allowed to be missing.
  • The join table post_tags has a composite primary key of both columns — preventing the same tag being added twice.

Primary Keys: Surrogate vs Natural

ApproachExampleTrade-offs
Auto-increment integerBIGSERIALSmall, fast, simple. Reveals row count and order.
UUIDuuid_generate_v4() / UUIDv7No coordination across servers; opaque; larger (16 bytes).
Natural keyEmail, ISBN, country codeAvoid for anything that might change.

Most modern designs use surrogate keys (auto-increment or UUIDv7) for the primary key and add unique constraints on natural identifiers separately. UUIDv7 is time-ordered, giving you the best of both worlds for distributed systems.

Foreign Keys

A foreign key declares "this column must reference an existing row in that table." The DB enforces it for you — you cannot insert a post with a user_id that doesn't exist.

ON DELETE options when the referenced row is deleted:

  • CASCADE — delete the dependent rows too
  • RESTRICT / NO ACTION — prevent the parent delete
  • SET NULL — leave the row, set the FK column to NULL
  • SET DEFAULT — set to a default value

Pick deliberately. Comments under a deleted post: cascade. An author of a deleted comment: probably set null and leave the comment as "anonymous".

Data Types: Pick Narrowly

TypeUse for
INT (4 bytes)Counts, small IDs (max ~2 billion)
BIGINT (8 bytes)IDs, anything that may grow large
NUMERIC(p,s) / DECIMALMoney — never use FLOAT for money
VARCHAR(N) / TEXTStrings; in Postgres they're equivalent — pick TEXT and add CHECK constraints if needed
BOOLEANtrue / false
TIMESTAMPTZAlways store time with timezone (UTC under the hood)
DATEDate without time
JSONB (Postgres)Flexible attributes; queryable, indexable
UUIDUniversally unique IDs

Avoid FLOAT/REAL for currency — binary floating point can't represent 0.10 exactly, leading to rounding bugs.

Normalization

Normalization is a set of rules for eliminating redundancy. The first three forms cover almost everything you'll ever need:

1NF — Atomic columns

Each cell holds a single value. Don't store "red,blue,green" in one column — make it three rows in a related table (or a tags array column with explicit semantics).

2NF — No partial dependencies

If you have a composite key, every non-key column must depend on the whole key. Mostly relevant when using composite keys.

3NF — No transitive dependencies

Non-key columns should depend only on the primary key. If you store a customer's country_id and country_name in the orders table, the name depends on the country, not the order — move it to a countries table.

For most apps, "design tables around entities, give each table a primary key, use foreign keys for relationships, don't repeat data" gets you to 3NF without thinking about it.

Denormalization

Sometimes you deliberately duplicate data — a precomputed sum, a cached "post count" on the user table — to avoid expensive queries. This is denormalization.

  • Common in analytical / OLAP databases (a single big "fact" table joined to "dimension" tables)
  • Common in NoSQL where joins are expensive or unavailable
  • Always introduces the risk of data drifting out of sync — be deliberate, document it

Schema Migrations

Schemas evolve. Don't run ALTER TABLE by hand in production. Use a migration tool that records each change as a versioned, reviewable file:

  • Flyway, Liquibase — language-neutral, SQL-first
  • Alembic (Python / SQLAlchemy)
  • Prisma Migrate, Drizzle Kit (Node)
  • Active Record migrations (Rails)

Each migration goes through code review and CI. You can roll forward and backward predictably across environments.

A Quick Sanity Check

Before you create the table, ask:

  1. What are the entities?
  2. What's the primary key for each?
  3. What are the relationships and what do ON DELETE rules should be?
  4. What columns must be NOT NULL?
  5. What needs to be unique beyond the primary key?
  6. Will I query this column with WHERE a lot? — that's a future index.

Spend twenty minutes on these questions and you'll save weeks of refactoring later.

Key Takeaways

  • Identify entities and relationships first; tables and columns fall out of that.
  • Every table should have a primary key; surrogate keys (auto-increment, UUID) are usually best.
  • Foreign keys express relationships; ON DELETE rules define what happens when the parent goes away.
  • Normalize to 3NF for OLTP; denormalize for analytics.
  • Pick types narrowly — INT vs BIGINT, VARCHAR(N) vs TEXT, TIMESTAMPTZ for time.

Test your knowledge

Try exam-style practice questions to reinforce what you've learned.

Practice Questions →