Skip to content
6 min read·Lesson 10 of 10

Backups, Replication, and Database Security

Protect production data: design backups with the right RPO/RTO, replicate for HA, and secure the database with encryption, IAM, and network isolation.

Production databases face three categories of risk: data loss, downtime, and unauthorised access. This lesson covers how to defend against each.

RPO and RTO — Define These First

RPO (Recovery Point Objective)How much data are you willing to lose in a disaster? "5 minutes" means your backup strategy must cover any 5-minute window.
RTO (Recovery Time Objective)How long can you be down? "30 minutes" means your restore process must complete in that time.

Tight RPO/RTO costs money. A consumer game might accept "1 hour / 4 hours"; a payments system might require "0 / under 1 minute". Stakeholders should agree before you design the backup architecture.

Backup Types

  • Full — complete copy of the database. Slow but self-contained.
  • Incremental — only changes since the last backup. Fast; needs a chain to restore.
  • Differential — changes since the last full backup. Middle ground.
  • Logical — SQL dump of schema + data (e.g. pg_dump). Portable but slow.
  • Physical — copy of data files (e.g. pg_basebackup). Fast restore.
  • Snapshot — block-storage snapshot. Instantaneous; cloud-managed services use these.

Point-in-Time Recovery (PITR)

Combine a base backup with the database's transaction log (WAL in Postgres, binlog in MySQL) and you can restore to any moment between them. Managed services do this automatically — RDS, Cloud SQL, Azure SQL Database, Aurora.

# Example: Postgres with pgBackRest
pgbackrest --stanza=main --type=full backup
pgbackrest --stanza=main --type=incr backup     # daily

# Restore to a specific time
pgbackrest --stanza=main \
  --type=time --target='2024-08-15 14:35:00 UTC' \
  restore

Backup Discipline

  • Test restores regularly. A backup you've never restored is wishful thinking.
  • Store off-site / cross-region. A regional outage shouldn't take backups too.
  • Encrypt backups at rest. Same data, same risk.
  • Limit retention. Forever-backups become forever-liabilities under data-protection law.
  • Logical + physical. Logical dumps are slower but survive engine version mismatches and corruption.

Replication

Replication keeps a second copy of the database up to date, used for high availability, read scaling, and disaster recovery.

Synchronous vs Asynchronous

ModeHow it worksTrade-off
SynchronousPrimary waits for replica to acknowledge before committingZero data loss; writes are slower
AsynchronousPrimary commits immediately; replica catches upFast writes; small data-loss risk on failover
Semi-syncWait for at least one replicaCompromise; common in MySQL setups

Topologies

  • Primary / replica — one writer, many read replicas. Most common.
  • Multi-AZ — primary plus a hot standby in another availability zone for HA.
  • Multi-region — replicas in other regions for DR or low-latency global reads.
  • Multi-primary / active-active — multiple writers; complex; conflict resolution required (Galera, group replication, distributed SQL).

Failover

When the primary dies, a replica is promoted. Managed services automate this; with self-hosted setups you need a tool like Patroni, repmgr, or Orchestrator. Time to failover is your RTO budget.

Encryption

At rest

  • Cloud-managed services encrypt by default using KMS-managed keys
  • Self-hosted: enable disk-level encryption (LUKS) or transparent data encryption (TDE)
  • Customer-managed keys (CMK) give you control over rotation and revocation
  • Backups must also be encrypted

In transit

  • TLS for all connections, including from app servers in the same VPC
  • Reject non-TLS connections at the database level (rds.force_ssl=1 on RDS, require_secure_transport=ON on MySQL, ssl=on in postgresql.conf)
  • Verify server certificates from clients (sslmode=verify-full in libpq)

Field-level encryption

For especially sensitive fields (payment card data, government IDs), encrypt the value before storing it — even DBAs can't read it without the key. Watch for the loss of indexability; deterministic or searchable encryption schemes exist but have trade-offs.

Authentication and Authorisation

Authentication

  • Password — basic. Rotate; never share; never commit.
  • IAM auth — RDS and Aurora support short-lived tokens via AWS IAM; Cloud SQL via Google IAM. No password to leak.
  • Certificate auth — clients present mTLS certificates.
  • SSO / Kerberos / Active Directory — enterprise environments.

Authorisation: least privilege

-- App user can read and write app data, nothing more
CREATE ROLE app_user LOGIN PASSWORD '...';
GRANT CONNECT ON DATABASE myapp TO app_user;
GRANT USAGE ON SCHEMA public TO app_user;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_user;

-- Read-only analytics user
CREATE ROLE analytics_user LOGIN PASSWORD '...';
GRANT CONNECT ON DATABASE myapp TO analytics_user;
GRANT USAGE ON SCHEMA public TO analytics_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO analytics_user;

-- Migration user — DDL allowed; used only by CI
CREATE ROLE migrator LOGIN PASSWORD '...';
ALTER ROLE migrator IN DATABASE myapp SET search_path = public;
GRANT ALL PRIVILEGES ON DATABASE myapp TO migrator;

Never have your app connect as the superuser. A SQL-injection bug then becomes a database-takeover bug.

Row-level security

Postgres supports RLS — policies that automatically filter rows based on the current user. Useful for multi-tenant SaaS to enforce isolation in the database itself.

Network Isolation

  • Put the database in a private subnet; no public IP
  • Use security groups / firewalls to allow only the app servers
  • VPC peering or private endpoints for cross-account / cross-VPC access
  • For analytics, use bastion hosts or services like AWS Session Manager — never expose the DB to the internet

Audit Logging

  • Enable audit logging for security-relevant events: logins, failed logins, DDL changes, privilege grants
  • Postgres: pgaudit extension
  • MySQL: audit_log plugin
  • SQL Server: SQL Server Audit
  • Ship logs to your central log store (CloudWatch, Splunk, ELK) and alert on anomalies

Connection Pooling

Not strictly security but operational health: each connection costs memory and a backend process in Postgres. A connection pool (PgBouncer, RDS Proxy, ProxySQL) sits between the app and DB and reuses connections.

  • Without pooling, a Lambda function or Kubernetes deployment can easily exhaust connection limits
  • PgBouncer with transaction pooling lets thousands of clients share a small set of DB connections

An Operational Checklist

  1. RPO & RTO defined and tested with a real restore
  2. Backups encrypted, off-region, retention bounded
  3. Multi-AZ replication; automated failover tested
  4. Encryption at rest with managed keys; TLS enforced
  5. App connects as a least-privilege user; admin/superuser locked down
  6. Database in private subnet; no public access
  7. Audit logs flowing to central store; alerts on suspicious patterns
  8. Connection pooling in place
  9. Runbook for common incidents: failover, restore, slow queries, disk full

Do these and your database is ready for production. Skip them and you'll find out which one matters first the hard way.

Key Takeaways

  • Define RPO (data loss tolerance) and RTO (downtime tolerance) before choosing a backup strategy.
  • Point-in-time recovery (PITR) lets you roll back to any second within the retention window.
  • Replication gives high availability and read scaling — sync for safety, async for performance.
  • Encrypt at rest (TDE/KMS) and in transit (TLS); rotate keys; never put data on unencrypted disks.
  • Use least-privilege grants, IAM auth where possible, private networking, and audit logs in production.
🎉

Course Complete!

You've finished SQL and Databases Fundamentals. Now put your knowledge to the test with real exam-style practice questions.