Production databases face three categories of risk: data loss, downtime, and unauthorised access. This lesson covers how to defend against each.
RPO and RTO — Define These First
| RPO (Recovery Point Objective) | How much data are you willing to lose in a disaster? "5 minutes" means your backup strategy must cover any 5-minute window. |
| RTO (Recovery Time Objective) | How long can you be down? "30 minutes" means your restore process must complete in that time. |
Tight RPO/RTO costs money. A consumer game might accept "1 hour / 4 hours"; a payments system might require "0 / under 1 minute". Stakeholders should agree before you design the backup architecture.
Backup Types
- Full — complete copy of the database. Slow but self-contained.
- Incremental — only changes since the last backup. Fast; needs a chain to restore.
- Differential — changes since the last full backup. Middle ground.
- Logical — SQL dump of schema + data (e.g.
pg_dump). Portable but slow. - Physical — copy of data files (e.g.
pg_basebackup). Fast restore. - Snapshot — block-storage snapshot. Instantaneous; cloud-managed services use these.
Point-in-Time Recovery (PITR)
Combine a base backup with the database's transaction log (WAL in Postgres, binlog in MySQL) and you can restore to any moment between them. Managed services do this automatically — RDS, Cloud SQL, Azure SQL Database, Aurora.
# Example: Postgres with pgBackRest
pgbackrest --stanza=main --type=full backup
pgbackrest --stanza=main --type=incr backup # daily
# Restore to a specific time
pgbackrest --stanza=main \
--type=time --target='2024-08-15 14:35:00 UTC' \
restore
Backup Discipline
- Test restores regularly. A backup you've never restored is wishful thinking.
- Store off-site / cross-region. A regional outage shouldn't take backups too.
- Encrypt backups at rest. Same data, same risk.
- Limit retention. Forever-backups become forever-liabilities under data-protection law.
- Logical + physical. Logical dumps are slower but survive engine version mismatches and corruption.
Replication
Replication keeps a second copy of the database up to date, used for high availability, read scaling, and disaster recovery.
Synchronous vs Asynchronous
| Mode | How it works | Trade-off |
|---|---|---|
| Synchronous | Primary waits for replica to acknowledge before committing | Zero data loss; writes are slower |
| Asynchronous | Primary commits immediately; replica catches up | Fast writes; small data-loss risk on failover |
| Semi-sync | Wait for at least one replica | Compromise; common in MySQL setups |
Topologies
- Primary / replica — one writer, many read replicas. Most common.
- Multi-AZ — primary plus a hot standby in another availability zone for HA.
- Multi-region — replicas in other regions for DR or low-latency global reads.
- Multi-primary / active-active — multiple writers; complex; conflict resolution required (Galera, group replication, distributed SQL).
Failover
When the primary dies, a replica is promoted. Managed services automate this; with self-hosted setups you need a tool like Patroni, repmgr, or Orchestrator. Time to failover is your RTO budget.
Encryption
At rest
- Cloud-managed services encrypt by default using KMS-managed keys
- Self-hosted: enable disk-level encryption (LUKS) or transparent data encryption (TDE)
- Customer-managed keys (CMK) give you control over rotation and revocation
- Backups must also be encrypted
In transit
- TLS for all connections, including from app servers in the same VPC
- Reject non-TLS connections at the database level (
rds.force_ssl=1on RDS,require_secure_transport=ONon MySQL,ssl=oninpostgresql.conf) - Verify server certificates from clients (
sslmode=verify-fullin libpq)
Field-level encryption
For especially sensitive fields (payment card data, government IDs), encrypt the value before storing it — even DBAs can't read it without the key. Watch for the loss of indexability; deterministic or searchable encryption schemes exist but have trade-offs.
Authentication and Authorisation
Authentication
- Password — basic. Rotate; never share; never commit.
- IAM auth — RDS and Aurora support short-lived tokens via AWS IAM; Cloud SQL via Google IAM. No password to leak.
- Certificate auth — clients present mTLS certificates.
- SSO / Kerberos / Active Directory — enterprise environments.
Authorisation: least privilege
-- App user can read and write app data, nothing more
CREATE ROLE app_user LOGIN PASSWORD '...';
GRANT CONNECT ON DATABASE myapp TO app_user;
GRANT USAGE ON SCHEMA public TO app_user;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_user;
-- Read-only analytics user
CREATE ROLE analytics_user LOGIN PASSWORD '...';
GRANT CONNECT ON DATABASE myapp TO analytics_user;
GRANT USAGE ON SCHEMA public TO analytics_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO analytics_user;
-- Migration user — DDL allowed; used only by CI
CREATE ROLE migrator LOGIN PASSWORD '...';
ALTER ROLE migrator IN DATABASE myapp SET search_path = public;
GRANT ALL PRIVILEGES ON DATABASE myapp TO migrator;
Never have your app connect as the superuser. A SQL-injection bug then becomes a database-takeover bug.
Row-level security
Postgres supports RLS — policies that automatically filter rows based on the current user. Useful for multi-tenant SaaS to enforce isolation in the database itself.
Network Isolation
- Put the database in a private subnet; no public IP
- Use security groups / firewalls to allow only the app servers
- VPC peering or private endpoints for cross-account / cross-VPC access
- For analytics, use bastion hosts or services like AWS Session Manager — never expose the DB to the internet
Audit Logging
- Enable audit logging for security-relevant events: logins, failed logins, DDL changes, privilege grants
- Postgres:
pgauditextension - MySQL:
audit_logplugin - SQL Server: SQL Server Audit
- Ship logs to your central log store (CloudWatch, Splunk, ELK) and alert on anomalies
Connection Pooling
Not strictly security but operational health: each connection costs memory and a backend process in Postgres. A connection pool (PgBouncer, RDS Proxy, ProxySQL) sits between the app and DB and reuses connections.
- Without pooling, a Lambda function or Kubernetes deployment can easily exhaust connection limits
- PgBouncer with transaction pooling lets thousands of clients share a small set of DB connections
An Operational Checklist
- RPO & RTO defined and tested with a real restore
- Backups encrypted, off-region, retention bounded
- Multi-AZ replication; automated failover tested
- Encryption at rest with managed keys; TLS enforced
- App connects as a least-privilege user; admin/superuser locked down
- Database in private subnet; no public access
- Audit logs flowing to central store; alerts on suspicious patterns
- Connection pooling in place
- Runbook for common incidents: failover, restore, slow queries, disk full
Do these and your database is ready for production. Skip them and you'll find out which one matters first the hard way.