Backups, Replication, and Database Security

Production databases face three categories of risk: data loss, downtime, and unauthorised access. This lesson covers how to defend against each.

RPO and RTO — Define These First

RPO (Recovery Point Objective)	How much data are you willing to lose in a disaster? "5 minutes" means your backup strategy must cover any 5-minute window.
RTO (Recovery Time Objective)	How long can you be down? "30 minutes" means your restore process must complete in that time.

Tight RPO/RTO costs money. A consumer game might accept "1 hour / 4 hours"; a payments system might require "0 / under 1 minute". Stakeholders should agree before you design the backup architecture.

Backup Types

Full — complete copy of the database. Slow but self-contained.
Incremental — only changes since the last backup. Fast; needs a chain to restore.
Differential — changes since the last full backup. Middle ground.
Logical — SQL dump of schema + data (e.g. pg_dump). Portable but slow.
Physical — copy of data files (e.g. pg_basebackup). Fast restore.
Snapshot — block-storage snapshot. Instantaneous; cloud-managed services use these.

Point-in-Time Recovery (PITR)

Combine a base backup with the database's transaction log (WAL in Postgres, binlog in MySQL) and you can restore to any moment between them. Managed services do this automatically — RDS, Cloud SQL, Azure SQL Database, Aurora.

# Example: Postgres with pgBackRest
pgbackrest --stanza=main --type=full backup
pgbackrest --stanza=main --type=incr backup     # daily

# Restore to a specific time
pgbackrest --stanza=main \
  --type=time --target='2024-08-15 14:35:00 UTC' \
  restore

Backup Discipline

Test restores regularly. A backup you've never restored is wishful thinking.
Store off-site / cross-region. A regional outage shouldn't take backups too.
Encrypt backups at rest. Same data, same risk.
Limit retention. Forever-backups become forever-liabilities under data-protection law.
Logical + physical. Logical dumps are slower but survive engine version mismatches and corruption.

Replication

Replication keeps a second copy of the database up to date, used for high availability, read scaling, and disaster recovery.

Synchronous vs Asynchronous

Mode	How it works	Trade-off
Synchronous	Primary waits for replica to acknowledge before committing	Zero data loss; writes are slower
Asynchronous	Primary commits immediately; replica catches up	Fast writes; small data-loss risk on failover
Semi-sync	Wait for at least one replica	Compromise; common in MySQL setups

Topologies

Primary / replica — one writer, many read replicas. Most common.
Multi-AZ — primary plus a hot standby in another availability zone for HA.
Multi-region — replicas in other regions for DR or low-latency global reads.
Multi-primary / active-active — multiple writers; complex; conflict resolution required (Galera, group replication, distributed SQL).

Failover

When the primary dies, a replica is promoted. Managed services automate this; with self-hosted setups you need a tool like Patroni, repmgr, or Orchestrator. Time to failover is your RTO budget.

Encryption

At rest

Cloud-managed services encrypt by default using KMS-managed keys
Self-hosted: enable disk-level encryption (LUKS) or transparent data encryption (TDE)
Customer-managed keys (CMK) give you control over rotation and revocation
Backups must also be encrypted

In transit

TLS for all connections, including from app servers in the same VPC
Reject non-TLS connections at the database level (rds.force_ssl=1 on RDS, require_secure_transport=ON on MySQL, ssl=on in postgresql.conf)
Verify server certificates from clients (sslmode=verify-full in libpq)

Field-level encryption

For especially sensitive fields (payment card data, government IDs), encrypt the value before storing it — even DBAs can't read it without the key. Watch for the loss of indexability; deterministic or searchable encryption schemes exist but have trade-offs.

Authentication and Authorisation

Authentication

Password — basic. Rotate; never share; never commit.
IAM auth — RDS and Aurora support short-lived tokens via AWS IAM; Cloud SQL via Google IAM. No password to leak.
Certificate auth — clients present mTLS certificates.
SSO / Kerberos / Active Directory — enterprise environments.

Authorisation: least privilege

-- App user can read and write app data, nothing more
CREATE ROLE app_user LOGIN PASSWORD '...';
GRANT CONNECT ON DATABASE myapp TO app_user;
GRANT USAGE ON SCHEMA public TO app_user;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_user;

-- Read-only analytics user
CREATE ROLE analytics_user LOGIN PASSWORD '...';
GRANT CONNECT ON DATABASE myapp TO analytics_user;
GRANT USAGE ON SCHEMA public TO analytics_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO analytics_user;

-- Migration user — DDL allowed; used only by CI
CREATE ROLE migrator LOGIN PASSWORD '...';
ALTER ROLE migrator IN DATABASE myapp SET search_path = public;
GRANT ALL PRIVILEGES ON DATABASE myapp TO migrator;

Never have your app connect as the superuser. A SQL-injection bug then becomes a database-takeover bug.

Row-level security

Postgres supports RLS — policies that automatically filter rows based on the current user. Useful for multi-tenant SaaS to enforce isolation in the database itself.

Network Isolation

Put the database in a private subnet; no public IP
Use security groups / firewalls to allow only the app servers
VPC peering or private endpoints for cross-account / cross-VPC access
For analytics, use bastion hosts or services like AWS Session Manager — never expose the DB to the internet

Audit Logging

Enable audit logging for security-relevant events: logins, failed logins, DDL changes, privilege grants
Postgres: pgaudit extension
MySQL: audit_log plugin
SQL Server: SQL Server Audit
Ship logs to your central log store (CloudWatch, Splunk, ELK) and alert on anomalies

Connection Pooling

Not strictly security but operational health: each connection costs memory and a backend process in Postgres. A connection pool (PgBouncer, RDS Proxy, ProxySQL) sits between the app and DB and reuses connections.

Without pooling, a Lambda function or Kubernetes deployment can easily exhaust connection limits
PgBouncer with transaction pooling lets thousands of clients share a small set of DB connections

An Operational Checklist

RPO & RTO defined and tested with a real restore
Backups encrypted, off-region, retention bounded
Multi-AZ replication; automated failover tested
Encryption at rest with managed keys; TLS enforced
App connects as a least-privilege user; admin/superuser locked down
Database in private subnet; no public access
Audit logs flowing to central store; alerts on suspicious patterns
Connection pooling in place
Runbook for common incidents: failover, restore, slow queries, disk full

Do these and your database is ready for production. Skip them and you'll find out which one matters first the hard way.