Vault is a critical-path dependency: when it's down, your apps can't get secrets. This lesson covers the operational practices that keep it up — and the recovery steps when something goes wrong.
High Availability with Raft
Modern Vault uses integrated storage — Raft consensus built into the binary. Production topology:
| Cluster size | Failure tolerance | When to use |
|---|---|---|
| 1 node | None | Dev only |
| 3 nodes | 1 node failure | Most production deployments |
| 5 nodes | 2 node failures | High-availability or multi-AZ requirements |
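Avoid even cluster sizes. Quorum is a strict majority, so a fourth node raises the quorum from 2 to 3 without raising failure tolerance: you run one more node that can fail and gain nothing in return.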
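The numbers in the table follow from the majority rule. As a minimal sketch (the function names are illustrative, not part of Vault), the arithmetic can be checked in a few lines of shell:

```shell
#!/bin/sh
# Quorum for an n-node Raft cluster is a strict majority: floor(n/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Failure tolerance is how many nodes can be lost while a majority remains.
failure_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the figures for cluster sizes 1 to 5.
for n in 1 2 3 4 5; do
  echo "nodes=$n quorum=$(quorum "$n") tolerance=$(failure_tolerance "$n")"
done
```

Sizes 3 and 4 both report a tolerance of 1, which is why the even size adds nothing.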
Cluster configuration
storage "raft" {
  path    = "/vault/data"
  node_id = "vault-1"
  retry_join {
    leader_api_addr = "https://vault-2.acme:8200"
  }
  retry_join {
    leader_api_addr = "https://vault-3.acme:8200"
  }
}
listener "tcp" {
  address         = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"
  tls_cert_file   = "/etc/vault/tls/cert.pem"
  tls_key_file    = "/etc/vault/tls/key.pem"
}
api_addr     = "https://vault-1.acme:8200"
cluster_addr = "https://vault-1.acme:8201"
ui           = true
Each node lists its peers in retry_join blocks. Once the first node is initialised and unsealed, the others keep retrying those addresses and join automatically.
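To confirm that the join actually happened, one option is to poll the peer list until the expected number of voters appears. This is a sketch only: it assumes the default table output of vault operator raft list-peers (a header row, a dashed separator, then one row per node with Voter as the fourth column), and the function names are hypothetical.

```shell
#!/bin/sh
# Count voting members from `vault operator raft list-peers` table output
# read on stdin. Assumes two header lines, then Voter in the fourth column.
count_voters() {
  awk 'NR > 2 && $4 == "true" { n++ } END { print n + 0 }'
}

# Poll until the cluster reports at least the expected number of voters.
# $1 = expected voter count, $2 = maximum attempts (default 30)
wait_for_voters() {
  expected=$1
  attempts=${2:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    voters=$(vault operator raft list-peers | count_voters)
    if [ "$voters" -ge "$expected" ]; then
      echo "cluster has $voters voters"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "timed out waiting for $expected voters" >&2
  return 1
}
```

With VAULT_ADDR and a valid token set, `wait_for_voters 3` would block until a three-node cluster has fully formed.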
Auto-Unseal in Production
Do not rely on manual Shamir unsealing in production: every restart would wait for a quorum of key holders before Vault could serve requests. Configure auto-unseal in the server configuration:
seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "alias/vault-auto-unseal"
}
The IAM role attached to Vault's EC2 instance, EKS pod, or ECS task must allow kms:Encrypt, kms:Decrypt, and kms:DescribeKey on that key. Equivalent seal types exist for Azure (azurekeyvault), GCP (gcpckms), and HSMs (pkcs11).
Recovery keys (Shamir-split) replace unseal keys; they are used for sensitive operations like generating root tokens, not for routine unsealing.
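The three KMS permissions named above can be expressed as a small IAM policy. The sketch below prints one. The function name, the Sid, and the account and key IDs are made up for illustration. Note that IAM policy statements identify a KMS key by its key ARN, not by an alias.

```shell
#!/bin/sh
# Print a minimal IAM policy document for Vault's awskms seal.
# $1 = ARN of the KMS key used for auto-unseal
render_kms_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VaultAutoUnseal",
      "Effect": "Allow",
      "Action": ["kms:Encrypt", "kms:Decrypt", "kms:DescribeKey"],
      "Resource": "$1"
    }
  ]
}
EOF
}

# Example with a placeholder account ID and key ID.
render_kms_policy "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
```

The output can be saved to a file and attached to the role with your usual IAM tooling.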
Backups
Take regular Raft snapshots:
vault operator raft snapshot save /backup/vault-$(date +%F).snap
Restore is straightforward:
vault operator raft snapshot restore /backup/vault-2026-05-27.snap
Schedule snapshots hourly (via cron / Kubernetes CronJob) and ship them off-cluster (S3 with versioning + lifecycle rules). Test restore regularly — a backup you can't restore is no backup.
Vault Enterprise can do this natively: an automated snapshot configuration at sys/storage/raft/snapshot-auto/config/<name> writes snapshots on a schedule to S3, Azure Blob Storage, or GCS.
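Without the Enterprise feature, the hourly schedule can be approximated with a small script run from cron or a CronJob. This is a sketch under assumptions: BACKUP_DIR, the S3 bucket name, and the function names are placeholders, and the aws CLI is assumed to be installed with write access to the bucket. It uses an hour-resolution timestamp because a date-only file name would be overwritten by each hourly run.

```shell
#!/bin/sh
# Placeholder locations; override them through the environment.
BACKUP_DIR=${BACKUP_DIR:-/backup}
S3_BUCKET=${S3_BUCKET:-s3://example-vault-backups}

# Build the snapshot file name.
# $1 = directory, $2 = timestamp such as 2026-05-27T14
snapshot_path() {
  printf '%s/vault-%s.snap\n' "$1" "$2"
}

# Save a snapshot locally, then copy it off-cluster. Stops at the first
# failure so a failed save never uploads a stale or partial file.
take_snapshot() {
  snap=$(snapshot_path "$BACKUP_DIR" "$(date -u +%FT%H)")
  vault operator raft snapshot save "$snap" || return 1
  aws s3 cp "$snap" "$S3_BUCKET/" || return 1
  echo "$snap"
}
```

A script that ends with a call to `take_snapshot` needs VAULT_ADDR and a token with snapshot permission in its environment. Retention is better handled by the bucket's lifecycle rules than by the script.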
Replication (Enterprise)
Vault Enterprise offers two replication modes:
| Mode | Purpose |
|---|---|
| Performance Replication (PR) | Scale reads — secondary clusters in other regions serve local apps with low latency. Writes still go through the primary. |
| Disaster Recovery (DR) Replication | A warm standby in another region. It serves no client traffic until an operator promotes it to primary during failover. |
Open-source Vault does not have replication; backup/restore is your DR mechanism.
Lease Management at Scale
Dynamic secrets generate leases — many of them. Long-lived clusters can accumulate millions of leases, which slows down operations. Best practices:
- Use short TTLs (1h is the sweet spot for most apps)
- Configure default_lease_ttl and max_lease_ttl per mount
- Use lease count quotas (sys/quotas/lease-count, a Vault Enterprise feature) to cap leases per mount, so a runaway client is rejected before it overwhelms the cluster
- Periodically tidy expired leases:
vault write -f sys/leases/tidy
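The per-mount TTL settings and the tidy call can be wrapped in two small helpers. This is a sketch, not a prescribed procedure: the function names are hypothetical, and the mount path and TTL values in the usage note are examples only.

```shell
#!/bin/sh
# Set the default and maximum lease TTL on one secrets mount.
# $1 = mount path, $2 = default TTL, $3 = maximum TTL
tune_mount_ttls() {
  vault secrets tune -default-lease-ttl="$2" -max-lease-ttl="$3" "$1"
}

# Ask Vault to clean up dangling lease entries.
# -f allows a write that carries no key=value data.
tidy_leases() {
  vault write -f sys/leases/tidy
}
```

For example, `tune_mount_ttls database/ 1h 24h` would give a mount at database/ a one-hour default TTL and a 24-hour ceiling.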
Audit and Telemetry
Production Vault must have:
- An audit device (file or syslog) — shipped to a SIEM
- Prometheus telemetry endpoint enabled:
telemetry {
  prometheus_retention_time = "24h"
  disable_hostname = true
}
- Standard dashboards: token count, request latency, seal status, leader election events, vault.expire.num_leases
- Alerts on: seal status changes, leader elections, audit failures, request error rate
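Alongside Prometheus alerts, a simple external probe can watch seal status directly. The sketch below relies on the documented exit codes of vault status (0 when unsealed, 2 when sealed, 1 on error). The function names are illustrative.

```shell
#!/bin/sh
# Translate the exit code of `vault status` into a state name.
seal_state() {
  case "$1" in
    0) echo "unsealed" ;;
    2) echo "sealed" ;;
    *) echo "error" ;;
  esac
}

# Run `vault status` quietly and print the resulting state.
check_seal() {
  vault status >/dev/null 2>&1
  seal_state "$?"
}
```

A monitoring job could call `check_seal` against each node and raise an alert on any result other than unsealed.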
Common Incidents and Responses
| Incident | Response |
|---|---|
| Suspected credential compromise | vault lease revoke -prefix <path> immediately; then rotate static secrets in scope |
| Vault sealed unexpectedly | Check KMS access (auto-unseal); inspect logs for storage errors |
| Split-brain after network partition | Raft is designed to prevent this: the minority side cannot elect a leader. The cluster recovers on its own when quorum is restored; force-restore from snapshot only as a last resort |
| Lost quorum (more than half of the nodes down) | Restore from snapshot to a new cluster; promote the DR replica if on Enterprise |
| Lost root token | vault operator generate-root with recovery-key quorum |
| App can't auth | Check policy, lease quota, time skew (JWT auth is time-sensitive) |
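For the credential-compromise row, a guard around the revoke command helps avoid revoking with an empty prefix by accident during an incident. This is a sketch: the function name is hypothetical and the example prefix in the usage note is made up.

```shell
#!/bin/sh
# Revoke every lease under a prefix, refusing to run without one.
# $1 = lease prefix; a trailing slash is stripped.
revoke_compromised() {
  prefix=${1:-}
  prefix=${prefix%/}
  if [ -z "$prefix" ]; then
    echo "usage: revoke_compromised <lease-prefix>" >&2
    return 2
  fi
  vault lease revoke -prefix "$prefix"
}
```

For example, `revoke_compromised database/creds/app` would revoke all dynamic credentials issued for a role named app. Static secrets in scope still need to be rotated separately.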
Upgrade Path
Vault uses a rolling upgrade model:
- Take a snapshot
- Upgrade standby nodes one at a time, waiting for each to fully rejoin
- Force a leader step-down (vault operator step-down) so that a standby becomes active
- Upgrade the (former) active node
- Verify cluster health and Raft peer count
Read the upgrade guide for the source and target versions; some releases have schema migrations or behaviour changes.
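The upgrade steps above can be sketched as a script. Treat it as an outline rather than a finished tool. upgrade_node is a hypothetical placeholder for whatever replaces the binary at your site, the snapshot file name is an example, and the health check assumes the API listens on port 8200 as in the earlier configuration.

```shell
#!/bin/sh
# Hypothetical placeholder: replace the Vault binary on one host and restart
# the service. The real steps are site-specific, so this fails by default.
upgrade_node() {
  echo "upgrade_node is not implemented for $1" >&2
  return 1
}

# Poll a node's health endpoint until it answers. standbyok=true makes a
# healthy standby return 200 instead of 429, so curl -f counts it as success.
wait_for_node() {
  i=0
  while [ "$i" -lt 30 ]; do
    curl -fsS "https://$1:8200/v1/sys/health?standbyok=true" >/dev/null 2>&1 && return 0
    i=$((i + 1))
    sleep 5
  done
  return 1
}

# $1 = current active node, remaining arguments = standby nodes.
rolling_upgrade() {
  active=$1
  shift
  vault operator raft snapshot save "pre-upgrade-$(date +%F).snap" || return 1
  # Standbys first, one at a time, waiting for each to come back.
  for node in "$@"; do
    upgrade_node "$node" || return 1
    wait_for_node "$node" || return 1
  done
  # Hand leadership to an upgraded standby, then upgrade the old leader.
  vault operator step-down || return 1
  upgrade_node "$active" || return 1
  wait_for_node "$active" || return 1
  # Final check: print the peer list for manual verification.
  vault operator raft list-peers
}
```

Once upgrade_node has a real implementation, a call such as `rolling_upgrade vault-1.acme vault-2.acme vault-3.acme` would walk the cluster in the documented order and stop at the first failed step.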
Capacity Planning
- CPU is the bottleneck for Transit-heavy workloads
- Memory is the bottleneck for high lease counts
- Disk I/O matters for storage-heavy workloads (KV writes, audit logs)
- Network egress matters for replication
A 4-vCPU / 16 GB Vault node handles thousands of requests per second for typical mixed workloads. Scale horizontally with PR (Enterprise) or by sharding logically (different mounts per cluster).
Operational Maturity Checklist
- 3+ Raft nodes across AZs
- Auto-unseal via cloud KMS
- Hourly automated snapshots, off-cluster, retention > 7 days
- Audit device shipping to SIEM
- Prometheus + alerts on key metrics
- Documented runbook for: seal/unseal, lease revocation, restore from snapshot
- Quarterly disaster recovery drill
- Restricted production root access (use generate-root with quorum)
Hit all of these and your Vault deployment is production-grade.