7 min read·Lesson 7 of 8

Operating Vault: Unsealing, HA, and DR

Run Vault reliably — high availability with Raft, performance and disaster-recovery replication, backup, and incident response.

Vault is a critical-path dependency: when it's down, your apps can't get secrets. This lesson covers the operational practices that keep it up — and the recovery steps when something goes wrong.

High Availability with Raft

Modern Vault uses integrated storage — Raft consensus built into the binary. Production topology:

Cluster size | Failure tolerance | When to use
1 node       | None              | Dev only
3 nodes      | 1 node failure    | Most production deployments
5 nodes      | 2 node failures   | High-availability or multi-AZ requirements

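Avoid even-sized clusters: a 4-node cluster still needs 3 votes for quorum, so it tolerates only one failure, the same as a 3-node cluster, while adding one more node that can fail.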
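The failure-tolerance figures follow from Raft's majority rule. A minimal Python sketch of that arithmetic, where the function names are illustrative and not part of any Vault API:

```python
def quorum(nodes: int) -> int:
    """Smallest majority of a Raft cluster with `nodes` voting members."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def failure_tolerance(nodes: int) -> int:
    """How many voters can fail while a majority is still reachable."""
    return nodes - quorum(nodes)


for n in (1, 3, 4, 5):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {failure_tolerance(n)} failure(s)")
```

Four nodes tolerate one failure, exactly like three, which is why the extra node buys nothing.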

Cluster configuration

storage "raft" {
  path    = "/vault/data"
  node_id = "vault-1"   # must be unique per node

  # Peers to try when joining; Vault retries until one of them answers.
  retry_join {
    leader_api_addr = "https://vault-2.acme:8200"
  }
  retry_join {
    leader_api_addr = "https://vault-3.acme:8200"
  }
}

listener "tcp" {
  address         = "0.0.0.0:8200"   # client API
  cluster_address = "0.0.0.0:8201"   # node-to-node traffic
  tls_cert_file   = "/etc/vault/tls/cert.pem"
  tls_key_file    = "/etc/vault/tls/key.pem"
}

# Addresses this node advertises to clients and to its peers.
api_addr     = "https://vault-1.acme:8200"
cluster_addr = "https://vault-1.acme:8201"
ui           = true

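Each node lists its peers in retry_join blocks and keeps retrying until it reaches the active node. After the first node is initialised and unsealed, the others join automatically. Each joining node must still be unsealed, which auto-unseal handles without operator action.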

Auto-Unseal in Production

Avoid manual Shamir unsealing in production: every restart would need a quorum of key holders before the node could serve requests. Configure auto-unseal in the server configuration:

seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "alias/vault-auto-unseal"
}

The IAM role attached to Vault's EC2 instance, EKS pod, or ECS task must allow kms:Encrypt, kms:Decrypt, and kms:DescribeKey on that key. Equivalent seal types exist for Azure (azurekeyvault), GCP (gcpckms), and HSMs (pkcs11, Enterprise only).
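As an illustration of those three permissions, the following Python sketch builds a least-privilege IAM policy document. The account ID and key ID are hypothetical placeholders, and the helper name is made up for this example:

```python
import json


def unseal_key_policy(key_arn: str) -> dict:
    """IAM policy document granting only the three KMS actions named above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["kms:Encrypt", "kms:Decrypt", "kms:DescribeKey"],
                "Resource": key_arn,
            }
        ],
    }


# Hypothetical key ARN, for illustration only.
EXAMPLE_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

print(json.dumps(unseal_key_policy(EXAMPLE_KEY_ARN), indent=2))
```

Note that the Resource here is assumed to be the key's ARN rather than its alias; check how your account resolves aliases before relying on this shape.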

With auto-unseal, Shamir-split recovery keys take the place of unseal keys. They authorise sensitive operations such as generating a root token; they are not used for routine unsealing.

Backups

Take regular Raft snapshots:

vault operator raft snapshot save /backup/vault-$(date +%F).snap

Restoring is a single command, but it replaces the cluster's current data with the snapshot's contents:

vault operator raft snapshot restore /backup/vault-2026-05-27.snap

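Schedule snapshots hourly (via cron or a Kubernetes CronJob) and ship them off-cluster (for example, S3 with versioning and lifecycle rules). For hourly runs, include the time in the file name (for example, date +%FT%H%M) so a run does not overwrite an earlier snapshot from the same day. Test restores regularly: a backup you can't restore is no backup.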
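A retention job usually accompanies the snapshot schedule. The Python sketch below shows one way to name snapshots and pick out the ones past retention; the timestamped naming scheme and both helper names are assumptions made for this example, not Vault conventions:

```python
from datetime import datetime, timedelta
from pathlib import Path

# Assumed naming scheme, e.g. vault-2026-05-27T1400.snap
STAMP_FORMAT = "%Y-%m-%dT%H%M"


def snapshot_path(backup_dir: Path, taken_at: datetime) -> Path:
    """Build a snapshot file name that stays unique across hourly runs."""
    return backup_dir / f"vault-{taken_at.strftime(STAMP_FORMAT)}.snap"


def expired_snapshots(paths, now: datetime, keep_days: int):
    """Return the snapshots whose embedded timestamp is older than the retention window."""
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for path in paths:
        stamp = path.stem.removeprefix("vault-")
        try:
            taken_at = datetime.strptime(stamp, STAMP_FORMAT)
        except ValueError:
            continue  # not one of our snapshots; leave it alone
        if taken_at < cutoff:
            expired.append(path)
    return expired
```

The function only selects candidates; deleting them (or leaving that to bucket lifecycle rules) is a separate decision.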

Enterprise: automated snapshots (configured at sys/storage/raft/snapshot-auto/config/<name>) do this natively, writing to S3, Azure Blob Storage, or GCS.

Replication (Enterprise)

Vault Enterprise offers two replication modes:

Mode | Purpose
Performance Replication (PR) | Scale reads: secondary clusters in other regions serve local apps with low latency. Writes to shared state still go through the primary.
Disaster Recovery (DR) Replication | A warm standby in another region. It does not serve client requests until it is promoted to primary on failover.

Open-source Vault does not have replication; backup/restore is your DR mechanism.

Lease Management at Scale

Dynamic secrets generate leases — many of them. Long-lived clusters can accumulate millions of leases, which slows down operations. Best practices:

  • Use short TTLs (1h is a reasonable starting point for most apps)
  • Configure default_lease_ttl and max_lease_ttl per mount
  • Use lease count quotas (sys/quotas/lease-count, Enterprise) to cap leases per mount; once a quota is reached, requests that would create new leases are rejected, so alert well before that point
  • If lease storage accumulates dangling entries, clean them up: vault write -f sys/leases/tidy
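To see why TTL length dominates lease counts, consider a back-of-the-envelope estimate: at steady state, live leases are roughly the issuance rate multiplied by the TTL. The Python sketch below assumes leases are neither renewed nor revoked early, and the rate of 50 leases per second is an invented figure:

```python
HOUR = 3600


def steady_state_leases(issued_per_second: float, ttl_seconds: int) -> int:
    """Approximate live lease count: issuance rate multiplied by lease lifetime."""
    return round(issued_per_second * ttl_seconds)


for label, ttl in (("1h", HOUR), ("24h", 24 * HOUR)):
    print(f"TTL {label}: ~{steady_state_leases(50, ttl):,} live leases")
```

Under these assumptions a 1h TTL holds about 180,000 live leases, while a 24h TTL holds about 4,320,000 at the same request rate.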

Audit and Telemetry

Production Vault must have:

  • An audit device (file or syslog) — shipped to a SIEM
  • Prometheus telemetry enabled in the server configuration:
      telemetry {
        prometheus_retention_time = "24h"
        disable_hostname          = true
      }
  • Standard dashboards: token count, request latency, seal status, leader election events, vault.expire.num_leases
  • Alerts on: seal status changes, leader elections, audit failures, request error rate
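One simple source for a seal-status alert is Vault's sys/health endpoint, which this lesson does not cover but which reports node state through its HTTP status code. The Python sketch below maps the default codes to states; the state labels and the paging rule are assumptions for this example, and the codes can be overridden with query parameters:

```python
# Default sys/health response codes mapped to illustrative state labels.
HEALTH_STATES = {
    200: "active",
    429: "standby",
    472: "dr-secondary",
    473: "performance-standby",
    501: "uninitialized",
    503: "sealed",
}


def classify_health(status_code: int) -> str:
    """Translate a sys/health status code into a state label."""
    return HEALTH_STATES.get(status_code, "unknown")


def should_page(status_code: int) -> bool:
    """Page on sealed, uninitialized, or unrecognised responses."""
    return classify_health(status_code) in {"sealed", "uninitialized", "unknown"}
```

A standby answering 429 is healthy, so a probe that treats every non-200 response as a failure would page on normal operation.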

Common Incidents and Responses

Incident | Response
Suspected credential compromise | Run vault lease revoke -prefix <path> immediately, then rotate static secrets in scope
Vault sealed unexpectedly | Check KMS access (auto-unseal); inspect logs for storage errors
Network partition | Raft's quorum rule stops the minority side from electing a leader, so no split-brain occurs; the cluster recovers when connectivity returns. Force-restore from a snapshot only as a last resort
Lost quorum (half or more of the nodes down) | Bring failed nodes back if possible; otherwise restore from a snapshot to a new cluster, or promote the DR replica if Enterprise
Lost root token | vault operator generate-root with a quorum of recovery keys
App can't auth | Check policy, lease quota, and time skew (JWT auth is time-sensitive)

Upgrade Path

Vault uses a rolling upgrade model:

  1. Take a snapshot
  2. Upgrade standby nodes one at a time, waiting for each to fully rejoin
  3. Force a leader step-down (vault operator step-down) — a standby becomes active
  4. Upgrade the (former) active node
  5. Verify cluster health and Raft peer count
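The ordering in the steps above can be sketched as a small planning function. This is an illustrative Python helper with made-up names, not a Vault tool; it only produces the sequence and does not run anything:

```python
def upgrade_plan(nodes: list[str], leader: str) -> list[str]:
    """Order a rolling upgrade: snapshot, standbys one by one, then the former leader."""
    if leader not in nodes:
        raise ValueError(f"{leader!r} is not a cluster member")
    steps = ["take a snapshot"]
    for node in nodes:
        if node != leader:
            steps.append(f"upgrade {node} and wait for it to rejoin")
    steps.append(f"step down {leader}")
    steps.append(f"upgrade {leader}")
    steps.append("verify cluster health and Raft peer count")
    return steps


for number, step in enumerate(upgrade_plan(["vault-1", "vault-2", "vault-3"], "vault-2"), 1):
    print(f"{number}. {step}")
```

Upgrading the active node last means leadership changes hands only once during the whole procedure.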

Read the upgrade guide for the source and target versions; some releases have schema migrations or behaviour changes.

Capacity Planning

  • CPU is the bottleneck for Transit-heavy workloads
  • Memory is the bottleneck for high lease counts
  • Disk I/O matters for storage-heavy workloads (KV writes, audit logs)
  • Network egress matters for replication

As a rough guide, a 4-vCPU / 16 GB Vault node can handle thousands of requests per second for typical mixed workloads; benchmark with your own request mix before sizing. Scale horizontally with PR (Enterprise) or by sharding logically (different mounts per cluster).

Operational Maturity Checklist

  • 3+ Raft nodes across AZs
  • Auto-unseal via cloud KMS
  • Hourly automated snapshots, off-cluster, retention > 7 days
  • Audit device shipping to SIEM
  • Prometheus + alerts on key metrics
  • Documented runbook for: seal/unseal, lease revocation, restore from snapshot
  • Quarterly disaster recovery drill
  • Restricted production root access (revoke the initial root token; use generate-root with a key-holder quorum when one is needed)

Hit all of these and your Vault deployment is production-grade.

Key Takeaways

  • Production Vault needs 3 or 5 Raft nodes for HA; odd sizes give the most failure tolerance per node, and Raft's quorum rule prevents split-brain.
  • Auto-unseal eliminates manual unseal on every restart — required for any modern deployment.
  • Performance replication (Enterprise) scales reads across regions; DR replication is for failover.
  • Snapshot backups via Raft are the standard recovery mechanism.
  • Lease revocation by prefix (vault lease revoke -prefix) is your most important incident-response tool.
