Operating Vault: Unsealing, HA, and DR — HashiCorp Vault & Secrets Management | CertQnA

Vault is a critical-path dependency: when it's down, your apps can't get secrets. This lesson covers the operational practices that keep it up — and the recovery steps when something goes wrong.

High Availability with Raft

Modern Vault uses integrated storage — Raft consensus built into the binary. Production topology:

Cluster size	Failure tolerance	When to use
1 node	None	Dev only
3 nodes	1 node failure	Most production deployments
5 nodes	2 node failures	High-availability or multi-AZ requirements

Avoid even cluster sizes: a 4-node cluster needs 3 nodes for quorum, so it still tolerates only 1 failure, while the extra node is one more machine that can fail.
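The arithmetic behind the table can be checked with a short sketch. It assumes only the standard Raft majority rule (quorum is more than half of the voting nodes); the function names are illustrative and not part of Vault.

```python
def quorum(nodes: int) -> int:
    """Smallest majority of a Raft cluster with `nodes` voting members."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def failure_tolerance(nodes: int) -> int:
    """Number of nodes that can fail while a majority is still reachable."""
    return nodes - quorum(nodes)


if __name__ == "__main__":
    for n in range(1, 7):
        print(f"{n} nodes: quorum={quorum(n)}, tolerates={failure_tolerance(n)}")
```

Going from 3 to 4 nodes raises the quorum from 2 to 3 but leaves the tolerance at 1, which is why the table jumps straight from 3 to 5.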

Cluster configuration

storage "raft" {
  path    = "/vault/data"
  node_id = "vault-1"

  retry_join {
    leader_api_addr = "https://vault-2.acme:8200"
  }
  retry_join {
    leader_api_addr = "https://vault-3.acme:8200"
  }
}

listener "tcp" {
  address         = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"
  tls_cert_file   = "/etc/vault/tls/cert.pem"
  tls_key_file    = "/etc/vault/tls/key.pem"
}

api_addr     = "https://vault-1.acme:8200"
cluster_addr = "https://vault-1.acme:8201"
ui = true

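Each node lists its peers in retry_join blocks. After the first node is initialised and unsealed, the others join the cluster automatically, but each joining node must itself be unsealed before it can take part; with auto-unseal that happens without operator action.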
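Because every node's storage stanza differs only in its node_id and peer list, the stanzas can be generated rather than hand-edited. The following is a minimal sketch that renders the stanza shown above; the function name, the `domain` default, and the node names are assumptions taken from the example config, not a Vault API.

```python
from typing import List


def render_raft_storage(node_id: str, all_nodes: List[str], domain: str = "acme",
                        path: str = "/vault/data", port: int = 8200) -> str:
    """Render a storage "raft" stanza with one retry_join block per peer."""
    # A node should not list itself as a join target.
    peers = [n for n in all_nodes if n != node_id]
    lines = [
        'storage "raft" {',
        f'  path    = "{path}"',
        f'  node_id = "{node_id}"',
    ]
    for peer in peers:
        lines += [
            "",
            "  retry_join {",
            f'    leader_api_addr = "https://{peer}.{domain}:{port}"',
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_raft_storage("vault-1", ["vault-1", "vault-2", "vault-3"]))
```

Running it once per node yields three stanzas that differ only in node_id and in which two peers appear in retry_join.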

Auto-Unseal in Production

Avoid manual Shamir unsealing in production: every restart then has to wait for a quorum of key holders to enter their shares. Configure auto-unseal in the server configuration instead:

seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "alias/vault-auto-unseal"
}

The IAM role that Vault runs under (EC2 instance profile, EKS service account role, or ECS task role) must allow kms:Encrypt, kms:Decrypt, and kms:DescribeKey on that key. Equivalent seal types exist for Azure (azurekeyvault), GCP (gcpckms), and HSMs (pkcs11, Vault Enterprise only).
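As an illustration of those three permissions, the sketch below builds a least-privilege IAM policy document. The key ARN is a placeholder and the function name and Sid are hypothetical; substitute the ARN of the key behind your alias, since IAM policies reference keys by ARN rather than by alias.

```python
import json


def kms_unseal_policy(key_arn: str) -> dict:
    """Build an IAM policy granting only what the awskms seal needs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VaultAutoUnseal",  # illustrative statement id
                "Effect": "Allow",
                "Action": ["kms:Encrypt", "kms:Decrypt", "kms:DescribeKey"],
                "Resource": key_arn,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder ARN: account id and key id are not real.
    arn = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
    print(json.dumps(kms_unseal_policy(arn), indent=2))
```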

Recovery keys (Shamir-split) replace unseal keys; they are used for sensitive operations like generating root tokens, not for routine unsealing.

Backups

Take regular Raft snapshots:

vault operator raft snapshot save /backup/vault-$(date +%F).snap

Restoring replaces the cluster's entire current state with the snapshot's contents, so run it deliberately:

vault operator raft snapshot restore /backup/vault-2026-05-27.snap

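Schedule snapshots hourly (via cron / Kubernetes CronJob) and ship them off-cluster (S3 with versioning + lifecycle rules). With an hourly schedule, put the time in the filename as well as the date (for example date +%FT%H%M); a date-only name is overwritten by every run that day. Test restores regularly: a backup you can't restore is no backup.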
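The naming and retention side of such a job can be sketched as follows. This is an assumption-laden example, not a Vault feature: the filename pattern, the 7-day default, and the function names are all illustrative, and the code only decides which files are old enough to delete; it does not call Vault or touch storage.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

PREFIX, SUFFIX = "vault-", ".snap"
STAMP = "%Y-%m-%dT%H%M"  # minute resolution, so hourly runs never collide


def snapshot_name(now: datetime) -> str:
    """Filename for a snapshot taken at `now`."""
    return f"{PREFIX}{now.strftime(STAMP)}{SUFFIX}"


def snapshot_time(name: str) -> Optional[datetime]:
    """Parse the UTC timestamp out of a snapshot filename, or None if it does not match."""
    if not (name.startswith(PREFIX) and name.endswith(SUFFIX)):
        return None
    try:
        stamp = name[len(PREFIX):-len(SUFFIX)]
        return datetime.strptime(stamp, STAMP).replace(tzinfo=timezone.utc)
    except ValueError:
        return None


def expired(names: List[str], now: datetime, keep_days: int = 7) -> List[str]:
    """Snapshot files older than the retention window; unrelated files are ignored."""
    cutoff = now - timedelta(days=keep_days)
    return [n for n in names if (t := snapshot_time(n)) is not None and t < cutoff]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("next snapshot:", snapshot_name(now))
```

If the snapshots live in S3, a bucket lifecycle rule can enforce the same retention without any pruning code.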

Vault Enterprise can do this natively: automated snapshots, configured at sys/storage/raft/snapshot-auto/config/<name>, write to local disk, S3, Azure Blob Storage, or GCS on a schedule.

Replication (Enterprise)

Vault Enterprise offers two replication modes:

Mode	Purpose
Performance Replication (PR)	Scale reads — secondary clusters in other regions serve local apps with low latency. Writes still go through the primary.
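Disaster Recovery (DR) Replication	A warm standby cluster, typically in another region. It mirrors the primary's full state, including tokens and leases, but serves no client requests until an operator promotes it to primary during a failover.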

Open-source Vault does not have replication; backup/restore is your DR mechanism.

Lease Management at Scale

Dynamic secrets generate leases — many of them. Long-lived clusters can accumulate millions of leases, which slows down operations. Best practices:

Use short TTLs (1h is the sweet spot for most apps)
Configure default_lease_ttl and max_lease_ttl per mount
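Vault Enterprise supports lease count quotas (sys/quotas/lease-count/<name>): set them per mount to cap runaway lease growth. A quota rejects new leases once its cap is reached, so alert on vault.expire.num_leases well before that point
Leases are revoked automatically when they expire; if storage errors leave dangling lease entries behind, clean them up with: vault write -f sys/leases/tidy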
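The effect of TTL on lease count is simple arithmetic: at steady state, outstanding leases are roughly the issue rate multiplied by the TTL. The sketch below assumes every lease runs to its full TTL and is neither renewed nor revoked early; the rate of 50 leases per second is a made-up figure for illustration.

```python
def outstanding_leases(issued_per_second: float, ttl_seconds: int) -> int:
    """Approximate steady-state lease count: arrival rate times lifetime."""
    if issued_per_second < 0 or ttl_seconds < 0:
        raise ValueError("rate and TTL must be non-negative")
    return round(issued_per_second * ttl_seconds)


if __name__ == "__main__":
    rate = 50  # hypothetical leases issued per second
    for label, ttl in (("1h", 3600), ("24h", 86400)):
        print(f"TTL {label}: ~{outstanding_leases(rate, ttl):,} leases")
```

At the same issue rate, moving from a 24h TTL to a 1h TTL cuts the steady-state lease count by a factor of 24.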

Audit and Telemetry

Production Vault must have:

An audit device (file or syslog) — shipped to a SIEM
Prometheus telemetry enabled: a telemetry block with prometheus_retention_time = "24h" and disable_hostname = true, scraped from /v1/sys/metrics?format=prometheus
Standard dashboards: token count, request latency, seal status, leader election events, vault.expire.num_leases
Alerts on: seal status changes, leader elections, audit device failures, request error rate
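For the seal-status alert, a health probe can rely on the HTTP status returned by the sys/health endpoint. The sketch below maps Vault's default status codes to a state and a paging decision. The codes are the documented defaults (they can be overridden with query parameters); the paging policy and all names are illustrative assumptions.

```python
from typing import NamedTuple


class HealthState(NamedTuple):
    description: str
    page_oncall: bool


# Default sys/health status codes; the page_oncall choices are an example policy.
_DEFAULT_CODES = {
    200: HealthState("initialized, unsealed, active", False),
    429: HealthState("unsealed, standby", False),
    472: HealthState("DR secondary", False),
    473: HealthState("performance standby", False),
    501: HealthState("not initialized", True),
    503: HealthState("sealed", True),
}


def classify_health(status_code: int) -> HealthState:
    """Translate a sys/health HTTP status into a state; unknown codes page by default."""
    return _DEFAULT_CODES.get(
        status_code, HealthState(f"unexpected status {status_code}", True)
    )


if __name__ == "__main__":
    for code in (200, 429, 503, 500):
        print(code, classify_health(code))
```

Treating unknown codes as page-worthy is a deliberate choice here: a probe that silently ignores unexpected responses hides exactly the failures it exists to catch.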

Common Incidents and Responses

Incident	Response
Suspected credential compromise	`vault lease revoke -prefix <path>` immediately; then rotate static secrets in scope
Vault sealed unexpectedly	Check KMS access (auto-unseal); inspect logs for storage errors
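Split-brain after network partition	Raft prevents a true split-brain: the minority side cannot elect a leader or accept writes, and it rejoins once connectivity returns. Restore from snapshot only as a last resort
Lost quorum (fewer than a majority of nodes reachable)	If the nodes are only temporarily down, bring them back and quorum returns. If they are gone for good, recover a surviving node with a peers.json file or restore from snapshot to a new cluster; promote the DR secondary if on Enterprise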
Lost root token	`vault operator generate-root` with recovery-key quorum
App can't auth	Check policy, lease quota, time skew (JWT auth is time-sensitive)

Upgrade Path

Vault uses a rolling upgrade model:

Take a snapshot
Upgrade standby nodes one at a time, waiting for each to fully rejoin
Force a leader step-down (vault operator step-down) — a standby becomes active
Upgrade the (former) active node
Verify cluster health and Raft peer count
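The ordering rule in these steps (standbys first, active node last) can be sketched as a small helper. The input shape is an assumption loosely modelled on the node and state columns of vault operator raft list-peers; the function does not talk to Vault.

```python
from typing import Dict, List


def upgrade_order(peers: List[Dict[str, str]]) -> List[str]:
    """Return node ids in rolling-upgrade order: followers first, leader last."""
    leaders = [p["node"] for p in peers if p["state"] == "leader"]
    if len(leaders) != 1:
        # Without exactly one leader the cluster is not healthy enough to upgrade.
        raise ValueError(f"expected exactly one leader, found {len(leaders)}")
    followers = sorted(p["node"] for p in peers if p["state"] != "leader")
    return followers + leaders


if __name__ == "__main__":
    cluster = [
        {"node": "vault-1", "state": "follower"},
        {"node": "vault-2", "state": "leader"},
        {"node": "vault-3", "state": "follower"},
    ]
    print(upgrade_order(cluster))
```

Refusing to produce an order when there is no single leader mirrors the advice to verify cluster health before and after each step.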

Read the upgrade guide for the source and target versions; some releases have schema migrations or behaviour changes.

Capacity Planning

CPU is the bottleneck for Transit-heavy workloads
Memory is the bottleneck for high lease counts
Disk I/O matters for storage-heavy workloads (KV writes, audit logs)
Network egress matters for replication

As a rough guide, a 4-vCPU / 16 GB Vault node can serve thousands of requests per second for a typical mixed workload, but benchmark against your own request mix before sizing. Scale horizontally with PR (Enterprise) or by sharding logically (separate clusters for separate groups of mounts).

Operational Maturity Checklist

3+ Raft nodes across AZs
Auto-unseal via cloud KMS
Hourly automated snapshots, off-cluster, retention > 7 days
Audit device shipping to SIEM
Prometheus + alerts on key metrics
Documented runbook for: seal/unseal, lease revocation, restore from snapshot
Quarterly disaster recovery drill
Restricted production root access (no standing root token; mint one with generate-root and a key-holder quorum when needed, then revoke it)

Hit all of these and your Vault deployment is production-grade.