The cost areas teams discover last are storage and data movement. Compute optimisation is well-trodden ground. Data and network are where the surprises live — especially as estates grow and AI/analytics workloads push terabytes around.
The NAT Gateway Trap
An AWS NAT Gateway costs $0.045/hour (about $32/month) plus $0.045 per GB processed. The hourly charge is trivial; the per-GB charge is not. A Kubernetes cluster pulling 10 TB of container images and external dependencies through a NAT gateway pays $450/month in data processing fees alone. Add another 10 TB of application egress and the processing fee doubles to $900/month, on top of the normal internet egress charge for that traffic.
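The arithmetic is easy to reproduce. The sketch below is a rough estimator, not a billing tool: the function name is hypothetical, the default rates are the approximate figures quoted above, and a 730-hour month is assumed.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def nat_gateway_monthly_cost(gb_processed, gateways=1,
                             hourly_rate=0.045, per_gb_rate=0.045):
    """Estimate the monthly NAT Gateway bill in USD.

    Rates default to the approximate figures quoted in the text;
    check current regional pricing before relying on them.
    """
    hourly_charge = gateways * hourly_rate * HOURS_PER_MONTH
    processing_charge = gb_processed * per_gb_rate
    return hourly_charge + processing_charge


if __name__ == "__main__":
    # 10 TB (taken as 10,000 GB) through a single gateway
    print(round(nat_gateway_monthly_cost(10_000), 2))
```

Running the same function with your own measured NAT throughput shows quickly whether the fixes below are worth the engineering time.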
Fixes:
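- VPC Endpoints / Private Endpoints for S3, ECR, DynamoDB, Secrets Manager, KMS. Traffic bypasses the NAT Gateway entirely. Gateway endpoints (S3 and DynamoDB) are free; interface endpoints carry their own hourly and per-GB charges, typically well below the NAT processing rate.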
- Image caching / Pull-through cache for ECR — pull from ECR (in-region, via endpoint) instead of public Docker Hub via NAT.
- One NAT Gateway shared across AZs in dev/test where the AZ-locality cost is acceptable. Production should still have one per AZ for resilience.
- Egress proxy or NAT instance for high-volume, low-criticality use cases.
Same pattern on Azure (NAT Gateway, Private Endpoint, Private Link) and GCP (Cloud NAT, Private Service Connect).
Inter-AZ and Inter-Region Traffic
| Pattern | AWS pricing (approx.) |
|---|---|
| Inter-AZ same region | $0.01/GB each way (i.e., $0.02 round-trip) |
| Inter-region (same continent) | $0.02/GB egress |
| Inter-region (cross-continent) | $0.02-0.09/GB egress |
| To internet | $0.05-0.09/GB egress (tiered; the highest rate applies to the first tier) |
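A microservice in AZ1 that calls a database in AZ2 a billion times a day with 5 KB payloads moves about 150 TB a month, which at an effective $0.02/GB is roughly $3,000/month. At a few million calls a day the same pattern costs only a few dollars, so measure before optimising. Mitigations: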
- Topology-aware routing in service mesh / Kubernetes — keep traffic AZ-local when replicas exist locally.
- Read replicas per AZ for read-heavy workloads.
- Cache at the consumer — Redis local to the service, not across AZs.
- Single AZ for non-prod — accept lower availability for big cost wins.
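To judge whether inter-AZ chatter deserves attention, a back-of-the-envelope estimate is usually enough. The sketch below is illustrative only: the function name is hypothetical, it assumes the $0.01/GB rate from the table is charged on both the sending and the receiving side (so $0.02 per GB moved), and it uses decimal units (1 GB = 1,000,000 KB).

```python
def inter_az_monthly_cost(calls_per_day, payload_kb,
                          rate_per_gb_each_side=0.01, days=30):
    """Estimate monthly cross-AZ transfer cost in USD.

    payload_kb is the total data moved per call (request plus response).
    The rate is assumed to apply once on egress and once on ingress.
    """
    gb_per_month = calls_per_day * payload_kb * days / 1_000_000
    return gb_per_month * rate_per_gb_each_side * 2


if __name__ == "__main__":
    for calls in (1_000_000, 100_000_000, 1_000_000_000):
        cost = inter_az_monthly_cost(calls, payload_kb=5)
        print(f"{calls:>13,} calls/day -> ${cost:,.2f}/month")
```

The estimate scales linearly with both call volume and payload size, so a chatty service with large responses reaches a meaningful bill much sooner than one exchanging small messages.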
The Egress Lock-In
Egress to the public internet (or to a different cloud) is where lock-in lives. Hyperscaler internet egress runs $0.05-0.09/GB at published tiers, dropping to around $0.02 at petabyte scale. Specialty providers (Cloudflare, Fastly, Bunny) often charge a fraction of that.
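Because egress is priced in volume tiers, the blended rate depends on how much you send. The sketch below shows one way to compute a tiered bill. The tier boundaries and rates are assumptions modelled on approximate AWS list prices (10 TB at $0.09, the next 40 TB at $0.085, the next 100 TB at $0.07, then $0.05), and free-tier allowances are ignored; substitute your provider's current figures.

```python
# (tier size in GB, USD per GB); assumed values, not an official price list
EGRESS_TIERS = [
    (10_240, 0.09),
    (40_960, 0.085),
    (102_400, 0.07),
    (float("inf"), 0.05),
]


def tiered_egress_cost(gb, tiers=EGRESS_TIERS):
    """Return the monthly egress cost in USD for gb of internet egress."""
    remaining = gb
    total = 0.0
    for tier_size, rate in tiers:
        billed = min(remaining, tier_size)  # portion that falls in this tier
        total += billed * rate
        remaining -= billed
        if remaining <= 0:
            break
    return total


if __name__ == "__main__":
    for volume_gb in (10_240, 51_200, 512_000):
        cost = tiered_egress_cost(volume_gb)
        print(f"{volume_gb:>8,} GB -> ${cost:,.2f} (${cost / volume_gb:.4f}/GB blended)")
```

Comparing the blended per-GB figure against a CDN or specialty provider quote makes the lock-in cost concrete.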
Mitigations:
- CDN in front of everything user-facing. CDN providers such as Cloudflare and Fastly have very different egress economics from the hyperscalers; their bandwidth is often free or much cheaper.
- Direct Connect / ExpressRoute / Interconnect for high-volume on-prem ↔ cloud paths. Reduces per-GB rate dramatically at high volumes.
- Same-cloud architecture. Keep services and data in the same provider, where transfer is cheap or free (same-AZ private IP traffic is typically $0; cross-AZ traffic is billed as shown in the table above).
- S3 / Storage Account egress to CloudFront / Front Door / Cloud CDN is heavily discounted — design the public delivery path through the CDN, not the origin.
- AWS waived egress fees in 2024 for customers migrating off the platform (after EU regulatory pressure); request that allowance if you ever leave.
Note also: S3 Glacier retrieval fees can dwarf the storage savings if you retrieve more than expected. Tier carefully.
Storage Class Lifecycle
Covered briefly in lesson 4; recap with the lifecycle:
# S3 lifecycle policy (simplified)
Transition:
- 30 days → Standard-IA
- 90 days → Glacier Instant Retrieval
- 365 days → Glacier Deep Archive
Expiration:
- 7 years (matches retention policy)
- Incomplete multipart uploads: 7 days
- Non-current versions: 90 days then delete
Implement this once in your IaC modules so that it applies to every bucket those modules manage, new and existing. Equivalent rules exist on Azure Blob and GCS.
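For teams that script this rather than use an IaC module, the policy above could be expressed roughly as follows. This is a minimal sketch: the function names and rule ID are hypothetical, 7 years is approximated as 7 x 365 days, and the empty prefix filter applies the rule to the whole bucket. The boto3 call is only made inside `apply_to_bucket`, so building the configuration needs no AWS access.

```python
SEVEN_YEARS_IN_DAYS = 7 * 365  # assumption: retention expressed in days


def build_lifecycle_configuration(rule_id="default-tiering"):
    """Return a lifecycle configuration dict in the shape the S3 API expects."""
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: every object
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": SEVEN_YEARS_IN_DAYS},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
            }
        ]
    }


def apply_to_bucket(bucket_name):
    """Apply the configuration to one bucket (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder has no dependencies

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=build_lifecycle_configuration(),
    )
```

Note that this call replaces any lifecycle configuration already on the bucket, so review existing rules before applying it broadly.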
Snapshots and Backups
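Each EBS snapshot stores only the blocks changed since the previous snapshot, but those blocks are billed until every snapshot that references them is deleted. A daily snapshot retained for a year on a 1 TB volume with 5% daily churn accumulates up to roughly 19 TB of snapshot storage (the 1 TB baseline plus about 50 GB a day), many times the size of the volume itself. Patterns: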
- Lifecycle policy on snapshots — daily for 7 days, weekly for 4, monthly for 12, then delete.
- EBS Snapshot Archive tier (75% cheaper, slower restore) for long-term retention.
- Export RDS snapshots that must outlive the automated backup retention window to S3, then tier them to Glacier; this is cheaper than RDS backup storage.
- Audit encryption changes. A snapshot copied or re-encrypted under a different KMS key is a full copy rather than an incremental one, so it shares no blocks with prior snapshots.
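The "daily for 7 days, weekly for 4, monthly for 12" schedule from the first pattern can be sketched as a selection function. This is one possible interpretation, not the behaviour of any particular AWS service: it keeps the newest snapshot in each of the most recent 7 days, 4 ISO weeks and 12 calendar months that have a snapshot, and everything else becomes a deletion candidate. The function name is hypothetical.

```python
from datetime import date, timedelta


def snapshots_to_keep(snapshot_dates, daily=7, weekly=4, monthly=12):
    """Return the set of snapshot dates to retain under a tiered schedule."""
    keep = set()
    days_seen, weeks_seen, months_seen = [], [], []
    # Walk newest to oldest so the first hit in each bucket is the newest one.
    for d in sorted(set(snapshot_dates), reverse=True):
        week_key = tuple(d.isocalendar()[:2])  # (ISO year, ISO week)
        month_key = (d.year, d.month)
        if d not in days_seen and len(days_seen) < daily:
            days_seen.append(d)
            keep.add(d)
        if week_key not in weeks_seen and len(weeks_seen) < weekly:
            weeks_seen.append(week_key)
            keep.add(d)
        if month_key not in months_seen and len(months_seen) < monthly:
            months_seen.append(month_key)
            keep.add(d)
    return keep


if __name__ == "__main__":
    newest = date(2024, 12, 31)
    history = [newest - timedelta(days=i) for i in range(400)]
    kept = snapshots_to_keep(history)
    print(f"{len(history)} snapshots taken, {len(kept)} retained")
```

Because the tiers overlap (the newest snapshot is also the newest of its week and month), the retained count is at most 23 and usually lower.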
Data Warehouse and Lake Costs
Snowflake, BigQuery, Redshift, Databricks bill on compute (per-second warehouse time, bytes scanned or slot-hours, DBUs) plus storage. Compute usually dominates.
Common waste:
- Warehouses left running at large size. Auto-suspend after 1-5 minutes; right-size to S/M default; scale up explicitly when needed.
- Full-scan queries. Partition every table by date; cluster by frequent filter columns; use materialised views for common aggregations.
- SELECT * in BI tools. Columnar engines read, and scan-priced services bill for, every column a query references.
- Ad-hoc query workloads on production warehouses. Separate compute clusters per workload class (BI / ELT / data science).
BigQuery specifically: switch to Editions (slot-based) if monthly cost from on-demand exceeds the cost of a reserved slot pool.
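The break-even point can be estimated with simple arithmetic. The sketch below assumes an always-on pool of fixed size with no autoscaling and a 730-hour month; the function names are hypothetical, and the prices in the usage example are illustrative placeholders rather than quoted rates, so substitute current figures for your region and edition.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def slot_pool_monthly_cost(slots, slot_hour_price, hours=HOURS_PER_MONTH):
    """Monthly cost in USD of a fixed, always-on slot pool."""
    return slots * slot_hour_price * hours


def break_even_tib(slots, slot_hour_price, on_demand_price_per_tib,
                   hours=HOURS_PER_MONTH):
    """TiB scanned per month above which the slot pool is cheaper."""
    pool_cost = slot_pool_monthly_cost(slots, slot_hour_price, hours)
    return pool_cost / on_demand_price_per_tib


if __name__ == "__main__":
    # Placeholder prices for illustration only.
    threshold = break_even_tib(slots=100, slot_hour_price=0.04,
                               on_demand_price_per_tib=6.25)
    print(f"Slot pool wins above about {threshold:.1f} TiB scanned per month")
```

The comparison only holds if the pool is large enough to run the workload at acceptable speed, so validate query latency on slots before committing.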
Lakehouse formats (Iceberg, Delta, Hudi) plus separation of storage from compute give the same query patterns at lower cost — and across multiple compute engines.
Observability Data Costs
Logs, metrics, traces are themselves a major and growing line item:
- CloudWatch Logs ingestion + storage
- Splunk indexing fees
- Datadog per-host + log indexing
- New Relic per-user + ingest
- Self-hosted Prometheus / Loki / Tempo + the storage backing them
Patterns:
- Sample. Not every request needs a full trace. 1-5% sampling, with 100% for errors, is common.
- Tier logs by severity / namespace. Cold logs to S3 (queried with Athena); hot logs only for active investigation.
- Drop labels and fields you do not query. Prometheus cardinality explosions are a classic: a user-id label multiplies the series count by the number of distinct users, easily 1,000x or more.
- Retention policies. 14 days hot, 90 days warm, 1 year cold is enough for most teams.
- OpenTelemetry collector + S3 for archival; Grafana Loki / Tempo for hot storage; you control the egress.
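The sampling pattern (a small percentage of normal traffic, 100% of errors) can be sketched as a single decision function. This is an illustration, not the API of any tracing library: the function name is hypothetical, and it assumes the decision is made after the request finishes, when the error status is known. Hashing the trace ID makes the decision deterministic, so every service reaches the same verdict for the same trace.

```python
import hashlib


def keep_trace(trace_id, is_error, sample_rate=0.05):
    """Decide whether to retain a finished trace.

    Errors are always kept; other traces are kept with probability
    sample_rate, derived from a stable hash of the trace ID.
    """
    if is_error:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


if __name__ == "__main__":
    ids = [f"trace-{i}" for i in range(10_000)]
    kept = sum(keep_trace(t, is_error=False) for t in ids)
    print(f"kept {kept} of {len(ids)} non-error traces")
```

In a real pipeline this logic usually lives in the collector rather than the application, which keeps the sampling rate adjustable without redeploying services.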
The Cost-Aware Architecture Checklist
At design review, ask:
- Does this design route any traffic through a NAT gateway? If so, can a VPC endpoint replace it?
- Is inter-AZ chatter expected? Can we collocate or cache?
- What is the projected egress to internet? Is the CDN path right?
- Where does the data sit in 1, 12, 36 months? Are tiers and lifecycles defined?
- What is the backup retention and is it tiered?
- Is the data warehouse / lake right-sized? Auto-suspend? Reservation if steady?
- Is observability sampled and tiered?
Applied at architectural review, this checklist prevents next year's "where did this cost come from" investigation. Treat cost as a non-functional requirement, alongside latency and availability.
The final lesson covers the cultural and tooling pieces — turning these techniques into a sustained practice rather than a one-off campaign.