Rightsizing and Utilisation — Cloud FinOps and Cost Management | CertQnA

The first rule of cloud optimisation: most things are too big. CPU utilisation across enterprise cloud fleets typically averages 10-20%. Memory averages 30-40%. Storage usage often falls below 50% of provisioned. The gap between provisioned and used is the largest single category of cloud waste.

Rightsizing means closing that gap. It comes after the cheaper wins: killing the obviously idle and scheduling non-production environments.

Step 1: Kill the Idle

Before rightsizing, eliminate resources that are not used at all:

EC2 instances with <5% CPU and <5% network utilisation for 7+ days
Unattached EBS volumes — they bill regardless of attachment
Old EBS snapshots beyond retention policy
Idle load balancers (zero traffic)
Unused Elastic IPs (since February 2024 AWS bills every public IPv4 address, so an unassociated one is pure cost)
RDS / databases with no connections
Empty S3 buckets in expensive classes
NAT Gateways in subnets with no outbound traffic
VM scale sets / Auto Scaling Groups at min=N when usage justifies min=0

Each provider has native tooling: AWS Trusted Advisor and AWS Cost Optimization Hub, Azure Advisor, GCP Recommender. They will surface idle resources in minutes. Third-party tools (Vantage, CloudHealth, ProsperOps, Kion, Apptio Cloudability) add automation.

Make this an automated weekly job, not a one-off campaign. Idle resources reappear constantly.
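As a rough illustration of what such a weekly job might check, here is a minimal sketch of the first rule in the list (below 5% CPU and network for 7+ days). The DailyUsage record shape, the instance names, and the idea of expressing network as a percentage of baseline are assumptions for the example, not the output format of any provider tool.

```python
from dataclasses import dataclass


@dataclass
class DailyUsage:
    """One day of averaged metrics for an instance (hypothetical record shape)."""
    cpu_pct: float      # average CPU utilisation, 0-100
    network_pct: float  # network throughput as % of baseline, 0-100 (assumed unit)


def is_idle(days: list[DailyUsage], threshold_pct: float = 5.0, min_days: int = 7) -> bool:
    """True if each of the most recent `min_days` days is below both thresholds."""
    if len(days) < min_days:
        return False  # not enough history to judge
    recent = days[-min_days:]
    return all(d.cpu_pct < threshold_pct and d.network_pct < threshold_pct for d in recent)


def find_idle(fleet: dict[str, list[DailyUsage]]) -> list[str]:
    """Return the names of idle instances, sorted for stable reporting."""
    return sorted(name for name, days in fleet.items() if is_idle(days))


# Made-up fleet: one busy instance, one forgotten one, one too new to judge.
fleet = {
    "i-web": [DailyUsage(35.0, 20.0)] * 7,
    "i-forgotten": [DailyUsage(1.2, 0.4)] * 10,
    "i-new": [DailyUsage(0.5, 0.1)] * 3,
}
print(find_idle(fleet))  # ['i-forgotten']
```

In practice the metrics would come from the provider's monitoring service, and the output would feed a ticket or a cleanup queue rather than a print statement.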

Step 2: Schedule Non-Prod

Development, staging, QA, and demo environments are typically used 8 hours a day, 5 days a week — 40 hours out of 168, or 24% of the week. Running them 24/7 wastes 76%.

Solutions:

Auto-shutdown. AWS Instance Scheduler, Azure DevTest Labs auto-shutdown, GCP scheduled instance group resizing. Make it tag-driven: a tag such as schedule=weekday-9-7 shuts the instance down outside those hours.
Spin-up on demand. Ephemeral preview environments per PR; tear down on merge. Qovery, Shipyard, GitHub Codespaces and similar make this practical.
Spot/preemptible for non-prod. Acceptable if interruption is recoverable.

Expected savings on non-prod: 60-70%. Often the largest single FinOps win.
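The arithmetic behind that figure can be sketched with a tag-driven schedule. The example below assumes that a tag value of weekday-9-7 means Monday to Friday, 09:00 to 19:00 local time; that reading, and the function names, are assumptions for illustration and not the syntax of any particular scheduler.

```python
from datetime import datetime

HOURS_PER_WEEK = 168


def parse_schedule(tag: str) -> tuple[int, int]:
    """Parse 'weekday-<start>-<end>' into 24-hour (start, end).

    Assumption: the end hour is written in 12-hour form and means PM.
    """
    kind, start, end = tag.split("-")
    if kind != "weekday":
        raise ValueError(f"unsupported schedule: {tag}")
    return int(start), int(end) + 12


def should_run(tag: str, now: datetime) -> bool:
    """True if an instance with this schedule tag should be up at `now`."""
    start, end = parse_schedule(tag)
    return now.weekday() < 5 and start <= now.hour < end


def weekly_savings(tag: str) -> float:
    """Fraction of the week the instance is off under this schedule."""
    start, end = parse_schedule(tag)
    on_hours = 5 * (end - start)
    return 1 - on_hours / HOURS_PER_WEEK


print(should_run("weekday-9-7", datetime(2024, 1, 8, 10, 0)))  # Monday 10:00 -> True
print(f"{weekly_savings('weekday-9-7'):.0%}")                  # 70%
```

A 10-hour weekday window leaves the environment off for about 70% of the week, which is where the upper end of the range comes from; compute savings track that fraction only for resources that stop billing when stopped.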

Step 3: Rightsize the Survivors

For workloads that genuinely run continuously, match instance size to observed usage.

The methodology

Capture 2-4 weeks of CloudWatch / Azure Monitor / Cloud Monitoring metrics: CPU, memory, network, disk IO, IOPS.
Compute P95 of each metric.
Pick the smallest instance type whose limits exceed P95 by a target buffer (commonly 25-40%).
Check the instance family: newer generations usually offer better price-performance.
Consider Arm-based instances such as Graviton (AWS), Ampere Altra (Azure / Oracle), or Tau T2A (GCP) for 20-40% additional savings.

Recommendation engines

Use them but verify:

AWS Compute Optimizer
Azure Advisor cost recommendations
GCP Recommender
Third-party: Vantage, ProsperOps, Cloudability, Kubecost (Kubernetes)

They are accurate for steady-state workloads. They miss burst patterns, batch jobs, periodic load tests, and warm pools for failover. Always overlay product-team knowledge.

Don't forget memory and IO

Many recommendations focus on CPU. A right-CPU but wrong-memory choice still wastes money. R instance families exist because some workloads are memory-bound; M-family is general-purpose; C-family is CPU-bound. Pick the family that matches the dominant resource.
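One hedged way to make "dominant resource" concrete is the ratio of memory to vCPU actually used. Current AWS C, M, and R families offer roughly 2, 4, and 8 GiB per vCPU respectively; the cut-off values of 3 and 6 below are midpoints chosen for this sketch, not an official rule.

```python
def pick_family(p95_vcpus: float, p95_mem_gib: float) -> str:
    """Suggest an AWS-style family letter from the observed GiB-per-vCPU ratio.

    Thresholds are illustrative midpoints between the ~2, ~4 and ~8 GiB/vCPU
    ratios of the C, M and R families.
    """
    if p95_vcpus <= 0:
        raise ValueError("p95_vcpus must be positive")
    ratio = p95_mem_gib / p95_vcpus
    if ratio <= 3:
        return "C"  # compute-bound: little memory per core in use
    if ratio <= 6:
        return "M"  # balanced
    return "R"      # memory-bound


print(pick_family(4, 6))    # C
print(pick_family(4, 16))   # M
print(pick_family(2, 30))   # R
```

The same check can be run in reverse on an existing instance: if a workload on an M-family instance shows a used ratio near 2 GiB per vCPU, half of the provisioned memory is likely going unused.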

RDS, ElastiCache, Redshift

Same approach. Many database instances are sized for peak. Consider moving from provisioned IOPS to general-purpose storage, Aurora Serverless v2 for variable workloads, and a reader/writer split rather than upsizing the primary.

Step 4: Storage Class and Lifecycle

Storage tiers offer dramatic cost differences:

S3 class	Approx $/GB-month	Access pattern
S3 Standard	$0.023	Active
S3 Standard-IA	$0.0125	Infrequent
S3 One Zone-IA	$0.01	Infrequent, lower availability OK
S3 Glacier Instant Retrieval	$0.004	Archived, occasional access
S3 Glacier Flexible Retrieval	$0.0036	Archived, hours to retrieve
S3 Glacier Deep Archive	$0.00099	Cold, 12-hour retrieve
S3 Intelligent-Tiering	tiered	Auto-moves; small monitoring fee

Equivalent tiers exist on Azure (Hot/Cool/Cold/Archive) and GCP (Standard/Nearline/Coldline/Archive).
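To put the table in concrete terms, the sketch below prices the same volume of data in each class using the approximate per-GB figures above. It covers storage only; retrieval fees, request charges, and minimum storage durations are deliberately left out, so treat the output as an upper bound on the saving.

```python
# Approximate storage prices in $/GB-month, copied from the table above.
PRICE_PER_GB_MONTH = {
    "S3 Standard": 0.023,
    "S3 Standard-IA": 0.0125,
    "S3 One Zone-IA": 0.01,
    "S3 Glacier Instant Retrieval": 0.004,
    "S3 Glacier Flexible Retrieval": 0.0036,
    "S3 Glacier Deep Archive": 0.00099,
}


def monthly_cost(gb: float, storage_class: str) -> float:
    """Storage-only monthly cost in dollars for `gb` gigabytes."""
    return gb * PRICE_PER_GB_MONTH[storage_class]


def saving_vs_standard(storage_class: str) -> float:
    """Fractional storage saving relative to S3 Standard."""
    return 1 - PRICE_PER_GB_MONTH[storage_class] / PRICE_PER_GB_MONTH["S3 Standard"]


gb = 10_000  # roughly 10 TB, an arbitrary example volume
for name in PRICE_PER_GB_MONTH:
    print(f"{name:32s} ${monthly_cost(gb, name):8.2f}  {saving_vs_standard(name):6.1%}")
```

At these rates 10,000 GB costs about $230 a month in Standard and under $10 in Deep Archive, which is why data that is never read should not stay in the default class.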

Enable lifecycle rules on every bucket: transition to IA after 30 days, Glacier after 90, Deep Archive after 365.
Use S3 Intelligent-Tiering for unknown access patterns: it is almost always a net saving, and the monitoring fee is only $0.0025 per 1,000 objects per month.
Set incomplete-multipart-upload abort rules: orphaned multipart uploads accumulate invisibly.
Delete old object versions if versioning is enabled.
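The rules above could be expressed as a single S3 lifecycle configuration along these lines. The 30/90/365-day transitions come from the text; the 7-day multipart abort window, the 90-day noncurrent-version expiry, and the rule ID are assumptions picked for the example and should be set from your own retention policy. The boto3 call is shown in a helper that is not invoked here, since it needs AWS credentials.

```python
# Lifecycle configuration in the shape expected by S3's
# PutBucketLifecycleConfiguration API.
LIFECYCLE_CONFIG = {
    "Rules": [
        {
            "ID": "default-tiering",      # hypothetical rule name
            "Filter": {"Prefix": ""},     # empty prefix = every object
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            # Assumed 7-day window before orphaned uploads are aborted.
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            # Assumed 90-day retention for old versions on versioned buckets.
            "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
        }
    ]
}


def apply_lifecycle(bucket: str) -> None:
    """Apply the configuration to a bucket (requires boto3 and credentials)."""
    import boto3  # imported lazily so the module loads without boto3 installed

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=LIFECYCLE_CONFIG
    )
```

Keeping the configuration as plain data makes it easy to review in a pull request and to reuse from an IaC module rather than applying it bucket by bucket.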

Step 5: Modernise the Architecture

Bigger gains come from changing what you run, not how big it is.

Spot / preemptible instances — up to 90% off for interruptible workloads (batch, CI, stateless workers, fault-tolerant training). Karpenter and Cluster Autoscaler make this routine on Kubernetes.
Serverless — pay per request. For low-volume or bursty endpoints often cheaper than always-on containers. Lambda, Cloud Run, Container Apps.
Managed services over self-hosted — DynamoDB vs Cassandra on EC2; managed Redis vs self-managed. Usually higher unit cost but lower total cost when ops, patching, and on-call are included.
Consolidation — multi-tenant where it fits; one well-utilised cluster beats five half-empty ones.
Region choice — same workload can be 20% cheaper in one region than another; especially relevant for batch / training.
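The serverless trade-off in particular comes down to a break-even request volume. The sketch below uses placeholder prices ($30 a month for an always-on container, $2.50 per million requests) that are invented for the example; substitute real figures, including per-request compute time, before drawing conclusions.

```python
# Placeholder prices for illustration only; not taken from any provider's price list.
ALWAYS_ON_MONTHLY = 30.0        # $ per month for a small always-on container
SERVERLESS_PER_MILLION = 2.50   # $ per million requests, all-in


def serverless_monthly(requests: float) -> float:
    """Monthly serverless cost at the assumed per-request price."""
    return requests / 1_000_000 * SERVERLESS_PER_MILLION


def break_even_requests() -> float:
    """Monthly request count at which both options cost the same."""
    return ALWAYS_ON_MONTHLY / SERVERLESS_PER_MILLION * 1_000_000


def cheaper_option(requests: float) -> str:
    """Name the cheaper option at a given monthly request volume."""
    return "serverless" if serverless_monthly(requests) < ALWAYS_ON_MONTHLY else "always-on"


print(f"{break_even_requests():,.0f} requests/month")  # 12,000,000 requests/month
print(cheaper_option(500_000))     # serverless
print(cheaper_option(50_000_000))  # always-on
```

Under these assumed prices, an endpoint below about 12 million requests a month is cheaper on a pay-per-request model, which matches the "low-volume or bursty" guidance above.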

Step 6: Continuous Rightsizing

Rightsizing once is a campaign; rightsizing continuously is operating discipline:

Monthly review of top 50 spenders.
Automated reports of P95 utilisation per resource, flagging anything below 30%.
Kubernetes Vertical Pod Autoscaler in recommendation mode; HPA on the right metric.
Storage lifecycle rules in place by default in your IaC modules.
Spot adoption tracked as a percentage of compute hours.
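Two of these checks are simple enough to sketch directly: the below-30% P95 flag and the Spot share of compute hours. The resource names and the shape of the input dictionaries are assumptions for illustration.

```python
def flag_underused(p95_by_resource: dict[str, float], threshold_pct: float = 30.0) -> list[str]:
    """Return resources whose P95 utilisation is below the threshold, sorted by name."""
    return sorted(name for name, p in p95_by_resource.items() if p < threshold_pct)


def spot_share(hours_by_purchase_type: dict[str, float]) -> float:
    """Fraction of total compute hours that ran on Spot."""
    total = sum(hours_by_purchase_type.values())
    if total == 0:
        return 0.0
    return hours_by_purchase_type.get("spot", 0.0) / total


# Made-up monthly inputs.
p95 = {"api": 62.0, "worker": 18.5, "cache": 29.9}
hours = {"on_demand": 600.0, "spot": 200.0, "reserved": 200.0}

print(flag_underused(p95))         # ['cache', 'worker']
print(f"{spot_share(hours):.0%}")  # 20%
```

Tracking both numbers month over month matters more than any single reading: the flag list should shrink and the Spot share should grow.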

What to Expect

Optimisation	Typical savings	Effort
Idle resource cleanup	5-15%	Low
Non-prod scheduling	10-20%	Medium
Storage tiering	5-10%	Low
Rightsizing	10-20%	Medium
Spot adoption	10-25%	Medium-High
ARM / Graviton	10-20% of compute	Medium (compat testing)
Architecture modernisation	varies, often largest	High

These compound. Combined, a first-year FinOps programme commonly delivers a 25-40% reduction on like-for-like workloads, and the gains keep growing as new workloads land in the optimised pattern by default.
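One way to read "compound" is that each saving applies to the bill that remains after the previous one, so percentages multiply rather than add. The sketch below assumes every saving applies to the whole remaining bill, which is a simplification: in practice each lever touches only part of the spend, and the input values here are arbitrary picks from the ranges in the table.

```python
def combined_saving(savings: list[float]) -> float:
    """Overall fractional saving when each saving applies to what is left."""
    remaining = 1.0
    for s in savings:
        remaining *= 1 - s
    return 1 - remaining


# Example: 10% from idle cleanup, 15% from scheduling, 10% from rightsizing.
steps = [0.10, 0.15, 0.10]
print(f"sum of parts: {sum(steps):.1%}")              # 35.0%
print(f"compounded:   {combined_saving(steps):.2%}")  # 31.15%
```

The compounded figure is always lower than the simple sum, which is why the headline programme result is smaller than adding up the table rows would suggest.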

One major rate-based lever remains: commitments. That is the next lesson.