5 min read·Lesson 5 of 10

Log Aggregation: ELK, Loki, and CloudWatch

How modern logs flow from container to query. The ELK stack, Grafana Loki, cloud-native log services, and the trade-offs between them.

Logs flow from many places (containers, hosts, lambdas, browsers) to one searchable home. The pipeline is conceptually the same regardless of vendor.

The Four Stages of a Log Pipeline

[ Producer ]  →  [ Shipper / Agent ]  →  [ Storage / Index ]  →  [ Query / UI ]
   app, syslog,      Fluent Bit, Vector,       Elasticsearch, Loki,    Kibana, Grafana,
   journald, k8s     Filebeat, Fluentd,        OpenSearch, S3, GCS     CloudWatch console
                     Logstash, Promtail

Each stage has many implementations; mix and match.
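
To make the four stages concrete, here is a minimal in-memory sketch. Everything in it is hypothetical: the `LogStore` class, the `produce` and `ship` functions, and the sample records are illustration only, with a plain list standing in for a real storage backend.

```python
import json
from dataclasses import dataclass, field


@dataclass
class LogStore:
    """Stage 3: storage. A plain list stands in for Elasticsearch, Loki, or S3."""
    records: list = field(default_factory=list)

    def write(self, record: dict) -> None:
        self.records.append(record)

    def query(self, **filters) -> list:
        """Stage 4: query. Return records whose fields equal every filter."""
        return [r for r in self.records
                if all(r.get(k) == v for k, v in filters.items())]


def produce() -> list:
    """Stage 1: the application emits raw lines (JSON here, by assumption)."""
    return [
        '{"service": "api", "level": "info", "msg": "request ok"}',
        '{"service": "api", "level": "error", "msg": "upstream timeout"}',
        '{"service": "web", "level": "info", "msg": "page served"}',
    ]


def ship(lines: list, store: LogStore, host: str) -> None:
    """Stage 2: the agent parses each line, adds metadata, and forwards it."""
    for line in lines:
        record = json.loads(line)
        record["host"] = host  # enrichment added by the shipper
        store.write(record)


store = LogStore()
ship(produce(), store, host="node-1")
errors = store.query(service="api", level="error")
print(errors[0]["msg"])  # upstream timeout
```

Swapping any one stage (a different shipper, a different store) leaves the other three untouched, which is why mixing and matching works.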

ELK / Elastic Stack

The classic open-source log stack:

  • Elasticsearch — full-text search index over logs
  • Logstash / Beats — collection and transformation
  • Kibana — UI for search, dashboards, and analytics

OpenSearch is the open-source fork of Elasticsearch and Kibana that AWS started after Elastic's 2021 licence change. It is functionally similar for most logging use cases.

Strengths: extremely powerful queries, full-text search, mature dashboards, many integrations.

Trade-offs: indexing every field is expensive, so Elasticsearch is RAM- and disk-hungry. Operating a cluster at scale (sharding, hot/warm/cold tiers, snapshots) is real work. Many teams therefore use a managed offering (Elastic Cloud, Amazon OpenSearch Service).
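
To see why indexing every field costs memory and disk, consider a toy inverted index. This is a sketch of the general idea only, not how Elasticsearch is implemented; the `InvertedIndex` class, its whitespace tokenizer, and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def tokenize(text: str) -> list:
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class InvertedIndex:
    """Toy index mapping (field, token) to the documents that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, token) -> set of doc ids
        self.docs = []

    def add(self, doc: dict) -> int:
        doc_id = len(self.docs)
        self.docs.append(doc)
        # Every token of every field gets a posting entry: fast to search,
        # but the index grows with the vocabulary of the log content.
        for field_name, value in doc.items():
            for token in tokenize(str(value)):
                self.postings[(field_name, token)].add(doc_id)
        return doc_id

    def search(self, field_name: str, word: str) -> list:
        ids = self.postings.get((field_name, word.lower()), set())
        return [self.docs[i] for i in sorted(ids)]


index = InvertedIndex()
index.add({"service": "api", "level": "error", "message": "connection timeout to db"})
index.add({"service": "api", "level": "info", "message": "request completed"})

timeouts = index.search("message", "timeout")
print(len(timeouts), "match;", len(index.postings), "index entries for 2 log lines")
```

A lookup is a single dictionary access, which is what makes full-text search fast. The price is an index entry for nearly every distinct word in every field.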

Grafana Loki

Loki was designed as "Prometheus, but for logs." It indexes only a small set of labels (service, env, level) — the log content itself is stored as compressed chunks in object storage (S3, GCS, Azure Blob).

{service="api", env="prod"} |= "timeout" | json | duration_ms > 1000

The query language (LogQL) is modelled on PromQL: the same curly-brace label selectors, the same label matchers, and the same metric-style aggregations over time ranges.

Strengths: very cheap to run, minimal indexing, integrates beautifully with Grafana and Prometheus, scales horizontally.

Trade-offs: queries that scan large volumes are slower than Elasticsearch full-text. Best when you can narrow by labels first, then grep.
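
The "narrow by labels, then grep" model can be sketched in a few lines. This is a toy approximation, not Loki's actual storage engine: the `LabelIndexedStore` class and the `slow_timeouts` function are hypothetical names, and zlib stands in for whatever chunk compression a real deployment uses.

```python
import json
import zlib


class LabelIndexedStore:
    """Toy store: only label sets are indexed; content sits in compressed chunks."""

    def __init__(self):
        self.streams = {}  # frozenset of (label, value) pairs -> list of chunks

    def push(self, labels: dict, lines: list) -> None:
        key = frozenset(labels.items())
        chunk = zlib.compress("\n".join(lines).encode("utf-8"))
        self.streams.setdefault(key, []).append(chunk)

    def select(self, **matchers):
        """Yield lines from every stream whose labels include all matchers."""
        wanted = set(matchers.items())
        for key, chunks in self.streams.items():
            if wanted <= key:  # cheap label lookup first
                for chunk in chunks:  # then decompress and scan
                    yield from zlib.decompress(chunk).decode("utf-8").split("\n")


def slow_timeouts(store: LabelIndexedStore) -> list:
    """Rough analogue of:
    {service="api", env="prod"} |= "timeout" | json | duration_ms > 1000
    """
    results = []
    for line in store.select(service="api", env="prod"):
        if "timeout" not in line:  # |= "timeout"
            continue
        try:
            fields = json.loads(line)  # | json
        except json.JSONDecodeError:
            continue
        if not isinstance(fields, dict):
            continue
        duration = fields.get("duration_ms")
        if isinstance(duration, (int, float)) and duration > 1000:
            results.append(fields)
    return results


store = LabelIndexedStore()
store.push({"service": "api", "env": "prod"}, [
    '{"msg": "upstream timeout", "duration_ms": 1500}',
    '{"msg": "upstream timeout", "duration_ms": 200}',
    '{"msg": "ok", "duration_ms": 3000}',
])
store.push({"service": "api", "env": "staging"}, [
    '{"msg": "upstream timeout", "duration_ms": 9000}',
])
hits = slow_timeouts(store)
```

The staging stream is never decompressed because its labels fail the selector. Dropping the `env` matcher forces a scan of both streams, which is the slow path the trade-off describes.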

Cloud-Native Log Services

Service                               Notes
AWS CloudWatch Logs                   Default for Lambda; the usual choice for ECS and EKS. Queried with CloudWatch Logs Insights.
Azure Monitor Logs (Log Analytics)    KQL (Kusto Query Language). Powerful, well-integrated.
Google Cloud Logging                  Auto-collects from GKE, GCE, Cloud Run. Strong filter UI.

Pros: zero ops, integrated IAM, instant ingestion of platform-emitted logs, and natural links to the platform's billing and audit data.

Cons: vendor lock-in, per-GB pricing can hurt at high volume, query languages are platform-specific.

Tip: ship a copy to S3/GCS using built-in export for long-term cheap retention; keep the hot index small.

Commercial APMs

Datadog, Splunk, Sumo Logic, New Relic Logs all offer hosted log management bundled with metrics and traces. Strong UX, expensive at scale, but you pay one bill instead of integrating four tools.

Common Shippers

Shipper       Strength
Fluent Bit    Tiny, fast, written in C; a common default on Kubernetes
Vector        Written in Rust, modern, supports complex transforms
Filebeat      Elastic's official shipper for files / journald
Promtail      Loki's shipper, label-aware
Fluentd       Older but still common, large plugin ecosystem
Logstash      Heavy but powerful transformation language

In Kubernetes, Fluent Bit DaemonSets are the most common pattern: one pod per node tails container logs from /var/log/containers/*.log and forwards them.
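
A node-level shipper typically recovers Kubernetes metadata from the log file name before enriching each record. The sketch below assumes the common naming convention `<pod>_<namespace>_<container>-<64-hex-container-id>.log` under /var/log/containers; the regular expression and the `parse_container_log_path` function are illustrative, not code taken from any shipper.

```python
import re
from pathlib import PurePosixPath

# Assumed convention: <pod>_<namespace>_<container>-<64 hex chars>.log
CONTAINER_LOG = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_(?P<container>.+)"
    r"-(?P<container_id>[0-9a-f]{64})\.log$"
)


def parse_container_log_path(path: str):
    """Return pod, namespace, container, and container_id, or None if no match."""
    match = CONTAINER_LOG.match(PurePosixPath(path).name)
    return match.groupdict() if match else None


# Hypothetical file name; the container id is a placeholder of 64 hex characters.
example = "/var/log/containers/api-7d9f_prod_api-" + "a" * 64 + ".log"
meta = parse_container_log_path(example)
print(meta["namespace"], meta["pod"], meta["container"])  # prod api-7d9f api
```

In practice the shipper's Kubernetes filter usually goes further and queries the API server for pod labels and annotations, which this sketch does not attempt.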

Volume Control

Logs are often the biggest line item on observability bills. Three controls:

  1. Log levels — INFO in prod, DEBUG only when needed.
  2. Sampling — keep all errors; sample 10% of successful request logs.
  3. Filters at the shipper — drop noisy paths (health checks) before ingest.
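
# Fluent Bit example: drop health checks, then rate-limit successful requests.
# Note: throttle caps throughput (about Rate records per Interval, averaged
# over Window intervals); it is not a percentage sampler.
[FILTER]
    Name    grep
    Match   app.*
    Exclude $path /healthz

[FILTER]
    Name     throttle
    Match    app.success.*
    Rate     100
    Window   1
    Interval 1s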
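
The three controls above can also be expressed as a single keep-or-drop decision, for example inside an application-side log handler. This is a minimal sketch under stated assumptions: the record fields (`path`, `level`, `status`, `request_id`), the constants, and the `should_keep` function are all hypothetical. Sampling hashes the request ID so that every log line of a given request gets the same decision.

```python
import hashlib

LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}
MIN_LEVEL = "INFO"            # control 1: level threshold for prod
SUCCESS_SAMPLE_PERCENT = 10   # control 2: keep 10% of successful requests
DROP_PATHS = {"/healthz"}     # control 3: noisy paths dropped outright


def sample_bucket(key: str) -> int:
    """Map a key to a stable bucket in 0..99 using SHA-256."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_keep(record: dict) -> bool:
    """Decide whether a log record is worth shipping."""
    if record.get("path") in DROP_PATHS:
        return False
    level = LEVELS.get(record.get("level", "INFO"), LEVELS["INFO"])
    if level < LEVELS[MIN_LEVEL]:
        return False
    status = record.get("status")
    is_success = (
        level < LEVELS["ERROR"]
        and isinstance(status, int)
        and 200 <= status < 400
    )
    if not is_success:
        return True  # keep every error and anything that is not a clean success
    return sample_bucket(str(record.get("request_id", ""))) < SUCCESS_SAMPLE_PERCENT


records = [
    {"path": "/healthz", "level": "INFO", "status": 200, "request_id": "r1"},
    {"path": "/pay", "level": "ERROR", "status": 500, "request_id": "r2"},
    {"path": "/pay", "level": "DEBUG", "status": 200, "request_id": "r3"},
]
print([should_keep(r) for r in records])  # [False, True, False]
```

Hash-based sampling is deterministic, unlike a random draw, so a sampled request keeps all of its lines and the result is reproducible in tests.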

Retention Strategy

Tier    Storage                                  Retention
Hot     Indexed (Elastic, Loki, CloudWatch)      7–30 days
Warm    Indexed but cheaper hardware             30–90 days
Cold    Object storage (S3, GCS), unindexed      1–7 years for compliance

Most queries hit the last 24 hours. Sizing the hot tier for that, with cheap cold storage behind it, dramatically reduces cost.
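
A tiering policy like the one in the table boils down to a function of log age. The sketch below assumes the upper bound of each range (30 days, 90 days, 7 years) as the cut-off; the constants and the `tier_for` function are hypothetical, and real systems express this through lifecycle policies rather than application code.

```python
from datetime import datetime, timedelta, timezone

# Assumed cut-offs: the upper end of each retention range in the table.
HOT_MAX = timedelta(days=30)
WARM_MAX = timedelta(days=90)
COLD_MAX = timedelta(days=7 * 365)


def tier_for(log_time: datetime, now: datetime) -> str:
    """Return the storage tier a log of this age belongs in."""
    age = now - log_time
    if age <= HOT_MAX:
        return "hot"
    if age <= WARM_MAX:
        return "warm"
    if age <= COLD_MAX:
        return "cold"
    return "expired"  # past retention: eligible for deletion


now = datetime(2024, 6, 1, tzinfo=timezone.utc)  # fixed time for a repeatable example
for days in (1, 45, 400, 3000):
    print(days, "days old ->", tier_for(now - timedelta(days=days), now))
```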

The Pragmatic Choice

  • Already on Grafana for metrics? Add Loki — zero new UI, cheap, simple.
  • Need full-text search and complex analytics? Use Elasticsearch / OpenSearch.
  • Single-cloud and want zero ops? Use the native service.
  • Already paying for Datadog? Just use Datadog Logs.

The next lesson covers structured logging, the application-side practice that makes any of these tools far more useful.

Key Takeaways

  • A log pipeline has four stages: produce, ship, store, query.
  • ELK (Elasticsearch + Logstash/Beats + Kibana) indexes everything — powerful but expensive.
  • Loki indexes only labels, stores log content cheaply — Prom-style queries on logs.
  • Cloud services (CloudWatch Logs, Azure Monitor, Cloud Logging) are zero-ops but bring vendor lock-in.
  • Always rotate log files, structure log lines, and bound log volume before shipping.
