5 min read·Lesson 4 of 10

Grafana Dashboards and Visualization

Build dashboards that operators actually use. Panels, variables, the RED and USE methods, and dashboard hygiene.

Grafana is the dashboarding layer over Prometheus, Loki, Tempo, CloudWatch, Elasticsearch, BigQuery, MySQL, PostgreSQL — anywhere you store time-series or queryable data. It does not store metrics itself.

Anatomy of a Dashboard

  • Data sources — connections to backends (Prometheus, Loki, etc.)
  • Panels — individual visualisations (graph, stat, table, gauge, heatmap)
  • Variables — drop-downs at the top that templatize queries
  • Time picker — global time range and refresh interval
  • Annotations — overlay events (deploys, incidents) on graphs

Panel Types and When to Use Them

Type               Use for
Time series        Anything over time (rate, latency, errors)
Stat               One big number: current value, with trend sparkline
Gauge / Bar gauge  "How full": disk, queue, connection pool
Table              Top-N lists, top slow endpoints, current alerts
Heatmap            Latency distributions over time
Logs               Live tail of a Loki / Elasticsearch query
Pie chart          Almost never; humans read time series better

The RED Method (Services)

For every service, show three things on the top row:

  • Rate — requests per second
  • Errors — error rate or error count per second
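  • Duration — latency distribution (p50/p95/p99)

In PromQL, assuming an http_requests_total counter and an http_request_duration_seconds histogram that both carry a service label:

# Rate: requests per second, per service
sum by (service) (rate(http_requests_total[5m]))

# Errors: 5xx responses per second, per service
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))

# Errors as a ratio of all requests (0 to 1)
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))

# Duration: p95 latency in seconds, per service
histogram_quantile(0.95,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)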

RED is the smallest dashboard that covers the most ground. Start every service with these three panels, then add more.
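
To make the Duration panel less of a black box, here is a simplified Python sketch of the interpolation that histogram_quantile performs over cumulative buckets. The bucket boundaries and counts are made-up sample data, and the function skips edge cases that the real PromQL implementation handles (negative bounds, native histograms), so treat it as an illustration rather than a reference implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    Simplified model of PromQL's histogram_quantile: find the bucket the
    target rank falls into, then interpolate linearly inside that bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the le="+Inf" bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank lands in the +Inf bucket: fall back to the highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical cumulative counts for le = 0.1s, 0.25s, 0.5s, 1s, +Inf
sample = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
p90 = histogram_quantile(0.90, sample)
```

With this sample the p90 lands inside the 0.25s to 0.5s bucket, so the estimate is only as precise as the bucket boundaries allow. That is a reason to place bucket boundaries near the latencies you care about.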

The USE Method (Resources)

For every resource (CPU, disk, memory, network):

  • Utilisation — % of time it is busy
  • Saturation — extra work queued (load average, run queue, swap usage)
  • Errors — error counts (disk errors, network drops)

USE is for hosts and infrastructure; RED is for services. Use both.
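
As a rough illustration of a USE row for a host, the Python sketch below assembles one PromQL query per letter. It assumes the hosts run node_exporter (the node_cpu_seconds_total, node_load1, and node_network_receive_errs_total metrics come from it) and that the dashboard defines an $instance variable; the function name and the choice of load-per-core as the saturation signal are illustrative, not prescribed by the method.

```python
def use_queries(instance="$instance", window="5m"):
    """Return PromQL strings for a CPU/network USE row, keyed by panel name."""
    sel = f'instance=~"{instance}"'
    idle = f'node_cpu_seconds_total{{mode="idle",{sel}}}'
    return {
        # Utilisation: fraction of CPU time not spent idle (0 to 1)
        "utilisation": f"1 - avg by (instance) (rate({idle}[{window}]))",
        # Saturation: 1-minute load average divided by the number of cores
        "saturation": (
            f"sum by (instance) (node_load1{{{sel}}})"
            f" / count by (instance) ({idle})"
        ),
        # Errors: network receive errors per second
        "errors": (
            "sum by (instance) "
            f"(rate(node_network_receive_errs_total{{{sel}}}[{window}]))"
        ),
    }


queries = use_queries()
```

A saturation value above 1 means more runnable work than cores, which is usually a clearer signal than raw load average on mixed hardware.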

Variables

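Variables let one dashboard serve many targets, so you don't copy-paste a dashboard per service. Each variable is populated by a query, and later variables can filter on earlier ones:

$service  = label_values(http_requests_total, service)
$env      = label_values(http_requests_total{service="$service"}, env)
$instance = label_values(up{service="$service",env="$env"}, instance)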

One dashboard, drop-downs at the top — works for every service, environment, and instance.
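
A rough mental model of what happens when someone picks values in the drop-downs: the selected values are substituted into each panel query before it is sent to the data source. The Python sketch below models only the single-value case using string.Template, which happens to share the $name syntax; Grafana's real interpolation also handles multi-value selections and escaping. The values "checkout" and "prod" are hypothetical.

```python
from string import Template


def interpolate(query, **selected):
    """Substitute selected variable values into a query (single-value case only)."""
    return Template(query).substitute(**selected)


panel_query = 'sum(rate(http_requests_total{service="$service",env="$env"}[5m]))'

# What the data source would receive after the user picks checkout / prod
expanded = interpolate(panel_query, service="checkout", env="prod")
```

The same panel definition therefore yields a different concrete query for every combination of drop-down values.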

Annotations

Overlay deploys, incidents, feature-flag flips on every graph. Two common sources:

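  • Prometheus annotation query: changes(deploy_info[1m]) > 0, assuming deploy_info is a gauge whose value changes on each deploy (a deploy timestamp, for example)
  • HTTP webhook from your CI: POST to the Grafana annotations API (/api/annotations) on each deploy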

The visual correlation of "p99 spiked exactly at the deploy" is what shortens incidents.
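
The CI webhook can be sketched in a few lines of Python. This builds, but does not send, a POST to Grafana's /api/annotations endpoint; omitting a dashboard reference makes it an organisation-wide annotation that dashboards can filter by tag. The URL, token, service name, version, and the function name are all placeholders for illustration.

```python
import json
import urllib.request


def deploy_annotation_request(grafana_url, token, service, version, time_ms):
    """Build (but do not send) a POST /api/annotations request for a deploy."""
    payload = {
        "time": time_ms,  # epoch milliseconds
        "tags": ["deploy", service],
        "text": f"Deployed {service} {version}",
    }
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values; a CI job would pass its own URL, token, and timestamp.
req = deploy_annotation_request(
    "https://grafana.example.com", "TOKEN", "checkout", "v1.4.2", 1700000000000
)
# urllib.request.urlopen(req) would actually send it.
```

Tagging each annotation with the service name lets a dashboard show only the deploys relevant to the service selected in its variables.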

Dashboard Hygiene

  • One dashboard, one job. "Service overview", "DB internals", "Browser performance" — separate.
  • Consistent layout: top row = RED, second row = saturation, third row = errors and warnings.
  • Set units on every panel (ms, %, requests/s). Grafana only scales and labels values correctly when the unit is set.
  • Avoid hard-coded values. Templatize with variables.
  • Add a text panel at the top with a short description, a runbook link, and the owning team.
  • Link panels — clicking on a service in a list jumps to its detail dashboard.
  • Delete dashboards no one uses. Stale dashboards are worse than none.
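
Some of these rules can be checked mechanically. The Python sketch below lints a dashboard JSON document for two of them: panels with no unit and dashboards with too many panels. It assumes the unit lives at fieldConfig.defaults.unit, as in dashboards exported from recent Grafana versions, and the 20-panel threshold is an arbitrary illustrative default, not a Grafana limit.

```python
def lint_dashboard(dashboard, max_panels=20):
    """Return a list of hygiene warnings for a dashboard JSON dict."""
    warnings = []
    # Rows and text panels carry no data, so they are exempt from the checks.
    panels = [
        p for p in dashboard.get("panels", [])
        if p.get("type") not in ("row", "text")
    ]
    if len(panels) > max_panels:
        warnings.append(f"{len(panels)} panels (limit {max_panels})")
    for p in panels:
        unit = p.get("fieldConfig", {}).get("defaults", {}).get("unit")
        if not unit:
            warnings.append(f"panel '{p.get('title', '?')}' has no unit set")
    return warnings


# Hypothetical dashboard: one panel with a unit, one without
example = {
    "title": "Service overview",
    "panels": [
        {"type": "text", "title": "About"},
        {"type": "timeseries", "title": "Rate",
         "fieldConfig": {"defaults": {"unit": "reqps"}}},
        {"type": "timeseries", "title": "Duration p95"},
    ],
}
problems = lint_dashboard(example)
```

Run in CI against the dashboard files in Git, a check like this catches hygiene drift before it reaches operators.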

Provisioning Dashboards as Code

Manually clicking around the UI is fine for prototyping; production dashboards belong in Git.

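# Grafana dashboard provisioning config; on a default Linux package install
# this file lives under /etc/grafana/provisioning/dashboards/
apiVersion: 1
providers:
  - name: 'default'        # unique provider name
    folder: 'Services'     # Grafana folder the dashboards appear in
    type: file             # load dashboards from the local filesystem
    options:
      path: /var/lib/grafana/dashboards   # directory scanned for dashboard JSON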

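Drop dashboard JSON files in that folder. You can write the JSON by hand or generate it with Jsonnet/Grafonnet; alternatively, skip file provisioning and manage dashboards through the Terraform grafana_dashboard resource or Pulumi. Code review your dashboards like any other config.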
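
As a minimal sketch of dashboards as code without extra tooling, the Python below generates the JSON for a three-panel RED row. The field names follow the dashboard JSON that Grafana exports, but this is a trimmed subset: the uid, title, and helper names are invented for illustration, no data source is set (so the default one is assumed), and a dashboard exported from the UI will contain many more fields.

```python
import json


def panel(title, expr, unit, x):
    """One time series panel, 8 grid columns wide (the grid is 24 wide)."""
    return {
        "type": "timeseries",
        "title": title,
        "gridPos": {"h": 8, "w": 8, "x": x, "y": 0},
        "fieldConfig": {"defaults": {"unit": unit}},
        "targets": [{"refId": "A", "expr": expr}],
    }


def red_dashboard():
    """Build a dashboard dict with Rate, Errors, and Duration side by side."""
    sel = 'service="$service"'
    return {
        "uid": "red-overview",        # hypothetical stable identifier
        "title": "Service overview",  # hypothetical title
        "panels": [
            panel("Rate", f"sum(rate(http_requests_total{{{sel}}}[5m]))", "reqps", 0),
            panel(
                "Errors",
                f'sum(rate(http_requests_total{{{sel},status=~"5.."}}[5m]))',
                "reqps",
                8,
            ),
            panel(
                "Duration p95",
                "histogram_quantile(0.95, sum by (le) "
                f"(rate(http_request_duration_seconds_bucket{{{sel}}}[5m])))",
                "s",
                16,
            ),
        ],
    }


# A CI job would write this string to a .json file in the provisioned folder.
dashboard_json = json.dumps(red_dashboard(), indent=2)
```

Because the output is plain JSON, a diff in code review shows exactly which query or unit changed.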

Alerts in Grafana vs Alertmanager

Grafana now has its own unified alerting that can drive Alertmanager or run standalone. Two viable patterns:

  • Prometheus rules + Alertmanager — Prom-native, alerts live with metrics.
  • Grafana alerting — works across multiple data sources, easier UI.

Pick one and stick with it. Splitting alerts across two systems creates duplicate noise.

The Real Goal

A great Grafana dashboard answers, in five seconds, "is this service healthy right now?" If you have to scroll, click, and squint to figure out the answer, the dashboard has failed. Design top-down: green/yellow/red status row at the top, drill-downs below.

Key Takeaways

  • Grafana queries many data sources (Prometheus, Loki, CloudWatch, BigQuery, MySQL, and more) and stores no metrics itself.
  • Use the RED method (Rate, Errors, Duration) for services and USE (Utilisation, Saturation, Errors) for resources.
  • Variables make one dashboard reusable across services, environments, and instances.
  • Good dashboards answer one question well; great dashboards link to drill-downs.
  • Less is more — dashboards with 40 panels never get read.
