Grafana Dashboards and Visualization — Observability and Monitoring | CertQnA

Grafana is the dashboarding layer over Prometheus, Loki, Tempo, CloudWatch, Elasticsearch, BigQuery, MySQL, PostgreSQL, and any other backend that holds time-series or queryable data. It does not store metrics itself; every panel queries a data source when it renders.

Anatomy of a Dashboard

Data sources — connections to backends (Prometheus, Loki, etc.)
Panels — individual visualisations (graph, stat, table, gauge, heatmap)
Variables — drop-downs at the top that templatize queries
Time picker — global time range and refresh interval
Annotations — overlay events (deploys, incidents) on graphs
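
As a rough illustration of how these five pieces fit together, the sketch below assembles a minimal dashboard model as a Python dict and serialises it to JSON. The top-level keys (panels, templating, time, refresh, annotations) follow the general shape of Grafana's exported dashboard JSON, but this is a trimmed-down sketch rather than a complete schema. The data source UID prom-main, the titles, and the panel layout are placeholders invented for the example.

```python
import json

# Hypothetical data source reference; the uid must match one configured in your Grafana.
PROMETHEUS = {"type": "prometheus", "uid": "prom-main"}


def build_dashboard():
    """Return a minimal dashboard model that contains each building block once."""
    return {
        "title": "Service overview (example)",
        # Time picker: global time range and refresh interval.
        "time": {"from": "now-6h", "to": "now"},
        "refresh": "30s",
        # Variables: rendered as drop-downs at the top of the dashboard.
        "templating": {"list": [{
            "name": "service",
            "type": "query",
            "datasource": PROMETHEUS,
            "query": "label_values(http_requests_total, service)",
        }]},
        # Annotations: events overlaid on the graphs.
        "annotations": {"list": [{
            "name": "Deploys",
            "datasource": PROMETHEUS,
            "expr": "changes(deploy_info[1m]) > 0",
            "enable": True,
        }]},
        # Panels: one time series panel that uses the data source and the variable.
        "panels": [{
            "type": "timeseries",
            "title": "Request rate",
            "datasource": PROMETHEUS,
            "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
            "targets": [{
                "refId": "A",
                "expr": 'sum(rate(http_requests_total{service="$service"}[5m]))',
            }],
        }],
    }


dashboard_json = json.dumps(build_dashboard(), indent=2)
```

Building the model in code like this is one way to keep panels consistent across dashboards; the JSON it produces would still need a real data source UID before Grafana could render it.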

Panel Types and When to Use Them

Type	Use for
Time series	Anything over time (rate, latency, errors)
Stat	One big number — current value, with trend sparkline
Gauge / Bar gauge	"How full" — disk, queue, connection pool
Table	Top-N lists, top slow endpoints, current alerts
Heatmap	Latency distributions over time
Logs	Live tail of Loki / Elasticsearch query
Pie chart	Almost never; people compare angles poorly, and a bar gauge, table, or time series usually shows the same proportions more clearly

The RED Method (Services)

For every service, show three things on the top row:

Rate — requests per second
Errors — error rate or error count per second
Duration — latency distribution (p50/p95/p99)

# Rate
sum by (service) (rate(http_requests_total[5m]))

# Errors
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))

# Duration p95
histogram_quantile(0.95,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)

RED is the smallest dashboard that covers the most ground. Start every service with these three panels, then add more.
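
The Duration query depends on histogram_quantile, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified Python rendition of that idea for classic (le-labelled) histograms, not Prometheus's actual implementation. It assumes the first bucket starts at 0, and the bucket values are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must contain the +Inf bucket and at least one finite bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]            # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total                  # position of the target observation
    prev_bound, prev_count = 0.0, 0.0  # assumption: the lowest bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket, so the best
                # available answer is that bucket's upper bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up latency buckets in seconds: (le, cumulative count).
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency_buckets)  # rank 95 falls in the 0.5 to 1.0 bucket
```

The estimate can never be more precise than the bucket it lands in, which is why bucket boundaries should bracket the latencies you care about.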

The USE Method (Resources)

For every resource (CPU, disk, memory, network):

Utilisation — % of time it is busy
Saturation — extra work queued (load average, run queue, swap usage)
Errors — error counts (disk errors, network drops)

USE is for hosts and infrastructure; RED is for services. Use both.
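
To make USE concrete, here is a sketch that lays out one row per resource and one column per signal. It assumes hosts are scraped with Prometheus node_exporter, which the text above does not specify; the metric names are node_exporter's, and the panel sizes are arbitrary choices for a 24-column grid.

```python
# Assumption: metrics come from node_exporter. Swap in your own exporter's names.
USE_QUERIES = [
    ("CPU", "Utilisation",
     '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))'),
    ("CPU", "Saturation", "node_load1"),
    ("Disk", "Utilisation", "rate(node_disk_io_time_seconds_total[5m])"),
    ("Disk", "Saturation", "rate(node_disk_io_time_weighted_seconds_total[5m])"),
    ("Network", "Errors",
     "rate(node_network_receive_drop_total[5m])"
     " + rate(node_network_transmit_drop_total[5m])"),
]

COLUMNS = {"Utilisation": 0, "Saturation": 1, "Errors": 2}


def use_panels(queries=USE_QUERIES, width=8, height=6):
    """Build time series panels: one grid row per resource, one column per signal."""
    row_of = {}
    panels = []
    for resource, signal, expr in queries:
        row = row_of.setdefault(resource, len(row_of))
        panels.append({
            "type": "timeseries",
            "title": f"{resource} {signal.lower()}",
            "targets": [{"refId": "A", "expr": expr}],
            "gridPos": {"x": COLUMNS[signal] * width, "y": row * height,
                        "w": width, "h": height},
        })
    return panels
```

Gaps in the grid are deliberate: where a resource has no good metric for a signal, an empty cell is more honest than a misleading panel.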

Variables

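Variables turn one dashboard into a template, so you do not copy-paste a dashboard per service or environment. Each variable is populated by a query, and later variables can filter on the ones chosen before them: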

$service = label_values(http_requests_total, service)
$env     = label_values(http_requests_total{service="$service"}, env)
$instance = label_values(up{service="$service",env="$env"}, instance)

One dashboard, drop-downs at the top — works for every service, environment, and instance.
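
Conceptually, Grafana substitutes the selected values into each query before sending it to the data source. The sketch below imitates that substitution in Python to show the mechanics; it is a simplification, not Grafana's code. In particular, joining multiple selections into a regex group is only an approximation of how multi-value variables are formatted for Prometheus, and real Grafana also escapes special characters.

```python
import re

# Matches both ${name} and $name forms.
_VARIABLE = re.compile(r"\$\{(\w+)\}|\$(\w+)")


def interpolate(query, values):
    """Replace $name / ${name} in `query` with the selected values."""
    def replace(match):
        name = match.group(1) or match.group(2)
        if name not in values:
            return match.group(0)      # leave unknown variables untouched
        value = values[name]
        if isinstance(value, (list, tuple)):
            # Multi-value selection: emit a regex alternation, so the query
            # must use =~ rather than = for this label.
            return "(" + "|".join(value) + ")"
        return str(value)
    return _VARIABLE.sub(replace, query)


single = interpolate(
    'up{service="$service",env="$env"}',
    {"service": "checkout", "env": "prod"},
)
multi = interpolate(
    'up{instance=~"${instance}"}',
    {"instance": ["a:9100", "b:9100"]},
)
```

The multi-value case is the usual reason a templated query silently returns nothing: a variable that allows several selections needs the regex matcher =~ in the query.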

Annotations

Overlay deploys, incidents, feature-flag flips on every graph. Two common sources:

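Prometheus annotation query: changes(deploy_info[1m]) > 0 (this fires only if the value of deploy_info changes on each deploy, for example a deploy timestamp; a metric whose value is always 1 never registers a change)
HTTP webhook from your CI: POST to the Grafana annotations API (/api/annotations) on each deploy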

The visual correlation of "p99 spiked exactly at the deploy" is what shortens incidents.
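
A CI step for the webhook approach might look like the sketch below. It only builds the request so that it can be inspected without a live server. The base URL, token, service name, and version are placeholders, and the tag scheme ("deploy" plus the service name) is an assumption chosen for the example.

```python
import json
import time
import urllib.request


def build_deploy_annotation(base_url, token, service, version, when_ms=None):
    """Build a POST request for Grafana's /api/annotations endpoint."""
    payload = {
        # Grafana expects epoch milliseconds.
        "time": when_ms if when_ms is not None else int(time.time() * 1000),
        "tags": ["deploy", service],
        "text": f"Deployed {service} {version}",
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_deploy_annotation(
    "https://grafana.example.com/", "TOKEN", "checkout", "v1.2.3",
    when_ms=1700000000000,
)
# In a real pipeline you would send it:
# urllib.request.urlopen(request, timeout=10)
```

Because the payload names no dashboard, the annotation is not tied to one; a dashboard would then pick it up through an annotation query that filters on the deploy tag.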

Dashboard Hygiene

One dashboard, one job. "Service overview", "DB internals", "Browser performance" — separate.
Consistent layout: top row = RED, second row = saturation, third row = errors and warnings.
Use units (ms, %, requests/s). Once a unit is set, Grafana scales and labels values for you, for example showing 1500 ms as 1.5 s.
Avoid hard-coded values. Templatize with variables.
Add a "Description" panel at the top with a runbook link and ownership.
Link panels — clicking on a service in a list jumps to its detail dashboard.
Delete dashboards no one uses. Stale dashboards are worse than none.
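
Some of these rules can be checked mechanically. The following is a minimal sketch of a lint pass over a dashboard's JSON model, covering three of the rules above: variables exist, every data panel has a unit, and there is a text panel for the runbook link and ownership. The field paths follow Grafana's dashboard JSON as commonly exported; the rule set and messages are invented for the example.

```python
def lint_dashboard(dashboard):
    """Return a list of hygiene problems found in a dashboard JSON model."""
    problems = []
    panels = dashboard.get("panels", [])

    if not dashboard.get("templating", {}).get("list"):
        problems.append("no variables defined; queries are probably hard-coded")

    if not any(panel.get("type") == "text" for panel in panels):
        problems.append("no text panel with runbook link and ownership")

    for panel in panels:
        if panel.get("type") in ("text", "row"):
            continue                   # layout panels carry no data
        unit = panel.get("fieldConfig", {}).get("defaults", {}).get("unit")
        if not unit:
            problems.append(f"{panel.get('title', '<untitled>')}: no unit set")

    return problems


example = {"panels": [{"type": "timeseries", "title": "Latency"}]}
example_problems = lint_dashboard(example)
```

Run against the example, this reports three problems: no variables, no text panel, and a Latency panel without a unit.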

Provisioning Dashboards as Code

Manually clicking around the UI is fine for prototyping; production dashboards belong in Git.

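# Grafana dashboard provider config; typically placed under
# /etc/grafana/provisioning/dashboards/ (the file name is up to you)
apiVersion: 1
providers:
  - name: 'default'      # unique name for this provider
    folder: 'Services'   # Grafana folder the dashboards appear in
    type: file           # load dashboards from the local filesystem
    options:
      path: /var/lib/grafana/dashboards   # directory scanned for dashboard JSON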

Drop dashboard JSON files into the directory named by path (or generate them with Jsonnet/Grafonnet, the Terraform grafana_dashboard resource, or Pulumi). Code review your dashboards like any other config.
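
Treating dashboards as code also means they can be checked in CI before they reach Grafana. The sketch below is one possible pre-merge check, assuming each file holds a single dashboard object with title and uid fields; the specific rules (valid JSON, non-empty title, unique uid) are suggestions rather than anything Grafana itself enforces this way.

```python
import json
from pathlib import Path


def check_dashboards(directory):
    """Validate every *.json dashboard in `directory`; return a list of errors."""
    errors = []
    seen_uids = {}
    for path in sorted(Path(directory).glob("*.json")):
        try:
            dashboard = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            errors.append(f"{path.name}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(dashboard, dict):
            errors.append(f"{path.name}: top level is not an object")
            continue
        if not dashboard.get("title"):
            errors.append(f"{path.name}: missing title")
        uid = dashboard.get("uid")
        if not uid:
            errors.append(f"{path.name}: missing uid")
        elif uid in seen_uids:
            errors.append(f"{path.name}: duplicate uid, also in {seen_uids[uid]}")
        else:
            seen_uids[uid] = path.name
    return errors
```

A CI job could call check_dashboards on the repository's dashboard directory and fail the build when the returned list is not empty.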

Alerts in Grafana vs Alertmanager

Grafana ships its own unified alerting, which can send notifications itself or forward alerts to an external Alertmanager. Two viable patterns:

Prometheus rules + Alertmanager — Prom-native, alerts live with metrics.
Grafana alerting — works across multiple data sources, easier UI.

Pick one and stick with it. Splitting alerts across two systems creates duplicate noise.

The Real Goal

A great Grafana dashboard answers, in five seconds, "is this service healthy right now?" If you have to scroll, click, and squint to figure out the answer, the dashboard has failed. Design top-down: green/yellow/red status row at the top, drill-downs below.
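
A status row of this kind is usually built from stat panels with colour thresholds. The sketch below builds such a panel for an error ratio and includes a small helper that mirrors how absolute thresholds pick a colour. The 1% and 5% thresholds are arbitrary example values, not recommendations, and the panel JSON is trimmed to the fields relevant here.

```python
def status_panel(title, expr, warn, crit):
    """Build a stat panel whose background turns yellow at `warn` and red at `crit`."""
    return {
        "type": "stat",
        "title": title,
        "targets": [{"refId": "A", "expr": expr}],
        "options": {"colorMode": "background"},
        "fieldConfig": {"defaults": {
            "unit": "percentunit",     # values are ratios between 0.0 and 1.0
            "thresholds": {"mode": "absolute", "steps": [
                {"color": "green", "value": None},   # base colour
                {"color": "yellow", "value": warn},
                {"color": "red", "value": crit},
            ]},
        }},
    }


def colour_for(value, steps):
    """Return the colour of the highest threshold step that `value` has reached."""
    colour = steps[0]["color"]
    for step in steps[1:]:
        if value >= step["value"]:
            colour = step["color"]
    return colour


panel = status_panel(
    "Error ratio",
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))",
    warn=0.01,
    crit=0.05,
)
steps = panel["fieldConfig"]["defaults"]["thresholds"]["steps"]
```

With these example thresholds, an error ratio of 0.2% stays green, 2% turns yellow, and 20% turns red.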