Structured Logging — Observability and Monitoring | CertQnA

The single highest-leverage change you can make to your logs is to structure them. Plain-text logs have to be parsed with regular expressions that break whenever the message format changes. JSON logs are queryable on day one.

Bad vs Good

Bad:

2024-04-12 14:32:01 INFO User 42 logged in from 10.1.1.4 in 230ms

Good:

{
  "ts": "2024-04-12T14:32:01.123Z",
  "level": "info",
  "service": "auth-api",
  "env": "prod",
  "msg": "user_login",
  "user_id": 42,
  "ip": "10.1.1.4",
  "duration_ms": 230,
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "request_id": "req_abc123"
}

Now you can answer "all logins for user 42 in the last hour" with a single filter — no regex.
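As a rough illustration of what "a single filter" means, the sketch below parses a few JSON log lines and selects events by field equality. The sample lines and the `filter_events` helper are hypothetical, and a real log tool would also apply the time window for you.

```python
import json

# Hypothetical sample: three JSON log lines as they might sit in a log file.
raw_lines = [
    '{"ts": "2024-04-12T14:32:01.123Z", "msg": "user_login", "user_id": 42, "duration_ms": 230}',
    '{"ts": "2024-04-12T14:35:10.004Z", "msg": "user_login", "user_id": 7, "duration_ms": 180}',
    '{"ts": "2024-04-12T14:40:55.870Z", "msg": "user_logout", "user_id": 42}',
]

def filter_events(lines, **criteria):
    """Return parsed events whose fields equal every given criterion."""
    events = (json.loads(line) for line in lines)
    return [e for e in events if all(e.get(k) == v for k, v in criteria.items())]

# "All logins for user 42": two field comparisons, no regex.
logins_for_42 = filter_events(raw_lines, msg="user_login", user_id=42)
print(len(logins_for_42), logins_for_42[0]["ts"])
```

The same question against the text line in the "Bad" example would need a pattern that knows where the user ID sits in the sentence.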

The Canonical Field Set

Adopt a small core set across every service. Suggested:

Field	Purpose
`ts`	RFC 3339 / ISO 8601 timestamp with timezone
`level`	debug, info, warn, error, fatal
`service`	Service name (auth-api, payments)
`env`	prod / staging / dev
`version`	Build SHA or semver
`msg`	Short, low-cardinality event name
`trace_id` / `span_id`	OpenTelemetry context for pivoting to traces
`request_id`	Correlation across services for one user request
`user_id` / `tenant_id`	Business context

Add domain-specific fields per log line, but the core fields above should be present on every line where they apply.
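One way to guarantee the core fields is to stamp them in a single formatter, so individual call sites cannot forget them. The following is a minimal sketch using only Python's standard `logging` module; the `CORE` values and the convention of passing per-line fields under `extra={"fields": ...}` are assumptions made for this example, not a standard.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical static context; in practice it would come from the environment or build.
CORE = {"service": "auth-api", "env": "prod", "version": "9d4f1ab"}

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the core fields."""

    def format(self, record):
        event = {
            "ts": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            # Note: stdlib level names lower-case to "warning", not "warn".
            "level": record.levelname.lower(),
            **CORE,
            "msg": record.getMessage(),
        }
        # Per-line fields arrive through logging's `extra={"fields": {...}}`.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("auth-api")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False  # avoid a second, unstructured copy from the root logger

log.info("user_login", extra={"fields": {"user_id": 42, "request_id": "req_abc123"}})
```

Libraries such as structlog or pino provide this binding out of the box; the sketch only shows the idea.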

One Log Per Request: The Canonical Log Line

Stripe popularised this pattern: emit one richly structured log line per request that summarises everything. Instead of 30 small logs, you have one wide event.

{
  "ts": "2024-04-12T14:32:01Z",
  "level": "info",
  "service": "checkout",
  "msg": "request_complete",
  "method": "POST",
  "path": "/checkout",
  "status": 200,
  "duration_ms": 412,
  "db_calls": 3,
  "db_total_ms": 78,
  "downstream_ms": { "auth": 60, "payment": 145, "inventory": 30 },
  "user_id": 4242,
  "tenant_id": "acme",
  "trace_id": "4bf92...",
  "request_id": "req_abc123",
  "feature_flags": ["new_checkout=true", "mfa=on"],
  "build": "9d4f1ab"
}

Search "all checkout requests with status=200 and duration_ms>1000 last hour" with one filter. Faster, cheaper, more useful than 30 scattered DEBUG lines.
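A canonical log line is usually built by accumulating fields while the request runs and emitting once at the end. The sketch below shows one possible shape; the `CanonicalLogLine` class and its method names are hypothetical, and the handler flow is simulated rather than tied to any web framework.

```python
import json
import time

class CanonicalLogLine:
    """Collect fields during a request, then emit them as one JSON line."""

    def __init__(self, **initial):
        self.fields = dict(initial)
        self._start = time.monotonic()

    def set(self, **fields):
        self.fields.update(fields)

    def add(self, key, amount=1):
        # Counters and timers accumulate instead of producing one log per call.
        self.fields[key] = self.fields.get(key, 0) + amount

    def emit(self, msg="request_complete"):
        self.fields["duration_ms"] = round((time.monotonic() - self._start) * 1000)
        line = json.dumps({"msg": msg, **self.fields})
        print(line)
        return line

# Simulated handler flow for POST /checkout.
line = CanonicalLogLine(service="checkout", method="POST", path="/checkout")
line.set(user_id=4242, tenant_id="acme")
for query_ms in (20, 31, 27):  # three database calls
    line.add("db_calls")
    line.add("db_total_ms", query_ms)
line.set(status=200)
emitted = line.emit()
```

In a real service the `emit` call would typically live in middleware or a `finally` block, so the line is written even when the handler raises.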

Correlation IDs

A correlation ID is a request ID generated at the edge (load balancer, API gateway) and propagated via a header such as X-Request-Id, so that every downstream log line carries it. With OpenTelemetry, the trace_id carried in the W3C traceparent header serves the same purpose for free.

The pivot from "user reported error at 14:32" to "every log line for that request across 8 services" is what turns logs from forensic guesswork into a powerful tool.
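The propagation step can be sketched with Python's `contextvars`, which keeps the current request's ID available to any code running on behalf of that request. The function names here are hypothetical, and the plain dict lookup assumes exact header casing, whereas real frameworks expose case-insensitive header maps.

```python
import contextvars
import json
import uuid

# Holds the current request's ID for whatever code runs while handling it.
request_id_var = contextvars.ContextVar("request_id", default=None)

def begin_request(headers):
    """Reuse the caller's X-Request-Id, or mint one if this service is the edge."""
    rid = headers.get("X-Request-Id") or f"req_{uuid.uuid4().hex}"
    request_id_var.set(rid)
    return rid

def outgoing_headers():
    """Headers to attach to downstream calls so the ID keeps propagating."""
    return {"X-Request-Id": request_id_var.get()}

def log_event(msg, **fields):
    """Every log line picks up the request ID without the caller passing it."""
    line = json.dumps({"msg": msg, "request_id": request_id_var.get(), **fields})
    print(line)
    return line

begin_request({"X-Request-Id": "req_abc123"})
log_event("payment_authorised", amount_cents=1999)
```

Because the ID lives in a context variable rather than a global, concurrent requests handled by asyncio tasks or threads each see their own value.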

Log Levels — The Four You Need

`DEBUG`	Verbose for local development. Off in prod by default.
`INFO`	Significant events you would want during an incident.
`WARN`	Something unexpected, recoverable, worth investigating later.
`ERROR`	Something failed; alerting may be appropriate.

Log levels are a contract with future you. WARN means "something to look at"; if you cry wolf at WARN, no one will look.

Secrets and PII

Logs are the easiest place to leak secrets and personal data. Common offenders:

Authorization headers, API keys, JWTs
Full credit card numbers, CVVs (PCI violation)
Email addresses, phone numbers, government IDs (PII / GDPR / HIPAA)
Stack traces with environment variables
SQL queries containing parameters

Three defences:

Field allow-lists in your logger — only known fields are emitted.
Redaction at the shipper (Fluent Bit/Vector regex masks).
Pre-prod tests that scan logs for credit-card and JWT patterns.
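The first and third defences can be sketched in a few lines of Python. The allow-list contents, the helper names, and the two regular expressions are assumptions for illustration; the patterns are deliberately rough and would need tuning against real traffic.

```python
import re

# Hypothetical allow-list: only these keys may ever reach the log output.
ALLOWED_FIELDS = {"ts", "level", "service", "msg", "user_id", "request_id", "duration_ms"}

def allow_listed(event):
    """Drop every field that is not explicitly allowed."""
    return {k: v for k, v in event.items() if k in ALLOWED_FIELDS}

# Rough patterns: a JWT is three base64url segments whose header starts with "eyJ";
# a card number is 13 to 19 digits, optionally separated by spaces or dashes.
JWT_RE = re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def luhn_ok(digits):
    """Luhn checksum, used to cut false positives from ordinary long numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0

def find_leaks(log_text):
    """Return the kinds of sensitive data found in a chunk of log output."""
    leaks = []
    if JWT_RE.search(log_text):
        leaks.append("jwt")
    for match in CARD_RE.finditer(log_text):
        if luhn_ok(re.sub(r"\D", "", match.group())):
            leaks.append("card_number")
            break
    return leaks
```

A pre-production test could capture the log output of an integration run and assert that `find_leaks` returns an empty list for it.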

Logging in Code: Examples

Node (pino):

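import pino from 'pino';

// Fields in `base` are attached to every log line.
const log = pino({
  base: { service: 'checkout', env: process.env.NODE_ENV, version: process.env.GIT_SHA },
  // Values at these paths are replaced with "[Redacted]" before the line is written.
  redact: ['req.headers.authorization', 'password', '*.creditCard'],
});

// Inside a request handler, where `req.id` holds the request's correlation ID:
log.info({ user_id: 42, duration_ms: 230, request_id: req.id }, 'user_login');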

Python (structlog):

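import structlog

# structlog's default output is console text; JSONRenderer emits one JSON object per event.
structlog.configure(processors=[structlog.processors.JSONRenderer()])
log = structlog.get_logger()

req_id = "req_abc123"  # normally taken from the incoming request
log.info("user_login", user_id=42, duration_ms=230, request_id=req_id)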

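Go (slog, in the standard library since Go 1.21; zap is a common alternative):

// Needs "log/slog" and "os". The default handler writes text, so install JSON once at startup.
slog.SetDefault(slog.New(slog.NewJSONHandler(os.Stdout, nil)))

// reqID is the current request's correlation ID.
slog.Info("user_login",
    "user_id", 42,
    "duration_ms", 230,
    "request_id", reqID,
)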

Every modern language has a structured logger. Use it.

Anti-Patterns

Concatenating into msg: "User " + id + " logged in" — defeats the whole point.
Putting JSON inside JSON: a JSON-encoded string stored as a field value cannot be filtered on, so log the inner values as real fields.
Stack traces split across many lines — emit as a single field.
Logging in tight loops — sample or aggregate.
Logging at ERROR for non-errors (404s, validation failures) — wakes people for nothing.
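Two of these fixes are small enough to sketch directly: emitting a stack trace as a single field, and sampling inside hot loops. The helper names below are hypothetical, and the sampling is a simple random draw rather than anything rate-aware.

```python
import json
import random
import traceback

def log_exception(msg, exc, **fields):
    """Emit one JSON line; the multi-line traceback becomes a single string field."""
    stack = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    line = json.dumps({"level": "error", "msg": msg, "error": repr(exc), "stack": stack, **fields})
    print(line)
    return line

def should_log(sample_rate, rng=random.random):
    """Keep roughly `sample_rate` of events, for logging inside hot loops."""
    return rng() < sample_rate

try:
    int("not-a-number")
except ValueError as exc:
    error_line = log_exception("parse_failed", exc, request_id="req_abc123")

# Inside a tight loop, log about 1% of iterations instead of all of them.
kept = sum(should_log(0.01) for _ in range(10_000))
```

JSON encoding escapes the newlines in the traceback, so the shipper sees one event instead of a dozen unrelated lines.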

The Practical Bar

If you can answer these in your log tool today, you are doing well:

"Show me everything that happened for request_id X across all services."
"Show me all errors for tenant Y in the last hour."
"Show me requests slower than 1 second by route."
"Show me everything that happened during the deploy at 14:30 ± 5 minutes."

If any of those is hard, fix the structure first.
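To make the bar concrete, here is a small sketch of two of those questions asked of canonical log lines held in memory. The sample events are hypothetical, and treating a 5xx status as an "error" is an assumption of this example.

```python
import json
from collections import Counter

# Hypothetical canonical log lines, one JSON object per request.
raw_lines = [
    '{"msg": "request_complete", "path": "/checkout", "status": 200, "duration_ms": 1412, "tenant_id": "acme"}',
    '{"msg": "request_complete", "path": "/checkout", "status": 500, "duration_ms": 2050, "tenant_id": "acme"}',
    '{"msg": "request_complete", "path": "/cart", "status": 200, "duration_ms": 95, "tenant_id": "globex"}',
    '{"msg": "request_complete", "path": "/cart", "status": 200, "duration_ms": 1130, "tenant_id": "acme"}',
]
events = [json.loads(line) for line in raw_lines]

# "Show me requests slower than 1 second by route."
slow_by_route = Counter(e["path"] for e in events if e["duration_ms"] > 1000)

# "Show me all errors for tenant Y." (5xx is the error definition used here)
errors_for_acme = [e for e in events if e["tenant_id"] == "acme" and e["status"] >= 500]

print(dict(slow_by_route), len(errors_for_acme))
```

Each question is a comparison on named fields, which is the property the rest of this page is trying to buy.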