The single highest-leverage change you can make to your logs is structuring them. Plain-text logs have to be parsed with regexes that break whenever a message format changes. JSON logs are queryable on day one.
Bad vs Good
Bad:
2024-04-12 14:32:01 INFO User 42 logged in from 10.1.1.4 in 230ms
Good:
{
  "ts": "2024-04-12T14:32:01.123Z",
  "level": "info",
  "service": "auth-api",
  "env": "prod",
  "msg": "user_login",
  "user_id": 42,
  "ip": "10.1.1.4",
  "duration_ms": 230,
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "request_id": "req_abc123"
}
Now you can answer "all logins for user 42 in the last hour" with a single filter — no regex.
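To make the "single filter" concrete, here is a minimal Python sketch that runs that query over newline-delimited JSON logs. The function names and the sample lines are illustrative assumptions; in practice your log tool's query language does this for you.

```python
import json
from datetime import datetime, timedelta, timezone


def parse_ts(value: str) -> datetime:
    # Older Python versions reject a trailing "Z" in fromisoformat(), so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def logins_for_user(lines, user_id, now, window=timedelta(hours=1)):
    """Yield parsed user_login events for one user within the window ending at `now`."""
    for line in lines:
        event = json.loads(line)
        if event.get("msg") != "user_login" or event.get("user_id") != user_id:
            continue
        if now - window <= parse_ts(event["ts"]) <= now:
            yield event


# Hypothetical sample data: one match, one other user, one login outside the window.
sample = [
    '{"ts": "2024-04-12T14:32:01.123Z", "msg": "user_login", "user_id": 42, "ip": "10.1.1.4"}',
    '{"ts": "2024-04-12T14:40:00.000Z", "msg": "user_login", "user_id": 7, "ip": "10.1.1.9"}',
    '{"ts": "2024-04-12T11:00:00.000Z", "msg": "user_login", "user_id": 42, "ip": "10.1.1.4"}',
]
now = datetime(2024, 4, 12, 15, 0, tzinfo=timezone.utc)
matches = list(logins_for_user(sample, user_id=42, now=now))
```

Every condition is a comparison on a typed field. Nothing depends on the wording of a message.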
The Canonical Field Set
Adopt a small core set across every service. Suggested:
| Field | Purpose |
|---|---|
| ts | RFC 3339 / ISO 8601 timestamp with timezone |
| level | debug, info, warn, error, fatal |
| service | Service name (auth-api, payments) |
| env | prod / staging / dev |
| version | Build SHA or semver |
| msg | Short, low-cardinality event name |
| trace_id / span_id | OpenTelemetry context for pivoting to traces |
| request_id | Correlation across services for one user request |
| user_id / tenant_id | Business context |
Add domain-specific fields per log line, but the core set above should always be present.
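One way to guarantee the core set is to stamp it in the formatter rather than at each call site. The sketch below uses only Python's standard logging module. The class name JsonFormatter, the "fields" key passed through extra, and the service values are assumptions chosen for illustration.

```python
import json
import logging
from datetime import datetime, timezone

# Map stdlib level names onto the level vocabulary used in the table above.
LEVEL_NAMES = {"WARNING": "warn", "CRITICAL": "fatal"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the core fields always set."""

    def __init__(self, service, env, version):
        super().__init__()
        self.core = {"service": service, "env": env, "version": version}

    def format(self, record):
        event = {
            "ts": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            **self.core,
            "msg": record.getMessage(),
        }
        # Per-line fields arrive via logging's `extra` mechanism under a
        # "fields" key (a convention assumed by this sketch).
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="auth-api", env="prod", version="9d4f1ab"))
log = logging.getLogger("auth-api")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("user_login", extra={"fields": {"user_id": 42, "request_id": "req_abc123"}})
```

Call sites pass only the event name and the fields specific to that line, so a service cannot forget service, env, or version.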
One Log Per Request: The Canonical Log Line
Stripe popularised this pattern: emit one richly structured log line per request that summarises everything. Instead of 30 small logs, you have one wide event.
{
  "ts": "2024-04-12T14:32:01Z",
  "level": "info",
  "service": "checkout",
  "msg": "request_complete",
  "method": "POST",
  "path": "/checkout",
  "status": 200,
  "duration_ms": 412,
  "db_calls": 3,
  "db_total_ms": 78,
  "downstream_ms": { "auth": 60, "payment": 145, "inventory": 30 },
  "user_id": 4242,
  "tenant_id": "acme",
  "trace_id": "4bf92...",
  "request_id": "req_abc123",
  "feature_flags": ["new_checkout=true", "mfa=on"],
  "build": "9d4f1ab"
}
You can now find "all checkout requests with status=200 and duration_ms>1000 in the last hour" with one filter. That is faster, cheaper, and more useful than piecing the answer together from 30 scattered DEBUG lines.
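A canonical log line is usually built by accumulating fields while the request runs and emitting once at the end. The following is a minimal sketch of that idea, not Stripe's implementation. CanonicalLine, request_log, and the emit callback are hypothetical names.

```python
import json
import time
from contextlib import contextmanager


class CanonicalLine:
    """Accumulates fields over the life of one request."""

    def __init__(self, **fields):
        self.fields = dict(fields)

    def set(self, **fields):
        self.fields.update(fields)

    def incr(self, key, amount=1):
        self.fields[key] = self.fields.get(key, 0) + amount


@contextmanager
def request_log(emit, **fields):
    """Yield an accumulator and emit exactly one JSON line when the request ends."""
    line = CanonicalLine(msg="request_complete", **fields)
    start = time.monotonic()
    try:
        yield line
    except Exception as exc:
        # Record the failure on the same wide event, then let it propagate.
        line.set(level="error", status=500, error=type(exc).__name__)
        raise
    finally:
        line.fields.setdefault("level", "info")
        line.set(duration_ms=round((time.monotonic() - start) * 1000))
        emit(json.dumps(line.fields))


emitted = []  # stands in for a real log sink
with request_log(emitted.append, service="checkout", method="POST", path="/checkout") as line:
    line.incr("db_calls")
    line.incr("db_total_ms", 26)
    line.set(status=200, user_id=4242, tenant_id="acme")
```

Handlers and middleware call set and incr wherever the information becomes available. The finally block ensures the line is written even when the request fails.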
Correlation IDs
A correlation ID is a request ID generated at the edge (load balancer, API gateway) and propagated via headers (X-Request-Id, traceparent) so that every downstream log line carries it. With OpenTelemetry, the trace_id serves the same purpose for free.
The pivot from "user reported error at 14:32" to "every log line for that request across 8 services" is what turns logs from forensic guesswork into a powerful tool.
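The propagation described above can be sketched with Python's contextvars, which keeps the ID scoped to the current request even under async concurrency. The helper names are hypothetical, and headers are modelled as a plain dict. Real frameworks usually expose a case-insensitive header mapping.

```python
import contextvars
import uuid

# Holds the ID for whichever request the current task or thread is serving.
request_id_var = contextvars.ContextVar("request_id", default=None)


def begin_request(headers):
    """Reuse the caller's X-Request-Id if present; otherwise mint a new one."""
    rid = headers.get("X-Request-Id") or f"req_{uuid.uuid4().hex[:12]}"
    request_id_var.set(rid)
    return rid


def outbound_headers():
    """Headers to attach to downstream calls so the next service logs the same ID."""
    rid = request_id_var.get()
    return {"X-Request-Id": rid} if rid else {}


def with_request_id(event):
    """Return a copy of a log event with the current request ID attached."""
    return {**event, "request_id": request_id_var.get()}


first_id = begin_request({"X-Request-Id": "req_abc123"})
event = with_request_id({"msg": "user_login", "user_id": 42})
downstream = outbound_headers()
```

Only the edge should mint an ID. Every other service reuses the one it receives.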
Log Levels — The Four You Need
| Level | When to use |
|---|---|
| DEBUG | Verbose for local development. Off in prod by default. |
| INFO | Significant events you would want during an incident. |
| WARN | Something unexpected, recoverable, worth investigating later. |
| ERROR | Something failed, alerting may be appropriate. |
Log levels are a contract with future you. WARN means "something to look at"; if you cry wolf at WARN, no one will look.
Secrets and PII
Logs are the easiest place to leak secrets and personal data. Common offenders:
- Authorization headers, API keys, JWTs
- Full credit card numbers, CVVs (PCI violation)
- Email addresses, phone numbers, government IDs (PII / GDPR / HIPAA)
- Stack traces with environment variables
- SQL queries containing parameters
Three defences:
- Field allow-lists in your logger — only known fields are emitted.
- Redaction at the shipper (Fluent Bit/Vector regex masks).
- Pre-prod tests that scan logs for credit-card and JWT patterns.
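The first and third defences can be sketched in a few lines of Python. The allowed field set and the regular expressions below are illustrative assumptions rather than a complete detector. The card pattern is a heuristic narrowed by a Luhn checksum, and the JWT pattern only looks for three base64url segments starting with "eyJ".

```python
import re

# Assumed allow-list: only these keys may ever reach the log output.
ALLOWED_FIELDS = {
    "ts", "level", "service", "env", "version", "msg", "trace_id", "span_id",
    "request_id", "user_id", "tenant_id", "duration_ms", "status",
}

JWT_RE = re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def allow_list(event):
    """Drop every field that is not explicitly allowed."""
    return {k: v for k, v in event.items() if k in ALLOWED_FIELDS}


def luhn_ok(digits):
    """Luhn checksum, used to cut false positives from arbitrary digit runs."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_leaks(line):
    """Return the kinds of sensitive data that appear in a raw log line."""
    found = []
    if JWT_RE.search(line):
        found.append("jwt")
    for match in CARD_RE.finditer(line):
        if luhn_ok(re.sub(r"\D", "", match.group())):
            found.append("card")
            break
    return found


safe = allow_list({"msg": "user_login", "user_id": 42, "password": "hunter2"})
leaks = find_leaks('{"msg": "payment_failed", "note": "card 4111 1111 1111 1111"}')
```

A pre-prod test can run find_leaks over the logs captured during an integration run and fail the build if it returns anything.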
Logging in Code: Examples
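Node (pino):
import pino from 'pino';
// Fields in `base` are stamped on every line; `redact` masks the listed paths.
const log = pino({
  base: { service: 'checkout', env: process.env.NODE_ENV, version: process.env.GIT_SHA },
  redact: ['req.headers.authorization', 'password', '*.creditCard'],
});
// `req` is assumed to be the current request object from your HTTP framework.
log.info({ user_id: 42, duration_ms: 230, request_id: req.id }, 'user_login');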
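Python (structlog):
import structlog
# structlog renders for the console by default, so configure JSON output explicitly.
structlog.configure(processors=[structlog.processors.add_log_level, structlog.processors.TimeStamper(fmt="iso"), structlog.processors.JSONRenderer()])
log = structlog.get_logger()
# req_id is assumed to hold the current request's ID.
log.info("user_login", user_id=42, duration_ms=230, request_id=req_id)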
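Go (slog shown; zap is a common alternative):
// The default slog handler writes text, so install the JSON handler once at startup.
slog.SetDefault(slog.New(slog.NewJSONHandler(os.Stdout, nil)))
slog.Info("user_login",
    "user_id", 42,
    "duration_ms", 230,
    "request_id", reqID, // reqID is assumed to hold the current request's ID
)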
Every modern language has a structured logger. Use it.
Anti-Patterns
- Concatenating values into msg ("User " + id + " logged in"): defeats the whole point.
- Putting JSON inside JSON: flatten before logging.
- Stack traces split across many lines: emit as a single field.
- Logging in tight loops: sample or aggregate.
- Logging at ERROR for non-errors (404s, validation failures): wakes people for nothing.
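Two of these fixes are short enough to sketch in Python: carrying a stack trace as a single field, and sampling inside a hot loop. The function and class names are hypothetical. The sampler is a simple every-Nth counter, one of several possible strategies.

```python
import json
import traceback


def exception_event(exc):
    """Serialise an exception as one JSON line; newlines in the trace are escaped."""
    return json.dumps({
        "level": "error",
        "msg": "unhandled_exception",
        "error_type": type(exc).__name__,
        "error_message": str(exc),
        "stack": "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)),
    })


class EveryNth:
    """Allow the first call and then every n-th call after it."""

    def __init__(self, n):
        self.n = n
        self.seen = 0

    def should_log(self):
        self.seen += 1
        return (self.seen - 1) % self.n == 0


try:
    1 / 0
except ZeroDivisionError as exc:
    error_line = exception_event(exc)

sampler = EveryNth(100)
logged = sum(1 for _ in range(1000) if sampler.should_log())
```

Because json.dumps escapes the newlines, the whole trace stays on one physical line and travels with the rest of the event's fields.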
The Practical Bar
If you can answer these in your log tool today, you are doing well:
- "Show me everything that happened for request_id X across all services."
- "Show me all errors for tenant Y in the last hour."
- "Show me requests slower than 1 second by route."
- "Show me everything that happened during the deploy at 14:30 ± 5 minutes."
If any of those is hard, fix the structure first.
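As a rough self-check, the third question reduces to a filter plus a group-by once the fields exist. The sketch below assumes canonical request_complete lines with method, path, and duration_ms fields. The function name and sample data are illustrative.

```python
import json
from collections import Counter


def slow_requests_by_route(lines, threshold_ms=1000):
    """Count request_complete events slower than the threshold, grouped by route."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        if event.get("msg") != "request_complete":
            continue
        if event.get("duration_ms", 0) > threshold_ms:
            counts[(event.get("method"), event.get("path"))] += 1
    return counts.most_common()


# Hypothetical sample: two slow checkouts, one fast checkout, one slow search.
sample_lines = [
    '{"msg": "request_complete", "method": "POST", "path": "/checkout", "duration_ms": 1412}',
    '{"msg": "request_complete", "method": "POST", "path": "/checkout", "duration_ms": 2100}',
    '{"msg": "request_complete", "method": "POST", "path": "/checkout", "duration_ms": 412}',
    '{"msg": "request_complete", "method": "GET", "path": "/search", "duration_ms": 1800}',
]
slow = slow_requests_by_route(sample_lines)
```

If the equivalent query in your log tool needs a regex to extract the route or the duration, that field is missing from the structure.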