Open-source observability is powerful but operationally expensive. APM (Application Performance Monitoring) vendors offer the same capabilities as a unified, hosted product. For most companies, the buy-vs-build math favours buying.
What an APM Platform Includes
- Metrics ingestion (Prometheus-compatible or proprietary)
- Distributed tracing (often OpenTelemetry-compatible)
- Log aggregation
- Real User Monitoring (RUM) — performance from the browser
- Synthetic monitoring — uptime probes from many regions
- Infrastructure monitoring (host agent)
- Alerting and on-call (with PagerDuty/Slack integration)
- Service maps, dashboards, anomaly detection
The pitch: one agent, one bill, one UI for the whole observability stack.
Datadog
The market leader. Strengths:
- 500+ integrations — almost any tool you run has a Datadog integration.
- Strong unified UI: pivot from a metric to a trace to a log seamlessly.
- Excellent infrastructure and Kubernetes support.
- Watchdog AI for anomaly detection.
- Datadog Workflows, CSPM, security monitoring layered on top.
Weaknesses: pricing is the dominant complaint. Billing is split by host, by container, by ingested GB, and by custom metric, and overage charges often come as a surprise. Expect to spend real engineering time on cost control.
New Relic
Long-time APM vendor that has moved to a "consumption" pricing model (per user + per ingested GB).
- NRQL — a SQL-like query language across all data types.
- Strong APM and language agents (especially Java, .NET, Ruby).
- Includes browser and mobile monitoring.
- "New Relic One" unifies all features.
Dynatrace
Enterprise focus. Famous for "OneAgent" — a single agent that auto-discovers everything on a host.
- Davis AI — root-cause analysis based on the topology graph.
- Strong for traditional Java/.NET enterprise stacks and SAP.
- Less popular with smaller / cloud-native shops; pricier.
Honeycomb
Different philosophy. Built around "wide events" — every request is one rich record with hundreds of attributes. You ask questions like "p95 latency by build_id and feature_flag, broken out by region, only for users on iOS" and get an answer in seconds.
- BubbleUp — auto-finds attributes correlated with anomalies.
- OpenTelemetry-native.
- Excellent for debugging unknown unknowns in microservices.
- Smaller integrations footprint — pair with another tool for infra/RUM if needed.
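
The wide-event idea can be sketched in a few lines of plain Python. This is not Honeycomb's API or storage model; the field names (`build_id`, `feature_flag`, `region`, `os`, `duration_ms`), the sample records, and the helper `p95_by` are all hypothetical. The point is only that when every request is a single record, any attribute can serve as a filter or a group-by key without being declared in advance.

```python
from collections import defaultdict
from math import ceil

# Hypothetical wide events: one dict per request, all attributes on the same record.
events = [
    {"build_id": "b41", "feature_flag": "new_cart", "region": "eu", "os": "iOS", "duration_ms": 120},
    {"build_id": "b41", "feature_flag": "new_cart", "region": "eu", "os": "iOS", "duration_ms": 900},
    {"build_id": "b41", "feature_flag": "new_cart", "region": "us", "os": "iOS", "duration_ms": 110},
    {"build_id": "b42", "feature_flag": "old_cart", "region": "eu", "os": "iOS", "duration_ms": 95},
    {"build_id": "b42", "feature_flag": "old_cart", "region": "eu", "os": "Android", "duration_ms": 300},
]


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def p95_by(events, group_keys, where):
    """Keep events matching `where`, group them by `group_keys`, return p95 duration per group."""
    groups = defaultdict(list)
    for event in events:
        if where(event):
            key = tuple(event[k] for k in group_keys)
            groups[key].append(event["duration_ms"])
    return {key: percentile(durations, 95) for key, durations in groups.items()}


# "p95 latency by build_id and feature_flag, broken out by region, only for users on iOS"
result = p95_by(
    events,
    group_keys=("build_id", "feature_flag", "region"),
    where=lambda e: e["os"] == "iOS",
)
for key, p95 in sorted(result.items()):
    print(key, p95)
```

With pre-aggregated metrics, the same question would require those label combinations to have been chosen before the incident; here the grouping is decided at query time.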
Splunk
Originally a log search company, now a full observability suite (Splunk Observability Cloud after the SignalFx acquisition).
- Best-in-class log search, especially for security/SIEM use cases.
- Splunk APM and Infrastructure are credible.
- Often the right answer in regulated industries already running Splunk for security.
Elastic Observability
The Elastic Stack (Elasticsearch + Kibana) extended into APM.
- Strong if you already run Elastic for search or logs.
- OpenTelemetry-compatible.
- Self-hostable or managed (Elastic Cloud).
Cloud-Native Bundled APM
| Cloud | Service |
|---|---|
| AWS | CloudWatch + X-Ray + Application Signals (newer unified APM) |
| Azure | Azure Monitor + Application Insights |
| GCP | Google Cloud Observability, formerly the Cloud Operations suite (Logging, Monitoring, Trace, Profiler) |
Pros: zero ops, IAM-integrated, often the cheapest at small-to-medium scale. Cons: weak visibility across clouds, lock-in, and a user experience that varies from service to service.
Choosing
| If you… | Consider |
|---|---|
| Want one tool that does everything | Datadog |
| Are on a tight budget but have the skills to run it yourselves | Grafana stack (self-hosted Prometheus, Loki, Tempo, Grafana) |
| Already on a single cloud | That cloud's native suite |
| Have complex microservices and need deep debugging | Honeycomb |
| Run a Java/.NET enterprise stack | Dynatrace or New Relic |
| Need security + observability together | Splunk or Datadog |
| Already invested in Elastic | Elastic Observability |
Pricing Models, Decoded
- Per host: Datadog. Predictable for VMs, expensive for thousands of containers (often there is also a container surcharge).
- Per ingested GB: most log services. Volume control is everything.
- Per custom metric: Datadog is the usual surprise here. Each unique combination of metric name and label values counts as a separate metric, so a single high-cardinality label can multiply the bill.
- Per user + per ingested GB: New Relic.
- Per event: Honeycomb. Cost tracks the number of events sent rather than the number of hosts.
- Pay-as-you-go cloud: CloudWatch, Azure Monitor.
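
To see why per-custom-metric pricing surprises people, it helps to count label combinations for a single metric. The sketch below is a rough upper bound under assumed labels (`service`, `region`, `status`, `customer_id` are hypothetical names); vendors generally bill on the combinations actually observed, which can be lower.

```python
from math import prod


def series_count(label_values):
    """Upper bound on distinct series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical labels attached to one metric, e.g. request latency.
labels = {
    "service": ["checkout", "search", "auth"],
    "region": ["eu", "us", "ap"],
    "status": ["2xx", "4xx", "5xx"],
}
baseline = series_count(labels)
print(baseline)  # 3 * 3 * 3 = 27

# Adding one high-cardinality label multiplies the count.
labels["customer_id"] = [f"c{i}" for i in range(1000)]
with_customer = series_count(labels)
print(with_customer)  # 27 * 1000 = 27000
```

One innocuous-looking label turns 27 billable series into as many as 27,000, which is why unbounded identifiers are usually kept out of metric labels.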
Always model the bill at 2× current scale. Run a one-month POC with realistic data before committing.
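
A minimal sketch of that modelling exercise follows. Every number in it is hypothetical: the unit prices in `Rates` are placeholders, not any vendor's list prices, and real invoices add tiers, commitments, and overage rules that this ignores. It also assumes seat count stays flat while traffic-driven dimensions scale.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Usage:
    hosts: int
    ingested_gb: float
    custom_metrics: int
    users: int


@dataclass(frozen=True)
class Rates:
    # Hypothetical monthly unit prices; replace with figures from a real quote.
    per_host: float
    per_gb: float
    per_custom_metric: float
    per_user: float


def monthly_bill(usage: Usage, rates: Rates) -> float:
    """Sum each billing dimension for one month."""
    return (
        usage.hosts * rates.per_host
        + usage.ingested_gb * rates.per_gb
        + usage.custom_metrics * rates.per_custom_metric
        + usage.users * rates.per_user
    )


def at_scale(usage: Usage, factor: float) -> Usage:
    """Scale the traffic-driven dimensions; seat count is assumed to stay flat."""
    return replace(
        usage,
        hosts=round(usage.hosts * factor),
        ingested_gb=usage.ingested_gb * factor,
        custom_metrics=round(usage.custom_metrics * factor),
    )


rates = Rates(per_host=15.0, per_gb=0.25, per_custom_metric=0.05, per_user=50.0)
today = Usage(hosts=40, ingested_gb=2000.0, custom_metrics=5000, users=10)

bill_now = monthly_bill(today, rates)
bill_2x = monthly_bill(at_scale(today, 2), rates)
print(f"now: {bill_now:.2f}  at 2x: {bill_2x:.2f}")
```

Running the same function against measured usage from the POC, rather than guessed inputs, is what makes the projection worth trusting.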
Avoiding Lock-In
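OpenTelemetry is your insurance policy. Instrument your code with the OTel SDK and send telemetry through the OpenTelemetry Collector, which exports to whichever backend you pay for. If your APM vendor doubles its price, redirecting the data to another backend is a Collector config change rather than a re-instrumentation project.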
Avoid:
- Vendor-specific SDKs as your primary instrumentation.
- Heavy use of vendor-specific query languages in alert rules.
- Keeping dashboards only inside the vendor's UI. Export them as JSON and keep them in Git too.
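
To illustrate why a Collector in the middle makes a backend switch small, the sketch below builds two dicts shaped like an OpenTelemetry Collector config and checks which top-level sections differ. The endpoints, header names, and the helper `collector_config` are hypothetical; some vendors need their own exporter or extra processors, so treat this as the best case rather than a guarantee.

```python
import json


def collector_config(endpoint, headers):
    """Build a dict shaped like a Collector config with a single OTLP/HTTP exporter."""
    return {
        # Applications send OTLP to the Collector; this part never mentions a vendor.
        "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
        "processors": {"batch": {}},
        # Only this section is backend-specific.
        "exporters": {"otlphttp": {"endpoint": endpoint, "headers": headers}},
        "service": {
            "pipelines": {
                signal: {
                    "receivers": ["otlp"],
                    "processors": ["batch"],
                    "exporters": ["otlphttp"],
                }
                for signal in ("traces", "metrics", "logs")
            }
        },
    }


# Hypothetical backends; the ${env:...} references keep secrets out of the file.
vendor_a = collector_config("https://otlp.vendor-a.example", {"x-api-key": "${env:VENDOR_A_KEY}"})
vendor_b = collector_config("https://ingest.vendor-b.example", {"authorization": "${env:VENDOR_B_TOKEN}"})

changed = [section for section in vendor_a if vendor_a[section] != vendor_b[section]]
print(changed)  # ['exporters']
print(json.dumps(vendor_a["exporters"], indent=2))
```

Application code and the receiver and pipeline sections are identical in both cases; dashboards and alert rules still have to be rebuilt on the new backend, which is what the list above is guarding against.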
Buy or Build?
Most teams under 200 engineers should buy. The hidden cost of self-hosting Prometheus + Loki + Tempo + Alertmanager + Grafana at scale (patching, scaling, and on-call for the observability platform itself) usually exceeds the licence fee. Self-host once you are big enough that the licence fee exceeds the cost of a dedicated SRE team. There is no shame in either direction; pick the one that fits your stage.