Monitoring answers the questions you knew to ask in advance. Observability lets you answer the ones you didn't. Those are not the same thing, and buying a "monitoring platform" and calling it observability does not make it so. I've sat through enough vendor demos where the Datadog rep explains that you can now ask novel questions because you bought APM to know this: your ability to ask novel questions depends on the shape of the data, not the logo on the dashboard.

The three pillars — and why they aren't enough

Metrics, logs, traces. You know the list. The problem is that the three-pillars framing implies they're equal, orthogonal, and sufficient. They aren't.

  • Metrics are cheap but pre-aggregated. A counter that's been bucketed to one-minute resolution has already thrown away the information you need to debug the weird tail latency.
  • Logs are expensive and unstructured. If your logs are text lines, you're grepping. Grep is not observability. Structured logs (JSON, key/value) plus indexed querying are the minimum bar; there's a minimal sketch after this list.
  • Traces are structured but sampled. And the thing you want to debug is almost always in the 99.9th percentile — which your head-based sampler threw away.
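
To make that structured-logs bar concrete, here's a minimal Python sketch using the standard logging module with a JSON formatter. The field names are illustrative, not a schema recommendation.

    import json
    import logging

    class JsonFormatter(logging.Formatter):
        """Render each record as one JSON object so every field stays queryable."""
        def format(self, record):
            payload = {
                "ts": record.created,
                "level": record.levelname,
                "msg": record.getMessage(),
                **getattr(record, "fields", {}),   # extra fields attached below
            }
            return json.dumps(payload)

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("checkout")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    # One structured line per unit of work, not free text to grep later.
    log.info("request complete", extra={"fields": {
        "route": "/api/checkout",
        "status": 502,
        "duration_ms": 1843,
    }})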

What actually gives you observability is the ability to slice a single unit of work — a request, a job, a transaction — by arbitrary dimensions you didn't pre-declare. Charity Majors calls this high-cardinality, wide events. She's right. If your telemetry library forces you to declare dimensions at ingest time, you've built a monitoring system.
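
Here's roughly what that looks like in code: one wide event per request, built up as the work happens and emitted at the end. This is a sketch; the field names and the emit destination are placeholders, not any particular vendor's API.

    import time
    import uuid

    def handle_request(request, handler, emit):
        """Wrap a handler, build one wide event for the request, emit it at the end.

        `handler` does the real work; `emit` stands in for whatever ships the
        event (stdout, an OTLP exporter, a ClickHouse insert).
        """
        start = time.time()
        event = {
            "event_id": str(uuid.uuid4()),
            "service": "checkout",
            "route": request.get("route"),
            "user_id": request.get("user_id"),   # high cardinality, on purpose
            "plan": request.get("plan"),
            "region": request.get("region"),
            "build_sha": request.get("build_sha"),
        }
        try:
            handler(request)
            event["status"] = 200
        except Exception as exc:
            event["status"] = 500
            event["error"] = type(exc).__name__
            raise
        finally:
            event["duration_ms"] = round((time.time() - start) * 1000, 2)
            emit(event)   # one record, as many dimensions as you need

    # Slice later by any of these fields; none had to be declared at ingest time.
    handle_request({"route": "/api/checkout", "user_id": "u_19283"},
                   handler=lambda r: None, emit=print)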

High cardinality is the whole game

Prometheus is wonderful until someone adds user_id as a label. Cardinality explodes, storage dies, and the on-call engineer deletes the metric at 3am. Prometheus is a metrics system, not an events system. Treating it as observability is a category error.
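
The arithmetic behind that explosion is worth spelling out. The numbers below are made up but realistic; the point is that every distinct combination of label values becomes its own time series.

    # Each distinct combination of label values is a separate time series.
    routes, statuses, users = 50, 5, 1_000_000
    print(routes * statuses)           # 250 series: fine
    print(routes * statuses * users)   # 250,000,000 series: storage dies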

What you want is events with hundreds of dimensions per event, queryable ad-hoc. Honeycomb pioneered this. ClickHouse-backed stacks (think SigNoz, or rolling your own with OTel Collector → ClickHouse) do the same thing with open source. Grafana's Loki plus Tempo gets partway there. Datadog will do it if you're willing to pay for the cardinality tier.
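
For the roll-your-own flavor, an ad-hoc query over wide events looks something like the following. The events table and its columns are hypothetical; the point is that build_sha, region, and plan never had to be declared as dimensions ahead of time.

    import clickhouse_connect

    # Hypothetical `events` table of wide events written by the collector;
    # column names are illustrative.
    client = clickhouse_connect.get_client(host="localhost")

    # The novel question no dashboard pre-declared: which build/region/plan
    # combinations are driving slow checkouts right now?
    result = client.query("""
        SELECT build_sha, region, plan,
               count() AS n,
               quantile(0.99)(duration_ms) AS p99
        FROM events
        WHERE route = '/api/checkout' AND ts > now() - INTERVAL 1 HOUR
        GROUP BY build_sha, region, plan
        ORDER BY p99 DESC
        LIMIT 20
    """)
    for row in result.result_rows:
        print(row)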

Sampling that doesn't lie to you

Every non-trivial tracing deployment needs sampling. The wrong answer is head-based sampling at 1% — which is the default and which guarantees you never capture the interesting request. The right answer is tail-based sampling: let the request complete, then decide whether it's interesting based on errors, latency, or specific attributes.

Keep every 5xx. Keep every request over the 99th-percentile latency. Keep a configurable fraction of the boring ones for baseline. Throw the rest away.
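
The decision logic is simple enough to sketch in a few lines. This is illustrative Python, not the collector's implementation, and the thresholds are placeholders.

    import random

    def keep_trace(spans, p99_ms=850.0, baseline_rate=0.01):
        """Decide, after the trace has completed, whether it is worth storing.

        `spans` is a list of dicts with `status_code` and `duration_ms`;
        the latency threshold and baseline rate are illustrative knobs.
        """
        # Keep every trace containing a server error.
        if any(s["status_code"] >= 500 for s in spans):
            return True
        # Keep every trace slower than the current 99th percentile.
        if max(s["duration_ms"] for s in spans) > p99_ms:
            return True
        # Keep a small random fraction of boring traces for a baseline.
        return random.random() < baseline_rate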

The OpenTelemetry Collector supports this natively via the tail_sampling processor in the contrib distribution. Use it. Pay for the compute to run collectors instead of paying to store ten billion uninteresting spans.

The log retention bill

Naive log retention is the single biggest observability cost line I see. Teams ingest everything into Splunk or CloudWatch Logs at $0.50-$3/GB ingested, set retention to 90 days because "compliance," and wake up to a $400K annual bill for logs nobody ever queries past day 3.

What works:

  • Tiered retention: 7 days hot and queryable, 30 days warm in S3 with Athena or similar, archive beyond that.
  • Drop noisy log sources at the collector — health checks, chatty frameworks, debug statements that shipped by accident.
  • Convert high-volume structured logs into metrics at the edge. You don't need every access log line if the dimensions you care about are already in a histogram; a sketch of that rollup follows this list.
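
Here's what that edge rollup can look like in Python with prometheus_client, assuming access-log lines already parsed into dicts. In practice you'd likely do this in the collector or your log shipper, but the shape of the trade is the same: keep a few low-cardinality dimensions, drop the raw lines.

    from prometheus_client import Histogram, start_http_server

    # Low-cardinality dimensions only: route and status class, never user_id.
    REQUEST_LATENCY = Histogram(
        "http_request_duration_seconds",
        "Access-log latency rolled up at the edge",
        ["route", "status_class"],
        buckets=(0.05, 0.1, 0.25, 0.5, 1, 2.5, 5),
    )

    def record_access_log(line: dict) -> None:
        """Fold one parsed access-log line into a histogram instead of shipping it."""
        REQUEST_LATENCY.labels(
            route=line["route"],
            status_class=f"{line['status'] // 100}xx",
        ).observe(line["duration_ms"] / 1000)

    if __name__ == "__main__":
        start_http_server(9100)   # expose /metrics for scraping
        record_access_log({"route": "/api/checkout", "status": 502, "duration_ms": 1843})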

OTel is the convergence point

OpenTelemetry isn't the fastest library, the prettiest SDK, or the cheapest way to get started. It is the only telemetry standard you should be betting on in 2026. Instrument once, ship to whatever backend you like, and change backends without re-instrumenting when the vendor gets greedy. That alone is worth the tradeoffs.

The architecture that works: apps instrumented with OTel SDKs → OTel Collector (with sampling, enrichment, filtering) → your chosen backend. The collector is the seam. Own the seam.
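
On the application side, that looks something like this in Python. The endpoint and service name are placeholders; the important part is that the app only knows about the collector, so swapping backends is a collector config change, not a re-instrumentation.

    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    # Everything goes to a local collector; sampling, enrichment, and the choice
    # of backend live there, not in application code.
    provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
    )
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user_id", "u_19283")   # wide attributes ride on spans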

The takeaway: If your dashboards answer the question you're asking, you have monitoring, and that's fine for known failure modes. If novel incidents still mean SSH-ing into boxes and tailing logs, you don't have observability — no matter what you bought. Start with structured, wide events. Sample intelligently. Own the collector.