Modern Full-Stack Observability
Cover logs, metrics, traces, and RUM for production engineering teams — with both a cost-efficient OSS path and a premium managed path.
The Stack
OpenTelemetry
Instrumentation layer. Vendor-neutral SDK and collector for traces, metrics, and logs. Instrument once, route to any backend. Prevents lock-in to a single observability vendor.
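The "instrument once, route to any backend" idea can be sketched with the standard OTLP exporter environment variables that OpenTelemetry SDKs read at startup. This is a minimal sketch, not a reference setup: the backend names, endpoint URLs, and API key header below are hypothetical placeholders, and only the `OTEL_*` variable names come from the OpenTelemetry specification.

```python
"""Sketch: switch telemetry backends by changing exporter settings, not code."""

# Hypothetical backends; the URLs and header values are placeholders.
BACKENDS = {
    "oss": {"endpoint": "http://otel-collector.internal:4318", "headers": {}},
    "managed": {
        "endpoint": "https://otlp.example-vendor.com",
        "headers": {"x-api-key": "REPLACE_ME"},
    },
}


def otlp_env(service_name: str, backend: str) -> dict[str, str]:
    """Build the standard OTEL_* variables for one service and backend."""
    target = BACKENDS[backend]
    env = {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_ENDPOINT": target["endpoint"],
        "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
    }
    if target["headers"]:
        # The spec format is a comma-separated list of key=value pairs.
        env["OTEL_EXPORTER_OTLP_HEADERS"] = ",".join(
            f"{key}={value}" for key, value in target["headers"].items()
        )
    return env


if __name__ == "__main__":
    for name, value in otlp_env("checkout", "managed").items():
        print(f"{name}={value}")
```

Moving a service from the OSS path to a managed path then becomes a deployment-config change: the application code and its instrumentation stay the same.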
Grafana
Unified observability dashboards (OSS path, optional). Connects to Prometheus, Loki, and Tempo as data sources. Single pane of glass for all signals. Free self-hosted; Grafana Cloud has a generous free tier.
Prometheus
Metrics storage and alerting (OSS path, optional). Industry-standard pull-based metrics. PromQL is the de facto metrics query language. Pairs with Alertmanager for on-call routing.
Alternatives: thanos, cortex, victoria-metrics
Loki
Log aggregation (OSS path, optional). Prometheus-style log indexing: indexes only labels, not the full log text. Dramatically cheaper than Elasticsearch at scale. Integrates natively with Grafana.
Tempo
Distributed tracing backend (OSS path, optional). Pairs with Grafana for trace correlation. Stores traces in object storage (S3/GCS), keeping cost predictable even at high trace volume.
Alternatives: jaeger, zipkin
Datadog
All-in-one managed observability (premium path, optional). Unified APM, logs, metrics, synthetics, and RUM in one SaaS platform. Best-in-class AI-assisted root-cause analysis. High cost at scale; budget carefully.
Alternatives: new-relic, honeycomb
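Because hosts, log ingestion, and APM spans are metered separately (see Gotchas), a rough cost model helps with budgeting. The following is a minimal sketch under stated assumptions: the unit prices in `Rates` are hypothetical and do not reflect any vendor's actual price list, and the `Usage` fields are illustrative names.

```python
"""Sketch: estimate a monthly bill from separately metered dimensions."""
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    # Hypothetical unit prices in USD; replace with your contract's numbers.
    per_host: float = 20.0
    per_gb_logs: float = 0.10
    per_million_spans: float = 2.0


@dataclass(frozen=True)
class Usage:
    hosts: int
    log_gb: float
    spans_millions: float


def monthly_cost(usage: Usage, rates: Rates = Rates()) -> dict[str, float]:
    """Return the cost of each billed dimension plus the total."""
    parts = {
        "hosts": usage.hosts * rates.per_host,
        "logs": usage.log_gb * rates.per_gb_logs,
        "spans": usage.spans_millions * rates.per_million_spans,
    }
    parts["total"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    baseline = monthly_cost(Usage(hosts=50, log_gb=2_000, spans_millions=300))
    # Same fleet, but one chatty service multiplies log volume by ten.
    noisy = monthly_cost(Usage(hosts=50, log_gb=20_000, spans_millions=300))
    print(f"baseline: ${baseline['total']:,.0f}  noisy: ${noisy['total']:,.0f}")
```

Running the model against last month's usage, and again with one dimension inflated, gives a quick sense of which meter deserves a spend alert first.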
Honeycomb
High-cardinality event analytics (premium path, optional). Purpose-built for query-driven debugging of distributed traces. Excels where Datadog/Grafana struggle with high-cardinality fields (user IDs, request paths).
Axiom
Affordable managed log analytics (optional). Stores unlimited logs at a flat, predictable price. Good Datadog logs replacement for log-heavy workloads. Ships a native OpenTelemetry endpoint.
Sentry
Error tracking and session replay. Captures stack traces, release regressions, and user session replays. Covers both frontend JS errors and backend exceptions. Works on OSS and managed paths.
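Backend exception events often carry request data, so it can help to redact obvious PII before an event leaves the process. The following is a minimal sketch, not Sentry's own scrubbing logic: the `SENSITIVE_KEYS` deny-list and the email pattern are illustrative assumptions, and the only Sentry-specific detail is that `before_send` uses the `(event, hint)` signature accepted by `sentry_sdk.init(before_send=...)`.

```python
"""Sketch: redact obvious PII from an event dict before it leaves the process."""
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical deny-list; extend it to match your own payload fields.
SENSITIVE_KEYS = {"password", "token", "authorization", "ssn"}


def scrub(value, key=""):
    """Recursively redact sensitive keys and email-like strings."""
    if key.lower() in SENSITIVE_KEYS:
        return "[redacted]"
    if isinstance(value, dict):
        return {k: scrub(v, str(k)) for k, v in value.items()}
    if isinstance(value, list):
        return [scrub(item) for item in value]
    if isinstance(value, str):
        return EMAIL.sub("[email]", value)
    return value


def before_send(event, hint):
    """Signature matches the hook accepted by sentry_sdk.init(before_send=...)."""
    return scrub(event)
```

A hook like this only covers backend events sent through the Python SDK; frontend session replay has its own masking settings and needs separate review.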
PostHog
Product analytics and RUM (optional). Adds user-level session recordings, feature flags, funnels, and A/B testing alongside basic frontend performance monitoring. Can be self-hosted for data privacy.
Alternatives: umami, plausible
Gotchas
- ⚠️ Datadog billing is notoriously complex — ingestion volume, number of hosts, and APM spans are billed separately. A single chatty microservice can triple your bill overnight. Set spend alerts and carve out log exclusion filters on day one.
- ⚠️ The Grafana OSS stack requires real operational investment: Prometheus needs sharding or Thanos/Cortex for HA, Loki needs careful label design to stay performant, and Tempo needs object-storage lifecycle policies. Budget 0.5–1 FTE of platform time.
- ⚠️ OpenTelemetry auto-instrumentation adds ~5–15 ms overhead per request in some languages. Benchmark your critical paths before rolling out to production.
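- ⚠️ Sentry session replay can capture PII. Recent SDKs mask text and inputs by default, but URLs and any elements you unmask can still leak personal data. Review masking and scrubbing rules before enabling replay in production environments subject to GDPR/CCPA.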
- ⚠️ Mixing OSS and managed tools (e.g. Grafana Cloud + Axiom + Sentry) can create alert correlation gaps — invest in a unified on-call runbook so engineers know which tool to open first.
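The Loki label-design gotcha above can be checked before rollout by counting distinct values per label in a sample of label sets. This is a minimal sketch, assuming label sets are available as plain dicts; the `MAX_VALUES_PER_LABEL` threshold and the function names are hypothetical, not part of Loki.

```python
"""Sketch: flag log labels whose value count would bloat a label index."""
from collections import defaultdict

# Hypothetical threshold; tune it for your own cluster.
MAX_VALUES_PER_LABEL = 100


def label_cardinality(streams):
    """Count distinct values seen for each label across label sets."""
    values = defaultdict(set)
    for labels in streams:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(seen) for name, seen in values.items()}


def risky_labels(streams, limit=MAX_VALUES_PER_LABEL):
    """Return labels with more distinct values than the limit, sorted by name."""
    counts = label_cardinality(streams)
    return sorted(name for name, count in counts.items() if count > limit)


if __name__ == "__main__":
    sample = [
        {"app": "api", "env": "prod", "user_id": str(i)} for i in range(500)
    ]
    print(risky_labels(sample))  # prints ['user_id']
```

Labels flagged this way (user IDs, request paths) are usually better kept in the log line itself and filtered at query time.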
Related Stacks
Kubernetes Platform Foundation
Core tooling for an internal Kubernetes platform team — GitOps delivery, policy enforcement, secret management, cluster autoscaling, and backup.
Modern CI/CD Starter Kit
Fast, maintainable CI/CD for a new codebase — GitHub Actions as the orchestrator, turbo/Nx for build caching, and drop-in fast runners to cut pipeline minutes.