Observability as architecture: why you can’t bolt it on later
Most teams treat observability like a nice-to-have. They’re building blind — and paying for it in ways they can’t even see.

I once joined a team that had been struggling with delivery for the better part of two years. Demos kept failing. Deployments kept crashing production. Rollbacks were so routine that the ops team had gotten eerily efficient at them. (I’ll refer back to the team later — let’s call the company “Acme.”)
I remember one in particular. The team was demoing a reactive search feature — the kind of thing where you start typing a product name and suggestions appear in real time. Same idea as typing “penguins” into Google and getting “Penguins of Madagascar” as a suggestion.
The demo cratered. Someone typed a product name, and nothing happened. Twenty minutes of poking around later, the demo was cancelled. What struck me wasn’t the failure itself — it was the reaction. Nobody was surprised. Nobody expected it to work. “Maybe next time.”1
Later that afternoon, a planned deployment seemed to have crashed a key service. All hands on deck for a rollback. The team executed quickly. Rollback done. They’d had a lot of practice.
Here’s what immediately jumped out at me: in both cases the team had zero operational intelligence. No telemetry. No metrics. No centralized logs. No traceability. No way to quickly zero in on a root cause. They were flying blind — and they’d been flying blind so long they’d forgotten what visibility even looked like.
And that’s why the reaction was the same each time. Abort. Rollback. “Afterward, we’ll figure out what’s wrong.”
The irony is that both failures had quick fixes. Each time, queries weren’t reaching a service because the wrong port was configured. Instead of changing the port and continuing, the team burned hours — even days — backtracking, diagnosing and trying again.
That experience crystallized something I’d been feeling for decades: observability isn’t a tool you add. It’s an architectural decision you make — or don’t. And the cost of not making it compounds in ways that are genuinely difficult to measure, precisely because you don’t have the instrumentation to measure them.
If you’re new, welcome to Customer Obsessed Engineering! Every week I publish a new article, direct to your mailbox if you’re a subscriber. As a free subscriber you can read about half of every article, plus all of my free articles.
Anytime you’d like to read more, you can upgrade to a paid subscription.
The “flying blind penalty”
In the old days — before cloud, before microservices, before systems became truly distributed — opacity was tolerable. You had a monolith on a few servers. If something broke, you SSH’d in, looked at logs, poked around. It might take hours, but there wasn’t that much territory to cover.
Modern systems don’t afford that luxury. A single user request could traverse dozens of services, touching databases, message queues, caches and external APIs along the way. Just today I participated in a sprint review discussing a database architecture that was 29 views deep. A latency spike could originate in any one of those — or in the invisible interplay between the services that call them.
Without observability, you’re operating on instinct and prayer.
Here’s what that looks like in practice. A bug ships. It doesn’t manifest right away — it surfaces under load, or in a specific user segment, or under certain timing conditions. By the time anyone notices, users are already suffering. Now the team has to diagnose the problem. If they’re lucky, someone complains and provides enough detail to be helpful. They narrow down which service is involved, reproduce the issue, trace the flow of data through the system, find the root cause, fix it, deploy and verify.
And the whole time, you’re sweating over how many more customers are hitting the same problem.
Every single step without observability compounds in difficulty. You’re not just debugging — you’re doing archaeology at midnight, piecing together behavior from scattered artifacts across dozens of services.
With observability designed in from the start, you proactively see which service is degraded. You can drill into specific request traces. You can correlate logs across service boundaries. Mean time to detect and repair drops from hours to minutes — and detection can even become predictive.2
DORA.dev research bears this out. Elite-performing teams — the ones deploying multiple times per day with sub-hour recovery times — aren’t elite because they have better engineers. They’re elite because visibility closes a critical gap: it’s a prerequisite for every other capability DORA measures. They can detect and diagnose problems before those problems cascade. Their systems are instrumented to answer questions the team hasn’t even thought to ask yet.3
What matters is how you think about the problem
Monitoring and observability are not the same thing.
Monitoring answers questions you anticipated. You watch CPU utilization, disk space, error rates. You set thresholds and alert when they’re exceeded. “Is the database up? Is the error rate spiking?” These are known questions about known conditions.
Observability answers questions you haven’t thought of yet. It’s the ability to interrogate your system’s behavior without adding new instrumentation every time you have a new question. “Why is this specific user’s checkout so slow today when it was fine yesterday? What changed in the request path between those two deployments?”
Charity Majors — co-founder of Honeycomb and one of the sharpest voices on this topic — frames it as the difference between known-unknowns and unknown-unknowns. Monitoring handles the known-unknowns: the things you anticipated might go wrong. Observability handles the unknown-unknowns: the things you couldn’t have predicted. It empowers you to follow the breadcrumbs wherever they lead.4
Monitoring is reactive. Observability is generative — it creates the conditions for discovery.
Daniela Miao, CTO of Momento, put it well in a Dev Interrupted episode: “By giving more visibility into what’s happening with your software, your application, you actually do enable development to be faster as well. It’s not just insurance for when something goes wrong.”5
That’s critical. Most teams think of observability as insurance — you invest in it to cushion the blow when things break. But the real dividend is velocity. Teams with strong observability ship faster because they can see the impact of their changes in real time. They don’t have to guess whether a deployment worked or whether a query tweak actually ran faster. They know.
Why we don’t instrument on day one
Early stage. You’re shipping fast. Three or four services, maybe. Errors go to a text file. You’re small enough that you can hold the whole system in your head. Observability feels like overkill. “It’s in the logs, just read ‘em.”
Growth. Suddenly there are ten services, and logs are scattered across different servers. You can’t correlate them anymore. Someone says “we should get a logging tool.” You pick one — Datadog, Splunk, whatever — and bolt it on. The instrumentation is inconsistent. Some services are covered, others aren’t. Metrics are sparse and don’t tell a coherent story. You’re reading fragmented logs from disparate systems and trying to correlate events in your head. It’s starting to fall apart.
Pain. Now you have fifty services. Traffic has increased tenfold. Something breaks, and you’re still flying blind — not because you lack tools, but because the observability layer wasn’t designed. It was stitched on, piecemeal, after the fact. Different teams instrumented different things using different conventions. Tracing is incomplete. Metrics don’t correlate with logs. You’re paying more and more for tools that don’t give you the insight you need.
Here’s what’s easy to miss: that team invested in observability. They bought the tools. They checked the box. But because every investment was reactive — a logging platform when logs got painful, a metrics dashboard when someone asked for one, a tracing system when debugging got unbearable — they ended up with three separate systems that don’t talk to each other. Logs in one silo, metrics in another, traces in a third. Every decision about what to capture was made at write time, under pressure, for a specific fire they were fighting that week.
Charity Majors calls this Observability 1.0: many pillars, many sources of truth, many tools. It’s what you get by default when you bolt observability on after the fact. The fragmentation isn’t a failure of effort — it’s a consequence of timing. You can’t design a unified telemetry model when you’re patching gaps reactively.6
And it matters — because a fragmented model can answer the questions you anticipated (is the database up? is the error rate spiking?) but collapses when you need to ask something new. “Why are checkouts slow today when they were fine yesterday? What changed between those two deployments?” Those questions require slicing across dimensions you didn’t pre-define. They require the kind of open-ended interrogation that siloed tools simply can’t support.
Insight requires design. That’s the part teams miss.
Why does this happen? Because observability isn’t a feature. It’s not something you build first; it’s what gets built last. It doesn’t generate revenue — at least not visibly. It feels expensive up front. And teams, under pressure to deliver, don’t carve out the time they need to plan a solid foundation.
I’ve written about the same false economy with security and testing. The things that feel expensive to build at the start turn out to be astronomically more expensive to retrofit.7
Why it has to be architectural
You can’t add observability with a library import. You can add logging and metrics — that’s the easy part. But genuine observability, the kind that lets you answer arbitrary questions about system behavior, requires architectural forethought.
Consider what’s involved:
How do requests get identified as they flow across boundaries? Trace IDs need to propagate through HTTP headers, message queue metadata, database session context — everywhere a request touches. If you don’t design this in, you can’t correlate activity across services.
What events merit recording? Not every line of code should produce a log. That’s noise. But the meaningful transitions — state changes, boundary crossings between services, external calls and their response characteristics — these are the signals that matter.
What metrics actually reflect system health? Not just CPU and memory, but application-level indicators. How long do payment transactions take? What’s the cache hit ratio? What’s the queue depth trend? These are the metrics that tell you whether your system is behaving — or trending toward overload.
How does context propagate? When a request carries metadata — a customer ID, a feature flag, a request source — how does that context flow through the system so that it’s captured at every key transition?
These decisions shape how you structure services, how you handle requests, how you organize code. And retrofitting them after the fact is brutally hard.
For example: if you decide later that you want end-to-end tracing, you need to add trace ID generation at your entry points, thread those IDs through every service-to-service call, correlate logs using those IDs, and store traces in a specialized backend. That means touching every API, every RPC, every async boundary. Build it in from the start, and it’s a few lines of middleware.
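Here’s roughly what those “few lines of middleware” can look like — a sketch in Python using the standard library’s contextvars, where `handle` and the `handler(headers) -> dict` signature are hypothetical stand-ins for your framework’s request entry point:

```python
import uuid
from contextvars import ContextVar

# Holds the current request's trace ID; readable anywhere on this request path.
trace_id: ContextVar[str] = ContextVar("trace_id", default="")

def trace_middleware(handler):
    """Wrap a request handler so every request carries a trace ID.

    Reuses an incoming X-Trace-Id header when present, otherwise mints one,
    and echoes it back so callers can correlate their own telemetry.
    (handler(headers) -> dict is a hypothetical framework signature.)
    """
    def wrapped(headers):
        tid = headers.get("X-Trace-Id") or uuid.uuid4().hex
        token = trace_id.set(tid)
        try:
            response = handler(headers)
            response.setdefault("X-Trace-Id", tid)  # echo the ID to the caller
            return response
        finally:
            trace_id.reset(token)
    return wrapped

@trace_middleware
def handle(headers):
    # Downstream code reads trace_id.get() instead of threading an argument.
    return {"body": "handled trace " + trace_id.get()}
```

The same pattern extends to outbound calls: any HTTP client wrapper can read `trace_id.get()` and attach it as a header, so the ID propagates without any business code knowing it exists.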
Consider structured logging: if you’ve been writing freeform string logs for two years, migrating to structured JSON events means auditing every logging statement, retrofitting context and rewriting every query your team depends on. If you start with structured logs, it’s just how things work from day one.
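Starting structured on day one can be as small as a helper like this — a minimal sketch (`log_event` is a hypothetical name; most logging libraries offer an equivalent JSON formatter):

```python
import json
import sys
import time

def log_event(**fields):
    """Emit one log event as a single JSON line instead of a freeform string."""
    event = {"ts": round(time.time(), 3), **fields}
    sys.stdout.write(json.dumps(event) + "\n")
    return event

# Instead of: log.info("User 12345 logged in from 192.168.1.1")
log_event(event="user_login", user_id=12345,
          source_ip="192.168.1.1", duration_ms=47)
```

Every field is named and queryable from the first commit, so there’s never a freeform-to-structured migration to pay for later.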
It’s not expensive when you build it in. It’s a handful of decisions, many of which improve things by introducing consistency. OpenTelemetry standardizes the plumbing — trace propagation, metric collection, structured logging — across languages and frameworks. You focus on what to instrument, not how to propagate context.8
But retrofitting: that’s an ordeal. Orders of magnitude more expensive — in engineering time, in tool sprawl and in the debugging sessions that never needed to happen.
What to actually build
Here’s how to do it right — whether you’re making the investment in an old project or starting fresh. The key is to think in events, not pillars.
Start with events, not tools
The most consequential decision you’ll make is how you model your telemetry. Do you emit three separate streams — logs here, metrics there, traces somewhere else — and try to correlate them after the fact? Or do you start with a single primitive: one structured event per unit of work?
The event-first approach works like this: when a request enters your system, you begin composing a structured event — a wide, context-rich record that accumulates everything meaningful about that request as it flows through your services. By the time the request completes, the event carries the trace ID, service name, duration, user ID, feature flags, error state, dependency latencies, cache hits, queue depths — whatever context matters.
You store that event once. From that single store, you can derive everything else: aggregate events over time and you get metrics; filter and read individual events and you get logs; stitch events together by trace ID and you get traces. One source of truth, many views.
This is what Charity Majors calls Observability 2.0 — and it’s the architecture that actually delivers on the promise of answering questions you haven’t thought of yet. Because the events are wide (carrying dozens or hundreds of fields), you can slice by any dimension after the fact. You didn’t have to anticipate the question at write time. You just ask it: “show me all events where cache_hit=false and region=eu-west and latency_p99 > 2s.” If those fields are in the event, you get your answer.
Compare that to the three-pillars approach, where you’d need to have predefined that specific metric, logged that specific combination, and hoped your tracing tool captured the right spans. Three tools, three queries, three partial answers — if you’re lucky.
The event model is the architectural decision. Everything else follows from it.
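To make the slicing concrete, here’s a minimal sketch of the event-first model in plain Python — the “store” is just a list of dicts with the field names from the query above (a real system would use a columnar event store, and this sketch supports equality filters only, not aggregates like a p99 threshold):

```python
# A handful of wide events, one per completed request.
events = [
    {"trace_id": "t1", "service": "checkout", "region": "eu-west",
     "cache_hit": False, "latency_ms": 2400, "error": False},
    {"trace_id": "t2", "service": "checkout", "region": "us-east",
     "cache_hit": True, "latency_ms": 90, "error": False},
    {"trace_id": "t3", "service": "checkout", "region": "eu-west",
     "cache_hit": True, "latency_ms": 110, "error": False},
]

def query(events, **conditions):
    """Filter events by arbitrary field values — no pre-defined metric needed."""
    return [e for e in events
            if all(e.get(k) == v for k, v in conditions.items())]

# "Show me events where cache_hit is false and region is eu-west" — a question
# nobody had to anticipate when the events were written:
slow_misses = query(events, cache_hit=False, region="eu-west")
```

The point is that the question arrives at read time. Any field captured in the event is a dimension you can filter or group by later, with no new instrumentation.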
This newsletter grows by word of mouth… I’d really, truly appreciate it if you could refer a friend. Your referrals make it worthwhile. And this button earns you free access!
Establish a trace ID convention
Every request gets a unique identifier the moment it enters your system. That ID propagates through every downstream call — HTTP headers, message metadata, database session context. Every event includes it.
This is foundational. Without it, you cannot correlate activity across service boundaries. With it, a single ID connects every event across every service for a given request. It’s what lets you stitch individual events into a complete trace — the full journey of a request through your system.
OpenTelemetry handles those mechanics gracefully. Use it. It’s language-agnostic, framework-aware and it solves the propagation problem so you don’t have to.
Enrich your events at every boundary
Not every line of code should produce telemetry. That’s noise. The moments that matter are the boundaries and transitions — the points where a request crosses from one domain to another or where state changes meaningfully:
Request entry — what arrived, who initiated it, what context it carried.
External calls — what you called, what came back, how long it took.
State changes — meaningful transitions: a payment processed, a queue depth shifted, a circuit breaker tripped.
Errors — always, with full context. Stack traces, request metadata, the works.
At each of these boundaries, you’re not firing off a disconnected log line — you’re enriching the event that represents this unit of work. Add fields: payment_gateway_latency_ms=302, cache_hit=false, feature_flag_new_checkout=true. The event gets wider and richer as the request progresses. When the request completes, it captures the full story.
Make every field structured and discrete. Not a freeform string like “User 12345 logged in from 192.168.1.1” — but named fields: user_id=12345, source_ip=192.168.1.1, session_id=abc, duration_ms=47. Structured fields are queryable. “Show me all login failures for user 12345 in the last hour” becomes a single query instead of a regex excavation.
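One way to implement the enrichment pattern — a sketch, reusing the field names from the examples above; a real implementation would tie the event to request context and emit it to the event store automatically on completion:

```python
import time

class WideEvent:
    """One structured event per unit of work, widened at each boundary."""

    def __init__(self, trace_id, service):
        self._start = time.monotonic()
        self.fields = {"trace_id": trace_id, "service": service}

    def add(self, **fields):
        """Enrich the event at a boundary: entry, external call, state change."""
        self.fields.update(fields)
        return self

    def finish(self):
        """Close out the event; a real system would emit it here."""
        self.fields["duration_ms"] = round((time.monotonic() - self._start) * 1000)
        return self.fields

# A request enters; the event widens as the request crosses boundaries.
event = WideEvent(trace_id="abc123", service="checkout")
event.add(user_id=12345, feature_flag_new_checkout=True)    # request entry
event.add(payment_gateway_latency_ms=302, cache_hit=False)  # external call
record = event.finish()
```

Nothing is fired off mid-flight; the boundaries keep adding fields to the same record, and the completed event tells the whole story of the request.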
Derive your metrics from the same data
Once you have rich, structured events flowing through a single store, metrics become a view — not a separate system. You aggregate events to spot trends:
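For instance — a sketch, where `sample` stands in for whatever your event store returns — an error rate and a p99 latency both fall straight out of the same events, computed on read rather than pre-registered at write time:

```python
import math

def error_rate(events):
    """Fraction of events that recorded an error — a counter derived on read."""
    return sum(1 for e in events if e.get("error")) / len(events)

def p99_latency_ms(events):
    """99th-percentile latency from the same events (nearest-rank method)."""
    latencies = sorted(e["latency_ms"] for e in events)
    rank = math.ceil(0.99 * len(latencies))  # 1-based nearest rank
    return latencies[rank - 1]

sample = [
    {"latency_ms": 90, "error": False},
    {"latency_ms": 110, "error": False},
    {"latency_ms": 120, "error": True},
    {"latency_ms": 2400, "error": False},
]
```

Want a new metric next month — say, error rate segmented by feature flag? It’s a new aggregation over data you already have, not a new instrumentation project.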

