Operational observability rocks, and so does Fly.io
Do you know what your infrastructure is doing? Not without good telemetry, you don't... Here's why an ops dashboard is so awesome.
Having fantastic observability metrics is a game changer. The problem is, too many teams think about building telemetry as an add-on, something to do later, once the system is up and running.
That’s totally backwards.
If you’re new, welcome to Customer Obsessed Engineering! Every week I publish a new article, direct to your mailbox if you’re a subscriber. As a free subscriber you can read about half of every article, plus all of my free articles.
Anytime you'd like to read more, you can upgrade to a paid subscription.
Having great telemetry right from the start completely changes your approach to development. It gives you fantastic tools to explore unexpected problems, diagnose root causes, and expose situations you wouldn't even be aware of until it’s too late. Like, in production, with customers groaning about an unusable website.
Here’s a case in point. I love this story Fly.io just published in their latest newsletter:1
For this month's newsletter, something a little different, with apologies to anyone who reads the Fly.io Infra Log (you should! it's great! and only sometimes terrifying!), because this ran there first. Fly.io is today pleased to bring you the story of the petertron4000.
One of the features we offer to applications hosted on Fly.io is a dense set of Prometheus performance metrics, along with Grafana dashboards to render them. Not enough of our customers have dug into this; they're really powerful: you can set alerts with them, forward them elsewhere, and, like a customer we're about to talk about, hold our feet to the fire.
One of the things our users can monitor with these metrics is TLS handshake latency. That is: when browsers make connections to our Anycast edge, and negotiate TLS, we track the time it takes to do that handshake (the negotiation, the certificate, the key exchange, etc). Why do we expose TLS latency metrics? We don't know. It was easy to do. Our metrics are very powerful and we take little credit for it because Prometheus, Vector, and Grafana are doing all the lifting.
Anyways, a consequence of exposing TLS latency metrics is that customers can, if they’re watchful, notice spikes in TLS latency. And so someone did. And they noticed something weird: a pattern of slow TLS handshakes. And they set an alarm on it. And it kept going off. And they reported it to us.
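The newsletter doesn't show what that customer's alarm looked like, but the general shape is easy to sketch: poll a latency metric and complain when a high percentile spikes. Here's a minimal, hypothetical Python example against a generic Prometheus HTTP API; the endpoint, metric name, and threshold are placeholders, not Fly.io's actual values.

```python
# Hypothetical watcher: poll a Prometheus-style endpoint for p99 TLS
# handshake latency and complain when it spikes. Metric name, endpoint,
# and threshold are placeholders, not Fly.io's real ones.
import requests

PROM_URL = "https://prometheus.example.internal/api/v1/query"  # placeholder endpoint
QUERY = (
    "histogram_quantile(0.99, "
    "sum(rate(tls_handshake_duration_seconds_bucket[5m])) by (le))"
)
THRESHOLD_SECONDS = 0.25  # alert if the p99 handshake takes longer than 250 ms

def p99_handshake_seconds() -> float:
    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # Prometheus instant queries return [unix_timestamp, "value-as-string"] pairs.
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    p99 = p99_handshake_seconds()
    status = "ALERT" if p99 > THRESHOLD_SECONDS else "ok"
    print(f"{status}: p99 TLS handshake latency is {p99:.3f}s")
```

In practice you'd encode the same query as a Prometheus alerting rule or a Grafana alert instead of a script, but the point is the same: the customer was watching a number like this, and it kept crossing their threshold.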
Slow TLS handshakes across our Anycast network are something we need to take seriously. So this set off several weeks of engineering effort to isolate the problem. Two things complicated the analysis.
First: looking at all of our own metrics, we could spot slower-than-usual latencies. But they weren't extreme, not like the customer was seeing.
Next, running packet-level analytics, we weren't able to spot any intra-handshake latency coming "from us"; there was never an instance where a browser sent us a TLS message, and we sat around for a couple hundred milliseconds twiddling our thumbs, and only later sent a response. You send us a TCP SYN? Pow! We get you a SYN+ACK. You send us a TLS ClientHello? Blammo! A ServerHello.
What we did notice were specific clients that appeared to have large penalties on “their side” inside the TLS handshake; a delay of up to half a second between the TCP ACK for our ServerHello and the ChangeCipherSpec message that would move the handshake forward. In other words: we would send them our response, and those clients would (maybe) sit around twiddling their thumbs figuring out what to do next.
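Fly.io doesn't publish the capture analysis behind that finding, but the general technique, hunting for long pauses inside the first few packets of each flow, can be sketched. Here's a rough, hypothetical Python/scapy version that skips TLS record parsing entirely and just flags suspicious gaps; the file name, threshold, and packet count are made-up placeholders.

```python
# Rough sketch: read a pcap, group packets by TCP 4-tuple, and flag long
# pauses between the first few packets of each connection (where the TLS
# handshake lives). An approximation for illustration, not Fly.io's tooling.
from collections import defaultdict

from scapy.all import rdpcap, IP, TCP  # pip install scapy

PCAP_FILE = "edge-capture.pcap"  # hypothetical capture file
GAP_THRESHOLD = 0.2              # flag pauses longer than 200 ms
HANDSHAKE_PACKETS = 10           # only inspect the start of each connection

# Group packets by direction-agnostic 4-tuple so both sides of a
# connection land in the same bucket, in arrival order.
flows = defaultdict(list)
for pkt in rdpcap(PCAP_FILE):
    if IP in pkt and TCP in pkt and 443 in (pkt[TCP].sport, pkt[TCP].dport):
        key = tuple(sorted([(pkt[IP].src, pkt[TCP].sport),
                            (pkt[IP].dst, pkt[TCP].dport)]))
        flows[key].append(pkt)

# Within the first few packets of each flow, report any long silence
# between consecutive packets and who finally spoke.
for key, pkts in flows.items():
    for prev, cur in zip(pkts[:HANDSHAKE_PACKETS], pkts[1:HANDSHAKE_PACKETS]):
        gap = float(cur.time) - float(prev.time)
        if gap > GAP_THRESHOLD:
            print(f"{key}: {gap * 1000:.0f} ms pause before packet from {cur[IP].src}")
```

In the real investigation the interesting gap was specifically between the ACK for the ServerHello and the client's ChangeCipherSpec, which needs TLS record parsing on top of this, but the gap-hunting idea is the same.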
Reverse DNS for all these events traced back to AWS.
This was interesting evidence of something, but it didn't dispose of the question of whether there were events causing our own Anycast proxies to lag.
To better catch and distinguish these cases, what we needed was a more fine-grained breakdown of the handshake. What we want are packet-level analytics of every slow TLS handshake happening across our fleet. Problem: we do a lot of TLS handshakes. By the time metrics notice a handshake is slow, it's too late to start capturing packets, so you need to be capturing all the time.
A global tcpdump would produce a huge pcap file on each of our edges within seconds, and even if we could do that, it would be painful to filter out connections with slow TLS handshakes from those files.
Instead, Peter ended up:
making fly-proxy log all slow handshakes; and
writing a tool called petertron4000 (real infra-log heads may remember its progenitor, petertron3000) to temporarily save all packets in a fixed-size ring buffer, ingest the slow-handshake log events from systemd-journald, and write durable logs of the flows corresponding to those slow handshakes.
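Fly.io hasn't published the petertron4000 itself, so purely as an illustration of that shape (a rolling packet buffer that gets flushed when a slow-handshake event shows up in the journal), here's a hypothetical Python sketch. The buffer size, log identifier, and matched message text are assumptions, and the real tool keeps only the flows belonging to the slow handshakes rather than dumping the whole buffer.

```python
# Sketch of the shape described above, not the real petertron4000:
# keep recent packets in a fixed-size ring buffer, follow journald for
# slow-handshake events from the proxy, and dump the buffer when one fires.
# (Packet capture needs root / CAP_NET_RAW.)
import threading
from collections import deque

from scapy.all import sniff, wrpcap   # pip install scapy
from systemd import journal           # pip install systemd-python

RING_SIZE = 100_000                   # how many recent packets to keep (placeholder)
ring = deque(maxlen=RING_SIZE)
lock = threading.Lock()

def remember(pkt):
    # Old packets silently fall off the far end of the deque.
    with lock:
        ring.append(pkt)

def capture():
    sniff(filter="tcp port 443", prn=remember, store=False)

def watch_journal():
    # Follow journald for slow-handshake log lines; the identifier and the
    # matched message text are assumptions for this sketch.
    reader = journal.Reader()
    reader.add_match(SYSLOG_IDENTIFIER="fly-proxy")
    reader.seek_tail()
    reader.get_previous()
    while True:
        reader.wait()
        for entry in reader:
            if "slow TLS handshake" in entry.get("MESSAGE", ""):
                with lock:
                    snapshot = list(ring)
                # The real tool extracts only the relevant flows; dumping the
                # whole buffer keeps the sketch short.
                wrpcap("slow-handshakes.pcap", snapshot, append=True)

if __name__ == "__main__":
    threading.Thread(target=capture, daemon=True).start()
    watch_journal()
```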
With petertron4000 in hand, Peter was able to spot-check a random selection of edges. The outcome:
Several latency issues we were able to trace to bad routing upstream.
Recurrences of the issue described above, where clients are lagging, on “their side”, in the middle of the handshake, overwhelmingly sourced from AWS.
Peter was able to reproduce the problem directly on EC2: running 8 concurrent curl requests on a t2.medium reliably triggers it. Accompanying the TLS lag: high CPU usage on the instance. Further corroborating it: we can get the same thing to happen curl-ing Google.
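The newsletter doesn't include the exact reproduction commands, but the recipe is simple enough to sketch: hammer a small, busy box with concurrent HTTPS requests and watch how long each handshake takes. Here's a hypothetical Python wrapper around curl's write-out timers; the concurrency level comes from the story, the URL is a placeholder, and this isn't Fly.io's actual script.

```python
# Rough reproduction sketch: run N curls at once and report how much time
# each spent inside the TLS handshake (time_appconnect minus time_connect).
import subprocess
from concurrent.futures import ThreadPoolExecutor

URL = "https://example.com/"   # placeholder; the story used Fly.io apps and Google
CONCURRENCY = 8                # the level that reliably triggered the lag on a t2.medium

def handshake_seconds(_):
    out = subprocess.run(
        ["curl", "-s", "-o", "/dev/null",
         "-w", "%{time_connect} %{time_appconnect}", URL],
        capture_output=True, text=True, check=True,
    ).stdout
    tcp_done, tls_done = (float(x) for x in out.split())
    return tls_done - tcp_done

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        for i, secs in enumerate(pool.map(handshake_seconds, range(CONCURRENCY))):
            print(f"request {i}: TLS handshake took {secs * 1000:.0f} ms")
```

On an overloaded instance you'd expect these handshake times to balloon along with CPU usage, which is exactly the pattern described above.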
This was an investigation set in motion by an ugly looking graph in the metrics we expose, and not from any “organically observed” performance issues; a customer saw a spike and wanted to know what it was. We're pretty sure we know now: during the times of those spikes, they were getting requests from overloaded EC2 machines, which slow-played a TLS handshake and made our metrics look gross. :shaking-fist-emoji:.
Having said all that: dragnetting slow TLS handshakes uncovered a bunch of routing issues, and we're on a quest this summer to shake all of those out. Peter's new quest: how do you roll the petertron4000 (call it the petertron4500, or the petertron4000DX2) so that it can run permanently and generate alerts? Fun engineering problem! We have a bunch of ideas, and whatever happens will land on the infra-log. Tune in!
Also, check out Fly Metrics. They're great.
Good metrics are amazing
That’s a great story, and it shows exactly why access to metrics is a game changer, especially when you do it right: build telemetry into your system from day one. Then make sure your development team has full access to production telemetry data. Yes, your development team. I specifically did not say “operations team.”
Granted, your ops team is going to want access. That’s great. For that matter, put a big dashboard in the board room, so the C-Suite can look at your metrics too.
But the most important consumer of this information, in my book, is your development team.
The dev team needs ownership of their software: an active role in deploying it, watching it run, and figuring out why the code they wrote isn’t behaving the way it should.
This is long-term ownership and it’s the best way to build superb software: Share the responsibility of delivering and maintaining reliable software with the folks who wrote it.
What I love about Fly.io is how they’ve done the work for you. You get some fantastic observability tools out-of-the-box. Plus, Fly.io is so much easier to use than AWS, GCP, or Azure. They’ve put together an amazing platform — one that is a joy to use.
I highly recommend checking them out, even if it’s just for internal R&D.
If you find Customer Obsessed Engineering and the Delivery Playbook valuable, please share it with your friends and coworkers.
Fly.io, “Scalable Full Stack Without the Cortisol.” I use Fly.io for my own product development, R&D, and the Boss Logic web site. I’m not affiliated with Fly.io nor was I compensated for this post. I just really like the platform.