Plagued with problems getting to delivery? Solve them with a “steel thread.”
5 principles to get architecturally complete features into production, control defects, and prove your concept-to-production pipeline in a few sprints.
While I was VP of Professional Services at Lightbend, our entire services team used a “steel thread” with our customers. It was transformative. It’s a technique for delivering an architecturally complete feature to production while getting rid of all the risky bits very quickly, usually in just a few sprints. It is hands-down the most effective project delivery approach I’ve ever used.
We used it at two of the “big three” telcos with great success. In both cases, we cut a delivery pipeline that took months down to mere weeks. In one case, we fixed a critical feature that had been causing significant loss of sales whenever Apple or Samsung launched a new phone. The development team had been blocked for the better part of a year with no visible path forward. Using a steel thread, we proved delivery could be done in four weeks, which the delivery team had believed was impossible. But we did it: we delivered to production just days before the deadline. It was the first time any of the big three ran an “iconic launch” (the launch of the iPhone 8) without watching their commerce site fall to its knees.
What is a steel thread?
The fundamental idea of a steel thread comes from bridge building. One of the most effective ways to build a bridge over a chasm is with a “steel thread,” literally a steel cable that gets shot across the chasm. With the thread firmly embedded on each side of the chasm, you start to scale it up — adding a framework of supports, and ultimately hanging the rest of the architecture off that framework to complete a bridge. But that initial steel thread is sufficient to actually cross the chasm, on a small scale. The basic goal is achieved right at the start.
In the context of software, a steel thread is an effective way to accelerate delivery of proven software into a production environment. It means shifting to a delivery method that deals with risk and complexity up-front, while also focusing on a proof of architecture during the first few working sprints. Essential functionality is a “thread” that runs throughout the system, and this thread’s role in the system makes it strong “like steel.” [1]
Sounds like…
If you’re thinking it sounds a lot like run-of-the-mill iterative development, you’re (mostly) wrong. Yes, iteration is part of the solution, but this is not how most teams apply iterative development. Where the steel thread differs is in how we define goals for each iteration, or each sprint, especially early in the delivery cycle.
Most iterative approaches tend to focus on delivering complete features based on customer priority (or based on guidance from the Product Owner). The problem is that teams tend to avoid thinking about the system as a whole, instead focusing effort on the whole of a single feature.
Here’s an example: Let’s say you want to build a brand new, single sign-on identity and access management system. It’s going to handle user accounts, user data management, authentication, service authorization, protecting personal information, and the basics of creating an account, changing your password, and deleting your account. It’s part of an event-driven architecture, and for GDPR we need to support the removal of all personal data.
Most teams will tackle building it something like this:
The most obvious, highest priority is going to be creating a new account. That’s got to be number one (it’s a pretty obvious precursor to using the system).
We’ll need authentication (so we can identify a returning user). That’s probably got to be in the first delivery.
And it’s hard to use the system without service authorization (knowing which services a user can access), so we’ll include that.
Our product team says they’ll need to store data associated with a user account (things like preferences, customer payment details and the like). But that comes later.
Supporting GDPR and the “right to forget” sounds complicated, plus we don’t need it right away, so we can tackle that later.
Likewise, the event driven stuff isn’t needed right away — save that for later.
Changing the user name or password is easy, and we don’t need it until we go live, so save those features for later.
And then we’ll get down to building the first building block: A way to sign up for a new account, authenticate, and get access to parts of the product. Everything starts out great. We have our sign-up feature working in the first sprint.
What could possibly go wrong? Do any of these sound familiar?
Every attempt to push to staging or production introduces a few new surprises. New components, new pieces of the architecture keep getting added in, and each time it complicates and changes delivery. Getting anything ready for integration testing is increasingly difficult.
As you get further and further into the project, problems become more challenging and progress slows. A lot of the tough questions that should have been answered early have been put off — like, how are we going to “forget” a user’s data once it ends up all over our system? The team is realizing there are some big unknowns that are blowing up the project timeline.
Changes keep popping up. Maybe new features or requirements, or clarifications on old ones, mean you have to go back and fix code that’s supposed to be done by now. Sometimes it’s details about a feature finally surfacing, and proving to be much, much more involved than hoped. There isn’t enough time to get things ready for launch.
Your team struggles to move out of the development environment into a production system because the software is rigid. You have hardcoding and shortcuts driving up your tech debt. Implementation details have been built into the code and only one or two team members seem to know the “magic” to get the system running.
You have different teams working on different pieces, and nobody has a solid understanding of the whole product. You don’t have a clear picture of what other teams are doing, or how it relates to your own work. There is no clear path to get the entire product done and in production.
There is no easy way to see your most recent code working in a real, production-like environment. You can test your changes in your own little sandbox, but you know when you get your code into the main branch you’ll have problems and will have to fix something.
The application is fragile and dependent on the environment. Any time you want to deploy to a different environment — a new development environment, a review app on staging or a test server, or getting to production — someone has to do something by hand. It slows everyone down, and if the environment changes you might have to change some code. It’s a never-ending cycle.
You feel like you have a huge bag of parts, but nobody has tried to assemble everything to ascertain completeness or correctness of function. How the whole system comes together isn’t clear, and the scope and scale of doing that is growing out of control.
There’s a “we’ll fix it later” attitude in play, even though most of the team recognizes it’s a fallacy and projects never actually get the time or resources to go back and fix it later.
All of these problems are late stage problems. They don’t show up until you’re pretty far into a project — often, so far in that it’s too late. The team starts compensating by taking drastic action: Cutting out features, working overtime, making compromises and shortcuts. Most often it leads to missed deadlines and burned out teams.
How is a steel thread different?
We want to refine our thinking about what an initial delivery goal is, and in so doing we can get rid of a host of problems that plague development teams.
Most teams are looking at breaking apart a list of requirements into discrete features, then shipping one complete feature. But that leads to all of the problems we just talked about. If we only build one tiny part of the system, we’re leaving potential land mines in our path. All those other pieces we didn’t think about could blow up in the future.
But we can’t just build it all, either. We still have the constraint to get something into production fast. Iterative development.
So how do we balance these apparently contradictory goals? We start by redefining a few core principles.
All the way to production on day one
Let’s define what “done” means: “It’s only done if it’s working in production.”
By “all the way to production,” I don’t necessarily mean open to the world and usable by anyone. What I do mean is deployed to a production-ready server where it can be used by real users. The production-ready server needs to be identical to the real production environment:
Real infrastructure that mirrors the production environment or, even better, is the actual production environment.
Deployment to the environment is handled “for real” using your pipeline.
I can’t overstate how important this is. The deployment environment has to be a production replica, exactly like the real production environment. Likewise, the deployment mechanism needs to be real. By making it real, we make it reconfigurable, and we don’t waste any time adjusting it or fiddling around trying to get things from development to production.
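One way to keep the same build deployable to every environment is to read environment-specific values at startup instead of hardcoding them. Here is a minimal sketch of that idea. The `Settings` class, the `load_settings` function, and the variable names `DATABASE_URL` and `EVENT_LOG_URL` are hypothetical, chosen only for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Environment-specific values the application needs at startup."""
    database_url: str
    event_log_url: str


def load_settings(env=None):
    """Build Settings from environment variables, failing loudly if one is missing.

    The variable names are assumptions for this sketch, not a standard.
    """
    env = os.environ if env is None else env
    required = ("DATABASE_URL", "EVENT_LOG_URL")
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError("missing configuration: " + ", ".join(missing))
    return Settings(
        database_url=env["DATABASE_URL"],
        event_log_url=env["EVENT_LOG_URL"],
    )
```

Because nothing about a particular environment lives in the code, the pipeline can push a different set of values into each environment and deploy the identical artifact everywhere.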
Nothing is complete on its own
Essentially this means delivering a thin slice of the entire product architecture all the way to production as quickly as possible. Ideally, this happens in a few weeks, not months. The slice of the architecture needs to touch on most of the major components of the system, but each piece can be incomplete.
In other words, you need enough to put the component into the delivery and demonstrate it, but it doesn’t need to be finished. For example, if you’re delivering an event streaming system that uses Kafka, you’ll ideally stand up a simple Kafka instance with a single topic, and a few components can dump messages on the topic. It’s not done, there will surely be much more infrastructure to accommodate automatic scaling, multiple topics, and a host of other features — but it’s there, and it’s working, and can be deployed to your production environment.
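To make “incomplete but working” concrete, here is a rough sketch of the smallest thing that behaves like a single-topic event log. It is an in-memory stand-in written for illustration, not Kafka and not a Kafka client API. The `EventLog` class, its methods, and the topic name `iam-events` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """In-memory stand-in for one topic on an event log (not a Kafka client)."""
    topic: str
    _events: list = field(default_factory=list)

    def publish(self, event: dict) -> int:
        """Append a copy of the event and return its offset; nothing is ever rewritten."""
        self._events.append(dict(event))
        return len(self._events) - 1

    def read_from(self, offset: int = 0) -> list:
        """Return copies of all events at or after the given offset."""
        return [dict(e) for e in self._events[offset:]]


# A couple of components dumping messages on the one topic.
log = EventLog(topic="iam-events")
first = log.publish({"type": "user_created", "user_id": "u-1"})
second = log.publish({"type": "password_changed", "user_id": "u-1"})
```

The point of the thin slice is that producers and consumers are written against this shape from the start, so the real broker can grow behind it later.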
Answer the tough questions first
Our goal in delivering a complete slice of architecture over a working end-to-end pipeline means we have to “touch all the pieces.” That means implicitly prioritizing all the difficult problems. Going back to our example case: We can’t deliver an event streaming architecture to production without standing up Kafka, and having a few components write events to it.
That means we have to solve at least the high level tough problem of “what is going to be our event log?” In this case, it’s Kafka, and we have to include it. One more risk out of the way. True, we could have said “we’ll go back and add that later,” but that’s just kicking the can down the road. We need to know it will work, and we need to have all the major architectural pieces in play from day one. We will go back later to tackle things like autoscaling and multiple topics — but only after we get the first product increment into production.
Wasting effort is unacceptable
Anything we do by hand is not adding value. Sure, it only takes a few minutes to go start up a new server instance. And a few to deploy your code there. And just a moment to change a couple configuration files. And another thirty seconds to start the server. And then it takes five minutes to do it again. While everyone is waiting. The net-net amounts to a huge loss of time over the project, so don’t do it.
If we’re striving to have a functional deployment, in production, within a matter of weeks, we need to rely on automation. Never do anything manually; instead, use the time to build automation. Your deployment pipeline absolutely must:
Be fully configured and managed through Infrastructure as Code.
Automate all of the steps necessary to deploy to any given environment.
Provide enough feedback to give you clear guidance on what went wrong, if something fails.
By the time we put code on the production server, an end-to-end skeleton of our pipeline should be in place — but it might not be complete. Start with the basics: Run automated tests, push configuration into the environment, deploy the code, and start the target environment up. Chapter 2.8 Delivery processes & tools in the playbook goes deep on this topic.
Later we can add to it by hanging new capabilities off the framework. Over time we’ll start doing container scanning, penetration testing, and performance testing to name a few — all using our totally automated deployment pipeline.
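The basic skeleton described above (run automated tests, push configuration, deploy the code, start the environment) can be sketched in a few lines. This is an illustration under assumptions, not a real CI tool: `Step`, `PipelineError`, `run_pipeline`, and the `noop` placeholder are hypothetical names, and each step would call real tooling in practice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    run: Callable[[], None]  # expected to raise an exception on failure


class PipelineError(Exception):
    """Carries the failing step's name so the feedback is actionable."""


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order, stop at the first failure, return the completed step names."""
    completed = []
    for step in steps:
        try:
            step.run()
        except Exception as exc:
            raise PipelineError(f"step '{step.name}' failed: {exc}") from exc
        completed.append(step.name)
    return completed


def noop() -> None:
    """Placeholder for real work; each step starts out this thin."""


skeleton = [
    Step("run automated tests", noop),
    Step("push configuration", noop),
    Step("deploy code", noop),
    Step("start environment", noop),
]
```

New capabilities such as container scanning become additional `Step` entries, which is the sense in which they hang off the existing framework.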
Automation is the only path to higher environments
That goes for testing as well — the only way to move from a lower environment (like development) to a higher environment (like production) is with automation. That means verification and testing is done in code:
Testing needs to be fully automated; no manual test steps.
Ideally, engineers write most of the tests.
All functional and UX testing should be built by the development team and be part of the codebase. You can’t automate moving to higher environments if it’s not.
The only “downstream” testing should be penetration and performance testing.
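As a rough illustration of “automation is the only path,” the sketch below allows promotion to the next environment only when every automated check has passed. The environment names, their ordering, and the `promote` function are assumptions made for this example.

```python
ENVIRONMENTS = ["development", "staging", "production"]  # hypothetical ordering


def promote(current: str, check_results: dict) -> str:
    """Return the next higher environment, but only if every automated check passed.

    check_results maps a check name to True (passed) or False (failed).
    """
    if not check_results:
        raise RuntimeError("promotion blocked: no automated checks were run")
    failed = sorted(name for name, passed in check_results.items() if not passed)
    if failed:
        raise RuntimeError("promotion blocked by failing checks: " + ", ".join(failed))
    index = ENVIRONMENTS.index(current)
    if index == len(ENVIRONMENTS) - 1:
        raise ValueError(f"'{current}' is already the highest environment")
    return ENVIRONMENTS[index + 1]
```

There is deliberately no override argument: if a check cannot be expressed in code, it cannot gate a promotion.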
I’m a huge advocate for test driven development (TDD) in its true form. That does not mean writing all your test code up front. It means writing a single test, making it pass, and then iterating until you have enough solid code that it can go out the door.
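Here is what one turn of that loop might look like, using Python's standard `unittest` module. The single test was written first, and `create_account` contains only enough code to make it pass. The function, its dictionary-based storage, and the example address are hypothetical.

```python
import unittest


def create_account(accounts: dict, email: str) -> dict:
    """Just enough code to make the single test below pass."""
    account = {"email": email, "active": True}
    accounts[email] = account
    return account


class CreateAccountTest(unittest.TestCase):
    def test_new_account_is_stored_and_active(self):
        accounts = {}
        account = create_account(accounts, "ada@example.com")
        self.assertTrue(account["active"])
        self.assertIn("ada@example.com", accounts)
```

The next iteration would add one more test (say, rejecting a duplicate address), watch it fail, and extend `create_account` only as far as that test demands.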
A steel thread feature increment
With our new principles in hand, let’s revisit our use case. To recap, we’re building an identity and access management (IAM) system:
It creates and manages user accounts.
It supports authentication (identifying a user), as well as service authorization (access to a specific service).
It stores personal information (like customer payment methods).
It’s GDPR compliant with a “right to forget” (meaning we must be able to “forget” all personal data).
It taps into the event stream, listening for and emitting appropriate events.
Using a steel thread approach, we have to implement an architecturally complete slice of the product. We can’t just create a new user, and leave the rest for later. So our “day one” goal includes tackling all the unknowns up front, especially some of the tough ones.
It doesn’t have to be a complete product. We can go back and add new features (like the ability to change my password) later.
But all the major pieces have to be there: It has to integrate with the event stream, emitting events like “new user created.” It has to support GDPR by letting me completely “forget” all the personal data in the account.
With this new perspective, we can redefine our first feature increment:
1. One of the hardest things is the “right to forget.” That’s got to be part of the first deliverable: a user can tell us to “forget” all their personal data. It’s hard because user data can spread throughout the system, making it a tough problem.
2. The system uses an event streaming architecture. Build that from day one. This also makes #1 difficult.
3. Obviously, to support #1 and #2, we need to be able to receive a request to “forget” a user, which means we also need to create a user account, and store some forgettable data in it.
4. If a user changes their password, we need to make sure the change propagates throughout the system fast. That can be an interesting problem, so we should handle it up front.
5. We have to push it to production using an automated pipeline.
We’ve completely redefined our priorities. Notice that most of the things we thought weren’t important are now part of our first delivery! We’ve also added a new one: Our pipeline has to be automated — remember, the only path to higher environments is through automation.
This sounds like a lot, and it’s absolutely correct to say the scope of our first increment just got quite a bit bigger. That’s ok, because we have also prioritized all of our risk right up front. By the time we finish this first iteration, we’ll have an end-to-end working product. It will include all the major architectural components we rely on, like event streaming and an IAM component that lets us store and forget personal information.
It might take more than one sprint. That’s fine. The key is to refine it until you are delivering just enough to call it architecturally complete. With the architectural skeleton complete, we won’t have any major unknowns surprising us in the future. The project will get easier — not harder — as we near the finish line.
Prioritize the tough problems
What we’ve done is prioritize the tough problems. This is important because it’s the tough problems that tend to blow up later. With our new delivery goal in mind, we’re going to focus our efforts in a different way.
First we have to think about how the “right to forget” affects our fundamentally immutable event streaming system. If we write personal data to the event log, and we can never go back and change the event log, how do we “forget” that data?
We have to make it work, so that means building on a solid foundation: The event stream becomes our single source of truth, a pillar we build on. It has to be there.
Sensitive changes in the system have to happen quickly, and everywhere. Like changing a password, or the “right to forget.” How do we make that happen across the system very quickly?
We can stand up a Kafka server on day one. Maybe we’ll swap it out or refine it later, but it’s an event log and it works. That gives us a working foundation, and it only takes a couple of hours to set up. Refining it with autoscaling, multiple topics, and all that can happen later. We don’t need it to be finished; we just need it to be a working part of the architecture.
Now we can start sending and receiving events. Creating a new user is pretty easy, that just sends a “new user” event to the event log. But now we can start tapping into the event log and listening for important events — like a “password changed” event. Components can receive that event and immediately kick the user out of the system.
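The flow just described can be sketched with a small in-memory publish/subscribe stand-in. This is an illustration only: `EventBus` is not a Kafka API, and `SessionService`, the event names, and the payload shape are hypothetical.

```python
from collections import defaultdict


class EventBus:
    """In-memory stand-in for publishing to and consuming from the event log."""

    def __init__(self):
        self.log = []  # append-only record of every event published
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append((event_type, payload))
        for handler in self._handlers[event_type]:
            handler(payload)


class SessionService:
    """Hypothetical component that tracks which users are signed in."""

    def __init__(self, bus):
        self.active = set()
        bus.subscribe("user_signed_in", lambda p: self.active.add(p["user_id"]))
        # A password change immediately kicks the user out.
        bus.subscribe("password_changed", lambda p: self.active.discard(p["user_id"]))


bus = EventBus()
sessions = SessionService(bus)
bus.publish("new_user", {"user_id": "u-1"})
bus.publish("user_signed_in", {"user_id": "u-1"})
bus.publish("password_changed", {"user_id": "u-1"})
# At this point "u-1" is no longer in sessions.active.
```

The account component never calls the session component directly. It only publishes the event, which is what lets a sensitive change reach every listener without the publisher knowing who they are.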
The GDPR requirements are an interesting challenge though. With all these different systems potentially listening to (and storing a copy of) user data, how do we forget that data? The answer is cryptographic shredding. Any personal data will be encrypted in the user record, and in the event stream. The only way to read the data is to get the encryption key from a secret store, so we have to include something like HashiCorp Vault in our day-one delivery. But now we can instantly “forget” all of a user’s personal data by deleting their cryptographic key. Once that’s gone, even the data on the event store can’t be decrypted.
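Here is a minimal sketch of the shredding idea, under loud assumptions. `KeyStore` is an in-memory stand-in for a secret store and uses no Vault API. The cipher is a toy keystream built from SHA-256 so the example runs on the standard library alone; it is not suitable for production, where an authenticated cipher from a vetted library would take its place. All names are hypothetical.

```python
import hashlib
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 in counter mode). Illustration only, NOT production crypto."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


class KeyStore:
    """In-memory stand-in for a secret store holding one key per user."""

    def __init__(self):
        self._keys = {}

    def create(self, user_id: str) -> None:
        self._keys[user_id] = secrets.token_bytes(32)

    def get(self, user_id: str) -> bytes:
        return self._keys[user_id]  # raises KeyError once the user is forgotten

    def forget(self, user_id: str) -> None:
        """Crypto-shred: deleting the key makes every copy of the data unreadable."""
        self._keys.pop(user_id, None)


def encrypt(store: KeyStore, user_id: str, plaintext: str) -> bytes:
    nonce = secrets.token_bytes(16)
    data = plaintext.encode("utf-8")
    stream = _keystream(store.get(user_id), nonce, len(data))
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


def decrypt(store: KeyStore, user_id: str, blob: bytes) -> str:
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(store.get(user_id), nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream)).decode("utf-8")


store = KeyStore()
store.create("u-1")
# Only the encrypted value is ever written to the immutable event log.
event = {"type": "new_user", "user_id": "u-1", "email": encrypt(store, "u-1", "ada@example.com")}
```

Calling `store.forget("u-1")` leaves the event in the log untouched, but any later `decrypt` for that user fails with a `KeyError` because the key no longer exists.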
Now we’ve explored and tackled the toughest problems early — which means, if something doesn’t work right we’ll find out early, and we have time to explore other options.
We’ve done it — we’ve delivered an architecturally complete slice of the system to production. From here on out, the rest is relatively easy: Adding features like changing user details, adding more personal data, and setting up authorization.
And this is exactly what we did, in two sprints, with one of our most recent clients and a team of two engineers.
A steel thread transforms delivery
I already said the steel thread was transformative. We do it with a relatively simple change in how we prioritize work: Focus on “just enough” functionality to deliver an architecturally complete foundation.
All the way to production on day one. It’s got to work “for real,” but allow the product to reach production with unfinished and minimal functionality.
Nothing is complete on its own. Focus not on a “product feature” but on delivering “architectural completeness.”
Answer the tough questions first. Deliver end-to-end functionality, eliminating unknowns and risks as early as possible.
Wasting effort is unacceptable. Get rid of manual work and waste by automating configuration and deployment from day one.
Automation is the only path to higher environments. The code has to verify correctness, which means building tests in from the start.
Following these principles you’ll find the problems raised in the first half of this article are naturally dealt with.
You eliminate risk early and make sure there are no land mines in your path. By tackling the tough problems first and leaving the simple problems for later, variability is front-loaded. Later stages of the project smooth out and become predictable.
You prove the architecture works. This means you won’t be in the 11th hour when you realize your basic assumptions are no good. (Imagine realizing the complexity of the “right to forget” long after an immutable architecture had been set in stone.)
There are no more surprises. With all the major parts in the system, you won’t have nasty surprises when someone introduces a major new component, and you won’t find yourself making drastic changes to code you thought was done.
A complete delivery pipeline saves time every single day. Emphasizing automation and Infrastructure as Code naturally pushes hardcoding and shortcuts out of the system. That saves time in a compounding manner, and speeds up delivery.
No more “magic” needed to deploy. With a pipeline that everyone can use, there are no more complicated tricks to get things deployed. Anyone can do it, any time of day, which improves visibility and awareness across the organization.
You have a known solution and end-to-end functionality from day one. No more “bag of parts” that nobody knows how to assemble — instead, you have a working system with clear visibility into what’s done, and what’s not done.
Silos tend to break down now that bottlenecks and unknowns are eliminated.
No more “we’ll fix it later” fallacy, because you did it right the first time.
A steel thread is one tool you can pull out of your toolbox. It’s one I recommend using every single time. In future articles I’ll expand on using more strategic approaches with systems design and project execution. You’ll also find the Delivery Playbook provides a framework for success that you can quickly adopt.
[1] Narayanappa, Bae, Alkobaisi, Debnath, “Steel Threads: Framework for Developing Software System Architecture,” https://www.cs.du.edu/~snarayan/sada/docs/steelthreads.pdf.