Plagued with problems getting to delivery? Solve them with a “steel thread.”

5 principles to get architecturally complete features into production, control defects, and prove your concept-to-production pipeline in a few sprints.

Jan 20, 2024


While I was VP of Professional Services at Lightbend, our entire services team used a “steel thread” with our customers. It was transformative. It’s a technique for delivering an architecturally complete feature to production while getting rid of all the risky bits very quickly, usually in just a few sprints. It is hands-down the most effective project delivery approach I’ve ever used.

We used it at two of the “big three” telcos with great success. In both cases, we cut a delivery pipeline that took months down to mere weeks. In one case, we solved a problem with a critical feature that was causing a significant loss of sales whenever Apple or Samsung launched a new phone. The development team had been blocked for the better part of a year with no visible path forward. Using a steel thread, we proved delivery could be done in four weeks, something the delivery team believed was impossible. But we did it: we delivered to production just days before the deadline. The iPhone 8 launch was the first time any of the big three ran an “iconic launch” without watching their commerce site fall to its knees.


What is a steel thread?

The fundamental idea of a steel thread comes from bridge building. One of the most effective ways to build a bridge over a chasm is with a “steel thread,” literally a steel cable that gets shot across the chasm. With the thread firmly embedded on each side of the chasm, you start to scale it up — adding a framework of supports, and ultimately hanging the rest of the architecture off that framework to complete a bridge. But that initial steel thread is sufficient to actually cross the chasm, on a small scale. The basic goal is achieved right at the start.

In the context of software, a steel thread is an effective way to accelerate delivery of proven software into a production environment. It means shifting to a delivery method that deals with risk and complexity up-front, while also focusing on a proof of architecture during the first few working sprints. Essential functionality is a “thread” that runs throughout the system, and this thread’s role in the system makes it strong “like steel.”

Sounds like…

If you’re thinking it sounds a lot like run-of-the-mill iterative development, you’re (mostly) wrong. Yes, iteration is part of the solution, but this is not how most teams apply iterative development. Where the steel thread differs is in how we define goals for each iteration, or each sprint, especially early in the delivery cycle.

Most iterative approaches tend to focus on delivering complete features based on customer priority (or based on guidance from the Product Owner). The problem is that teams tend to avoid thinking about the system as a whole, instead focusing effort on the whole of a single feature.

Here’s an example: Let’s say you want to build a brand new single sign-on identity and access management system. It’s going to handle user accounts, user data management, authentication, service authorization, protecting personal information, and the basics of creating an account, changing your password, and deleting your account. It’s part of an event-driven architecture, and for GDPR we need to support the removal of all personal data.

Most teams will tackle building it something like this:

The most obvious, highest priority is going to be creating a new account. That’s got to be number one (it’s a pretty obvious precursor to using the system).
We’ll need authentication (so we can identify a returning user). That’s probably got to be in the first delivery.
And it’s hard to use the system without service authorization (knowing which services a user can access), so we’ll include that.
Our product team says they’ll need to store data associated with a user account (things like preferences, customer payment details and the like). But that comes later.
Supporting GDPR and the “right to be forgotten” sounds complicated, plus we don’t need it right away, so we can tackle that later.
Likewise, the event-driven stuff isn’t needed right away, so save that for later.
Changing the user name or password is easy, and we don’t need it until we go live, so save those features for later.

And then we’ll get down to the first building block: a way to sign up for a new account, authenticate, and get access to parts of the product. Everything starts out great. We have our sign-up feature working in the first sprint.

What could possibly go wrong? Do any of these sound familiar?

Every attempt to push to staging or production introduces a few new surprises. New components, new pieces of the architecture keep getting added in, and each time it complicates and changes delivery. Getting anything ready for integration testing is increasingly difficult.
As you get further and further into the project, problems become more challenging and progress slows. A lot of the tough questions that should have been answered early have been put off — like, how are we going to “forget” a user’s data once it ends up all over our system? The team is realizing there are some big unknowns that are blowing up the project timeline.
Changes keep popping up. Maybe new features or requirements, or clarifications on old ones, mean you have to go back and fix code that’s supposed to be done by now. Sometimes it’s details about a feature finally surfacing, and proving to be much, much more involved than hoped. There isn’t enough time to get things ready for launch.
Your team struggles to move out of the development environment into a production system because the software is rigid. You have hardcoding and shortcuts driving up your tech debt. Implementation details have been built into the code and only one or two team members seem to know the “magic” to get the system running.
You have different teams working on different pieces, and nobody has a solid understanding of the whole product. You don’t have a clear picture of what other teams are doing, or how it relates to your own work. There is no clear path to get the entire product done and in production.
There is no easy way to see your most recent code working in a real, production-like environment. You can test your changes in your own little sandbox, but you know when you get your code into the main branch you’ll have problems and will have to fix something.
The application is fragile and dependent on the environment. Any time you want to deploy to a different environment — a new development environment, a review app on staging or a test server, or getting to production — someone has to do something by hand. It slows everyone down, and if the environment changes you might have to change some code. It’s a never-ending cycle.
You feel like you have a huge bag of parts, but nobody has tried to assemble everything to ascertain completeness or correctness of function. How the whole system comes together isn’t clear, and the scope and scale of doing that is growing out of control.
There’s a “we’ll fix it later” attitude in play, even though most of the team recognizes it’s a fallacy and projects never actually get the time or resources to go back and fix it later.

All of these are late-stage problems. They don’t show up until you’re pretty far into a project, often so far in that it’s too late. The team starts compensating by taking drastic action: cutting out features, working overtime, making compromises and taking shortcuts. Most often it leads to missed deadlines and burned-out teams.
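One symptom from that list, an application that is fragile and tied to its environment, is easy to show in miniature. The sketch below is a hypothetical illustration, not code from any of the projects described here: the names `Settings`, `load_settings`, `IDENTITY_DB_URL`, and `IDENTITY_EVENT_BUS_URL` are all assumptions invented for the example. It reads deployment details from environment variables and fails fast when one is missing, so the same build can move from a sandbox to staging to production without someone editing code by hand.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Deployment details that differ between environments (hypothetical)."""
    db_url: str
    event_bus_url: str


# Hypothetical variable names, chosen only for this illustration.
REQUIRED_VARS = {
    "db_url": "IDENTITY_DB_URL",
    "event_bus_url": "IDENTITY_EVENT_BUS_URL",
}


def load_settings(env=None):
    """Build Settings from an environment mapping, failing fast on gaps."""
    env = os.environ if env is None else env
    # Treat unset and empty values the same way: both count as missing.
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise RuntimeError("missing configuration: " + ", ".join(sorted(missing)))
    return Settings(**{field: env[var] for field, var in REQUIRED_VARS.items()})


if __name__ == "__main__":
    # Same code, different environment: only the injected values change.
    staging = load_settings({
        "IDENTITY_DB_URL": "postgres://staging.example/identity",
        "IDENTITY_EVENT_BUS_URL": "kafka://staging.example:9092",
    })
    print(staging)
```

Passing the environment in as a plain mapping is a deliberate choice for the sketch: it keeps the function easy to exercise in a test, and it makes a missing setting show up as one clear error at startup rather than as a surprise on the first deploy to a new environment.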

How is a steel thread different?

We want to refine our thinking about what an initial delivery goal is, and in so doing we can get rid of a host of problems that plague development teams.

Most teams are looking at breaking apart a list of requirements into discrete features, then shipping one complete feature. But that leads to all of the problems we just talked about. If we only build one tiny part of the system, we’re leaving potential land mines in our path. All those other pieces we didn’t think about could blow up in the future.

But we can’t just build it all, either. We still have the constraint to get something into production fast. That is the whole point of iterative development.

So how do we balance these apparently contradictory goals? We start by redefining a few core principles.
