Ephemeral environments are the best way to fight “waste gremlins”
Save yourself from hundreds of hours of unnecessary work and headaches. Ephemeral environments are the key.
Introduction
Ephemeral environments, also known as “on-demand” or “dynamic” environments, are lightweight and often short-lived replicas of your production environment. Except for scale, they mirror your production configuration, guaranteeing consistency in every deployment.
They’re created entirely automatically and can be spun up on the fly for experimentation, validation, testing, and even managing your release cycle.
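To make “created entirely automatically” concrete, here’s a minimal sketch of one common approach: keying an isolated stack to the current git branch. It assumes Docker Compose v2 and a docker-compose.yml that mirrors your production services; both are assumptions about your tooling, not a prescription.

```python
"""Spin up (and tear down) an ephemeral environment keyed to a git branch.

Assumes: Docker Compose v2 is installed, and a docker-compose.yml in the
repo root mirrors the production service topology (web, db, cache, ...).
"""
import re
import subprocess


def project_name() -> str:
    # Derive a unique, Compose-safe project name from the current branch,
    # so every branch gets its own isolated stack.
    branch = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return "env-" + re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")


def up(name: str) -> None:
    # Build and start the stack; --wait blocks until services report healthy.
    subprocess.run(
        ["docker", "compose", "-p", name, "up", "--build", "--detach", "--wait"],
        check=True,
    )


def down(name: str) -> None:
    # Destroy the stack and its volumes -- nothing lingers to drift.
    subprocess.run(
        ["docker", "compose", "-p", name, "down", "--volumes"],
        check=True,
    )


if __name__ == "__main__":
    name = project_name()
    up(name)
    print(f"Ephemeral environment '{name}' is running.")
```

Run it on a feature branch and you get an isolated, production-shaped stack; run down() when you’re finished and nothing is left behind to drift.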
The whole point of ephemeral environments is to streamline development and eliminate contention, consistency problems, and wasted effort. As a developer, unless you’ve been uniquely lucky, you’ve run into these problems — likely, many times.
If you’re new, welcome to Customer Obsessed Engineering! Every week I publish a new article, direct to your mailbox if you’re a subscriber. As a free subscriber you can read about half of every article, plus all of my free articles.
Anytime you'd like to read more, you can upgrade to a paid subscription.
The waste gremlins
Subtle inconsistencies between environments can be a real gremlin, constantly popping up and destroying productivity. I learned exactly how big an impact this can have when I agreed to help a product team solve some delivery issues.
The team was pushing a release out the door that same week. We had a demo scheduled for Friday, the end of the sprint — the first demo the team had given in more than two months. In my mind, this didn’t bode well. I find that teams who treat demos as part of their regular routine almost always do a better job of them.
As Friday drew near, the team was still working on completing features. The demo seemed to be a last-minute afterthought. On Thursday, the team was scrambling to prepare. Code changes and commits were being made just hours before “go time,” and the database was still being configured and prepped with data that would make for a plausible demo.
Friday morning, the demo kicks off. A few PowerPoint slides about what the team accomplished, and then the product is put on screen. It’s immediately clear that something is wrong — all the demo data is missing, so we’re basically looking at a blank page. “Whoops, looks like we have a little glitch,” says the team lead as everyone else scrambles to figure out what’s wrong. Not a great start.
After about 10 minutes of poking around, we find out a last-minute change had been made to the code — and that change had also pushed a change to the database configuration. Basically, it’s looking for a development database when it should be using the demo database. A few more minutes, the demo server configuration is fixed, and the demo is seemingly back on track.
Except it’s not, because after a couple of screens we’re seeing all kinds of problems. The application is acting strange, taking forever to load anything, and in general making for a really horrible demo. More scrambling, more apologies: “it’s working fine in dev, we’ll have it fixed in a moment.” We don’t actually get it fixed — instead, we stumble through a partial demo, showing screens that don’t work but, at least, look the way they’re supposed to. The team lead explains what each screen should be doing.
The demo is a flop. Afterwards, we do a quick retrospective to figure out what’s next. A few of the team take on diagnosing what the issues are, and we adjourn.
Over the next few days, the team comes up with a dozen different theories about what’s wrong, making corresponding tweaks to the code and redeploying. Every single time, we see software that seems to work just fine on a laptop utterly fail in the demo environment. Theories abound: the lightweight dev database is fundamentally different from the production database, the file system behaves differently, the server environment doesn’t have enough memory, and more.
It takes three days for the team to find the problem.
Diagnosing the root cause
During this time, I come to learn this experience is not unusual. In fact, it’s probably more typical than atypical. The last demo — over two months past — had similar issues. And the one prior to that. It inspires me to add up all the time wasted hunting for that gremlin of inconsistency.
I also learn about all the time spent managing different environments. Manually making changes to align with new application features, updating schemas, loading test data, and running various scripts. Every sprint there was work going into tweaking different environments.
A lot of time was being wasted across the team. There were two development servers, one testing server, and a staging server (which was also used for demos, and that dual use had caused a few problems, too) — not to mention production, which thus far had been entirely hands-off. I added up three weeks of sprint time spent managing all that infrastructure. Across the 7 people on the team, that amounts to 105 person-days: 15 working days times 7 people. (I made some assumptions, and total accuracy was impossible, but I stand by the figure being more or less accurate.)
It was Wednesday when the team lead excitedly announced they had found the root cause of the problem. It turned out to be a simple configuration issue on the staging server, a step that had been missed when prepping the demo environment — a step that was slightly different than usual because the server was used for both demos and staging. Ultimately, one single missed command had cost the team half a week chasing a bunch of red herrings.
I told the team lead they had misdiagnosed the root cause.
The real problem
The real problem was that the team spent any time at all managing each of these different environments.
The team was struggling with contention for limited server resources. They were double-booking one of those environments (staging doubled as the demo server), and more importantly, the team was constrained in when and where they could build, deploy, test, and experiment. The team constantly had to ask, “Who’s using this environment? Can I use it? Will I interrupt someone else or overwrite their work?”
Environmental drift constantly caused problems. A limited number of development environments, combined with rapidly changing application features, meant servers were constantly tweaked, un-tweaked, and reconfigured. Environments that looked fine behaved oddly because someone, somewhere had made a change.
Lots of time was wasted chasing misdiagnosed problems. Tests would fail, and developers had no idea if it was a problem with new code, or a problem with the server configuration. Team members routinely shrugged, saying, “it works on my machine,” or “it works in dev.”
Inconsistent data led to mysterious problems. Sometimes the problem was poorly crafted or unrealistic sample data; other times it was a legitimate bug. Not having a test data management (TDM) strategy meant more wasted time crafting the data, then figuring out whether a problem was in the data or in the code.
Inevitably, there was contention for time — the team rarely accounted for all the infrastructure and environmental tinkering in their sprint planning, so sprints routinely missed delivery targets. No wonder, since the team had to fuss with environment setup (and diagnose strange, unexpected problems) rather than write code.
It all repeatedly wasted the team’s time, as well as money on infrastructure. Those four “always on” environments were largely idle, with none of them exceeding 10% utilization in any given month.
To me the real root cause was clear: poor controls and lack of automation, leading to inconsistency across all of these environments, wasted time, confusion, resource contention, and chaos. Not to mention, failed demos.
Fixing the real root cause
There was another problem that we had to address: the way the team thought about these seemingly disparate problems. It required a shift in perception. Rather than thinking about specific incidents — like the demo server misconfiguration — we had to pull back and look at the overall situation.
The team was treating each case as a unique problem, and fixing only that one problem.
Instead, we had to start thinking about why these problems kept happening. That meant thinking strategically, not tactically.
The root of the problem was actually that the team kept unreliably doing things that could be reliably automated. By automating all of the environment management, we could fix all of those problems in one shot.
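As a sketch of what “one shot” can look like (building on the branch-keyed stack above, and assuming two hypothetical scripts, migrate.py and seed_demo_data.py, standing in for whatever schema and data tooling a project already has), the environment, its schema, and its demo data all come from a single command instead of a manual checklist:

```python
"""Provision an ephemeral environment in one shot: services, schema, data.

Assumes the Docker Compose stack from the earlier sketch. The migrate.py
and seed_demo_data.py scripts are hypothetical stand-ins for a project's
existing schema-migration and data-seeding tooling.
"""
import subprocess


def provision(name: str) -> None:
    steps = [
        # 1. Start an isolated copy of the production topology.
        ["docker", "compose", "-p", name, "up", "--build", "--detach", "--wait"],
        # 2. Apply the schema exactly as a deployment would.
        ["docker", "compose", "-p", name, "exec", "web", "python", "migrate.py"],
        # 3. Load realistic, version-controlled demo/test data.
        ["docker", "compose", "-p", name, "exec", "web", "python", "seed_demo_data.py"],
    ]
    for step in steps:
        subprocess.run(step, check=True)  # fail fast if any step breaks


if __name__ == "__main__":
    provision("env-demo")
```

Because the same command provisions an environment for a demo, a test run, or an experiment, there is no checklist to forget and far less room for drift.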
We set about solving the problem once and for all. Our goal for the following sprint became: