Automate all the things with live documents
Documenting is a pain. Unless… our code, documents, and tests are all one-and-the-same. Executable runbooks and doctests!
There’s never enough time to get everything done.
It’s loads of fun to just “code stuff up” and get it done, show off a fancy demo, and hear everyone oooh and ahhh. Or, build out a slick CI/CD pipeline that gives each developer their own ephemeral environment, tests the heck out of new code, blocks all the threats, and then shuts itself down all without incurring any noticeable cost. That gets plenty of oooh’s and ahhh’s too.
But now I have a problem — it only works because I know how to keep it working. So… time to write some good documentation. Get those APIs discoverable. Document all the DevOps magic so someone else can understand what I just built.
Except that there’s always more to do — and if I write all that documentation, then I’ve got to maintain it too. Right?
Everyone hates writing documentation. But, everyone loves it when things “just work” and their job is easy-peasy. So, why not do it right and reap the benefits? We should be writing “living documents” that actually make our jobs easier.
For instance, we can build code-level documentation that implements our tests and produces quality end-user documents.
How about this — all those things your DevOps team has to do manually? What if we could document and automate them, saving time and getting critical knowledge out of one person’s head and into easy-to-use automated runbooks, once and for all?
Let’s take a look at both.
Tests from living documentation
While writing code there’s a constant struggle to keep everything in-sync and current. Does your documentation reflect the current product state? Do your tests actually test all of the features and recent changes in the system? Is the code itself easy to understand as a whole system?
If I fix a bug, do I have to update the tests and the documentation too?
The good news is, there’s an easy way to do it all in one step. The bad news is, at least in my experience, very few teams take advantage of it.
The concept is pretty simple: We need documentation in our code that produces excellent end-user and API documentation, stays in-sync with the latest code changes, and also implements our automated tests. That last piece is the glue that makes it all work. Just by writing good tests, we’re also writing good documentation.
Elixir made this a first-class citizen, building features into the language that turn inline documentation into both end-user docs and feature tests. Other languages have similar features, often from third-party libraries.
Here’s a trivial example: The following code defines a parse(line) function in the KVServer.Command module. The inline documentation explains what the function does, then provides several example usages, both correct and incorrect, along with the expected result of each.
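The sketch below is modeled on the parse/1 example from the official Elixir Mix/OTP guide, trimmed to the six examples described here; the exact details may differ from the article’s original code:

```elixir
defmodule KVServer.Command do
  @doc ~S"""
  Parses the given `line` into a command.

  ## Examples

      iex> KVServer.Command.parse "CREATE shopping\r\n"
      {:ok, {:create, "shopping"}}

      iex> KVServer.Command.parse "PUT shopping milk 1\r\n"
      {:ok, {:put, "shopping", "milk", "1"}}

      iex> KVServer.Command.parse "GET shopping milk\r\n"
      {:ok, {:get, "shopping", "milk"}}

      iex> KVServer.Command.parse "DELETE shopping eggs\r\n"
      {:ok, {:delete, "shopping", "eggs"}}

  Unknown commands, or commands with the wrong number of
  arguments, return an error:

      iex> KVServer.Command.parse "UNKNOWN shopping eggs\r\n"
      {:error, :unknown_command}

      iex> KVServer.Command.parse "GET shopping\r\n"
      {:error, :unknown_command}
  """
  def parse(line) do
    case String.split(line) do
      ["CREATE", bucket] -> {:ok, {:create, bucket}}
      ["GET", bucket, key] -> {:ok, {:get, bucket, key}}
      ["PUT", bucket, key, value] -> {:ok, {:put, bucket, key, value}}
      ["DELETE", bucket, key] -> {:ok, {:delete, bucket, key}}
      _ -> {:error, :unknown_command}
    end
  end
end
```

Note the ~S sigil on the doc string: it keeps \r\n literal so the examples match what callers actually type.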
First off, we get instant, dynamic documentation. As soon as I’ve finished writing the code, I can hover my mouse over the parse(line) function and I get nicely formatted documentation, an instant win for anyone using my function.
Much more interesting, though, is that my unit tests have already been written.
Notice the six examples I’ve included in the documentation — four good use cases, and then two that return an error state. For example, the call to delete an item from the shopping bucket is documented as:
iex> KVServer.Command.parse "DELETE shopping eggs\r\n"
{:ok, {:delete, "shopping", "eggs"}}
That’s my test case. I don’t have to create a new test class, write a separate unit test, and keep my documentation, my unit tests, and my code in sync — because Elixir’s doctest feature will automatically treat the documentation as my test case. The command KVServer.Command.parse("DELETE shopping eggs\r\n") is run by the test harness, and a tuple containing an :ok and a result is expected back.
I can run my doctest suite of tests using Elixir’s build tool with the command mix test. The tests written in documentation run right alongside tests that I write myself in a more traditional way, should I have any.
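The wiring for this is minimal. In the Mix/OTP guide, the test module is essentially a single doctest declaration; the extra hand-written test below is my own illustration of mixing the two styles:

```elixir
defmodule KVServer.CommandTest do
  use ExUnit.Case, async: true

  # This single line turns every iex> example in the @doc into a test case.
  doctest KVServer.Command

  # Traditional, hand-written tests run right alongside the doctests.
  test "a blank line is an unknown command" do
    assert KVServer.Command.parse("\r\n") == {:error, :unknown_command}
  end
end
```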
On the other hand, if a test fails, I’ll get a nicely formatted result telling me exactly what went wrong — again, all built by doctest. For example, if I change "DELETE shopping eggs" to "DELETE shopping x", we get a failed test.
And, thanks to the nicely integrated, first-class support for tests in documentation, I’m given some really good detail on what went wrong.
In this day and age, any language that isn’t embracing tightly integrated documentation and testing as a feature is falling behind. Elixir itself is now 11 years old and is based on Erlang, dating back to 1986. It’s a fast, lightweight language suitable for the modern era in all regards — it’s robust, distributed by nature, “edge compatible,” and has kept up with modern programming demands — including first-class, integrated documentation and testing.
Automating operations
We can apply all the same ideas to DevOps. Every day, our ops team deals with server incidents, downtime, environment configuration, server drift, data or IAM migration, and a host of other frequently menial tasks. Most of these seem to fall into the bucket of tasks that require some very specialized operations knowledge. Too often, that knowledge is locked in one person’s head — which means, if they’re out of office, we’re out of luck.
There are other consequences, too. For instance, when our team grows, new teammates have no choice but to learn through a “trial by fire,” as they pick up knowledge on the job. Customers often suffer when nobody knows how to bring a server back online Sunday evening, when our Ops guru is out to dinner with family and the “off hours” team is frantically trying to figure things out on their own.
Wouldn’t it be great to have all those arcane steps documented? Of course, but who’s keeping those documents up to date?
What if the document were an “executable document” that actually runs the commands, so we use it every day and naturally keep it up to date? With such a tool, just about anyone would be able to follow along and get the server back online.
It starts with runbooks. Runbooks are a collection of documented procedures that explain how to carry out a particular process, be it starting, stopping, debugging, or troubleshooting a particular system. Where it really comes together is with executable runbooks.
Executable runbooks to the rescue
Traditional runbooks usually take the form of a decision tree or detailed step-by-step instructions.
Modern implementations are embracing automation. Along with well-defined processes, operators can execute pre-written code blocks or database queries against a given environment. It’s easier than it sounds, which is why it’s a shame so few teams are doing it. And as the DevOps landscape evolves, we’re seeing lots of nifty solutions. Personally, I love how GitLab has embraced executable runbooks, integrating them directly into the GitLab CI/CD pipeline toolset.
It works, and GitLab eats their own dog food (so to speak). You can check out GitLab’s own runbooks to see how they leverage it to… well, basically run GitLab — supporting their Infrastructure Reliability Engineers and Managers who are starting an on-call shift or responding to an incident.
One approach — the one GitLab uses — is to use Jupyter Notebooks and the Rubix library to write runbooks, documenting processes in a mix of informational instruction and executable code.
AWS has a workshop that demonstrates how you can build an executable incident response runbook, freely mixing documentation and executable code.
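As a rough, hypothetical sketch of the pattern (the function and URL below are my own, not the workshop’s), a runbook cell pairs a markdown explanation with a small diagnostic the responder simply runs:

```python
# Runbook step (illustrative sketch, not the AWS workshop's code):
# the markdown cell above it explains the step; this cell performs it.
from urllib.request import urlopen

def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, DNS failure, timeout, ...
        return False

# A responder runs this against the real service, e.g.:
# check_health("https://service.example.internal/healthz")
```

Each subsequent cell can branch on a result like this one, turning the traditional decision tree into code.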
The interactive Jupyter notebook mixes written documentation with executable code, making it easy for someone to step in and diagnose a problem while keeping those critical operations procedures well documented.
With the Rubix library, we get easy integrations with the command line, CloudWatch, Elastic, databases, and Kubernetes.
You can see when deployments occurred in your cluster by checking replica sets, or check latency and deployment time in CloudWatch.
You can run database queries, manage your Kubernetes cluster, and a lot more. Bottom line — you can quickly document your entire operations workload and provide convenient, point-and-click commands to get the job done, all in one place.
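I haven’t verified Rubix’s exact API, so here is a deliberately library-agnostic sketch of what these notebook cells boil down to: a tiny helper that narrates a step, shells out to whatever CLI the step needs (kubectl, psql, aws, and so on), and echoes the result inline:

```python
import subprocess

def run_step(description: str, command: list[str]) -> str:
    """Narrate one runbook step, run its command, and echo the output.

    check=True makes a failing command raise immediately, so a broken
    step can't be silently skipped over mid-incident.
    """
    print(f"==> {description}")
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    print(result.stdout)
    return result.stdout

# Hypothetical steps a responder might click through (commands are
# assumptions -- adjust to your environment):
# run_step("Recent deployments", ["kubectl", "get", "rs",
#          "--sort-by=.metadata.creationTimestamp"])
# run_step("Open DB sessions", ["psql", "-c", "SELECT * FROM pg_stat_activity;"])
```

The point isn’t the helper itself; it’s that the prose and the commands live in one file that gets exercised, and therefore maintained, every time it’s used.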
Automate the things: Do less, accomplish more
By leveraging living, executable documents we can save a tremendous amount of time. We get a double benefit: instead of writing static documentation, we build our automation and our documentation together — plus, we reap the long-term time savings of an efficient, easy-to-follow process.
This all goes a long way toward improving reliability, too. Humans are terrible at repetitive processes — be it manual testing or remembering which combination of commands restarts a downed server. Creating automation also creates repeatability and reliability.
On top of all this, we’re sharing knowledge, improving our team’s capabilities, and eliminating risky dependencies on a single team member.
It’s a tremendous win from every angle. I hope we’ll see these practices become much more commonplace. It’s no longer leading edge; it’s best practice.