I was watching _K-19: The Widowmaker_ recently. It's one of the best military movies. Starring favorites Harrison Ford and Liam Neeson, _K-19_ provides interesting insights into the operations of a submarine, albeit one from 1961. It also got me thinking about production systems.

Any production system is a living, autonomous creature. There may be other creatures around it, like staging, pre-production, and lab environments, but they are also independent and autonomous. A pipe leaking in one submarine does not necessarily leak in another. Similarly, the world any system inhabits is its own. There may be similarities and uniformity in design and implementation, but operating realities diverge quickly. Every system lives its own life, from its first day of operation to its last. It's important to treat it with due care.

A good principle is to treat every system as a production system, even a "lowly" development environment. The primary reason is that systems diverge. For example, production systems are patched more frequently than development systems, while development systems are upgraded to newer versions of software and hardware more frequently than production systems. Every new piece of code, configuration, and data is deployed first to development, and by the time it reaches production, development has already moved on to something newer. Treating them all with the same level of care and respect allows us to see them as independent. This is a very important concept to understand.

You also need people experienced with operating each system to keep operating it for as long as possible. Constant change of personnel causes loss of expertise, experience, and knowledge, massively increasing risk to systems. These personnel also need to be trusted to provide timely and accurate feedback to all stakeholders about operating conditions. If anything tingles intuition or appears out of the ordinary, even without solid evidence, it must be surfaced, tracked, and resolved with confidence.

Making changes to a production system is always risky. We mitigate the risk by safely making changes to low-impact systems first. We learn from those changes, hoping they prepare us well for the same changes in production and that we will discover everything that can go wrong or succeed. This is similar to a person learning to swim: they start in a shallow pool and eventually swim in the ocean. But since production is unlike every other system, changes applied elsewhere will not necessarily behave the same there. Our focus must be more on being prepared to deal with inadvertent consequences and less on the confidence that there will be no consequences, or only known ones, just as a swimmer must deal with forceful waves in the ocean far more so than in a pool.

Some ways to prepare to deal with consequences are:

- Size of change
- Impact radius
- Document assumptions
- Slow down
- Visibility
- Verification
- Rollback
- Live in reality

Keeping the size of change small is crucial. We must know exactly what is being changed, which lets us better understand the effects it could have. We should also identify the source of the change (external vendor, in-house agent, etc.) so that in the event of unexpected incidents, we know where to turn for help. Think of how the current watch on a submarine asks their superior officer for guidance when something goes wrong. If they had made only a small change, it's easier for that guidance to be correct.
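To make "know exactly what is being changed" concrete, here is a minimal sketch of a change manifest for a package upgrade. The `change_manifest` helper and the package versions are hypothetical, for illustration only; real inputs would come from your package manager.

```python
# Hypothetical sketch: compute exactly which packages a proposed change
# would touch, so the manifest can be reviewed before anything is applied.

def change_manifest(current: dict[str, str], proposed: dict[str, str]) -> dict:
    """Return only the differences between the current and proposed package sets."""
    return {
        "upgraded": {name: (current[name], version)
                     for name, version in proposed.items()
                     if name in current and current[name] != version},
        "added": {name: version for name, version in proposed.items()
                  if name not in current},
        "removed": {name: version for name, version in current.items()
                    if name not in proposed},
    }

# Illustrative package sets, not real hosts.
current = {"openssh-server": "9.6p1", "openssl": "3.0.13"}
proposed = {"openssh-server": "9.7p1", "openssl": "3.0.13"}

print(change_manifest(current, proposed))
# {'upgraded': {'openssh-server': ('9.6p1', '9.7p1')}, 'added': {}, 'removed': {}}
```

A manifest with one entry is easy to reason about; one with a hundred entries is a sign the change should be split.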
Impact radius means understanding the graph of dependencies. Once we know what is being changed, tracing its dependencies gives us more information on what is affected. Just as a code reviewer must look not only at the diff but also at the surrounding code, impact radius must consider the entire surface before narrowing down.

We should list our assumptions about the part of the system being changed and about the change being applied. For example, if you were patching sshd on a Linux host, what are some assumptions you could make? One is that we have another way of accessing the system if something goes wrong, such as an out-of-band access mechanism. Chase down that assumption and make sure it's working properly. Another is that the patch will not introduce new features or deprecate old ones. Ensure that's the case by reading the release notes. Document all assumptions and their substantiating evidence.

Slowing down when planning the change helps in identifying assumptions and their consequences. Think deeper and think outside the box. Listen twice as much as you speak.

Visibility means having continuous, deep knowledge of the behavior of the system, the change being applied, and all assumptions. It can be provided by logs, metrics, traces, synthetic tests, probes from different perspectives, and so on. A submarine has gauges and other means of visibility into its various systems.

Verifying that the change succeeded is of utmost importance. Equally important is verifying that no undesirable side effects occurred. Before applying the change, we must know what success looks like and how failure is defined.

Be prepared to roll back the change. This includes defining the impact of the rollback, documenting its assumptions, and verifying that the rollback succeeded. In other words, the rollback (or anti-change) must follow the exact same steps as the change. A sketch of what this verify-and-roll-back loop could look like closes out this post.

We must solve what's of immediate concern. The mission is important, but to complete it we have to deal with realities. Realities take priority over any prepared plans, processes, and procedures. Live the reality the system is in. We can wish that reality were different, more conducive to our goals and requirements, but that will not always be true. It's important to recognize this and react accordingly.

There will be times, hopefully far fewer, when things will not work out well. These can be mitigated with enough preparation and the right thinking. In _K-19_ we see many unfortunate incidents, sacrifices, and heroics. Thankfully, most of us don't deal with such harsh realities when working with information systems. But that doesn't mean we can't learn from them.
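As promised, here is a minimal sketch of the verify-and-roll-back discipline, using the sshd patch as the running example. Everything here is an assumption about one way it could be scripted: the service unit name varies by distribution (`ssh` on Debian/Ubuntu, `sshd` on RHEL), the downgrade command depends on your package manager, and a real runbook would check far more than two things.

```python
# Hypothetical sketch: success criteria defined before the change, reused by
# the rollback. Commands and service names vary by distribution.
import subprocess

def ok(cmd: list[str]) -> bool:
    """Run a command; True only if it exists and exits 0."""
    try:
        return subprocess.run(cmd, capture_output=True).returncode == 0
    except FileNotFoundError:
        return False

def verify() -> bool:
    checks = [
        ["sshd", "-t"],                      # config parses cleanly
        ["systemctl", "is-active", "sshd"],  # daemon is running
    ]
    return all(ok(check) for check in checks)

def rollback() -> None:
    # The anti-change mirrors the change: reinstall the previous package,
    # restart, and verify again with the same criteria.
    ok(["dnf", "downgrade", "-y", "openssh-server"])
    ok(["systemctl", "restart", "sshd"])
    assert verify(), "rollback failed: fall back to out-of-band access"

if not verify():
    rollback()
```

Note that the rollback ends with the same `verify()` the change used, and that its failure path is the out-of-band access assumption documented earlier. The anti-change gets no less rigor than the change.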