I was watching _K-19: The Widowmaker_ recently. It's one of the best military movies. Starring favorites Harrison Ford and Liam Neeson, _K-19_ provides interesting insights into the operations of a submarine, albeit one from 1961. It also got me thinking about production systems.

Any production system is a living, autonomous creature. There may be other creatures around it, like staging, pre-production, and lab environments, but they are also independent and autonomous. A pipe leaking in one submarine does not necessarily leak in another. Similarly, the world any system inhabits is its own. There may be similarities and uniformity in design and implementation, but operating realities diverge quickly. Every system lives its own life, from its first day of operation to its last. It's important to treat it with due care.

A good principle is to treat every system as a production system, even a "lowly" development environment. The primary reason is that systems diverge. For example, production systems are patched more frequently than development systems, while development systems are upgraded to newer versions of software and hardware more frequently than production systems. Every new piece of code, configuration, and data is deployed first to development, and by the time it reaches production, development has already moved on to something newer. Treating them all with the same level of care and respect allows us to see them as independent. This is a very important concept to understand.

You also need people experienced with operating each system to keep operating it for as long as possible. Constant change of personnel causes loss of expertise, experience, and knowledge, massively increasing risk to systems. These personnel also need to be trusted to provide timely and accurate feedback to all stakeholders about operating conditions. If anything tingles intuition or appears out of the ordinary, even without solid evidence, it must be surfaced, tracked, and resolved with confidence.

Making changes to a production system is always risky. We mitigate the risk by safely making changes to low-impact systems first. We learn from those changes, hoping they prepare us well for the same changes in production and that we will discover everything that can go wrong or succeed. This is similar to a person learning to swim: they start in a shallow pool and eventually swim in the ocean. But since production is unlike every other system, changes applied elsewhere will not necessarily behave the same there. Our focus must be more on being prepared to deal with inadvertent consequences and less on the confidence that there will be no consequences, or only known ones, just as a swimmer must deal with forceful waves in the ocean far more so than in a pool.

Some ways to prepare to deal with consequences are:

- Size of change
- Impact radius
- Document assumptions
- Slow down
- Visibility
- Verification
- Rollback
- Live in reality

Keeping the size of change small is crucial. We must know exactly what is being changed, which lets us better understand the effects it could have. We should also identify the source of the change (external vendor, in-house agent, etc.) so that in the event of unexpected incidents, we know where to turn for help. Think of how the current watch on a submarine asks their superior officer for guidance when something goes wrong. If they had made only a small change, it's easier for that guidance to be correct.
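To make "know exactly what is being changed" concrete, here is a minimal sketch of a change manifest for a package upgrade. The `change_manifest` helper and the package versions are hypothetical, for illustration only; real inputs would come from your package manager.

```python
# Hypothetical sketch: compute exactly which packages a proposed change
# would touch, so the manifest can be reviewed before anything is applied.

def change_manifest(current: dict[str, str], proposed: dict[str, str]) -> dict:
    """Return only the differences between the current and proposed package sets."""
    return {
        "upgraded": {name: (current[name], version)
                     for name, version in proposed.items()
                     if name in current and current[name] != version},
        "added": {name: version for name, version in proposed.items()
                  if name not in current},
        "removed": {name: version for name, version in current.items()
                    if name not in proposed},
    }

# Illustrative package sets, not real hosts.
current = {"openssh-server": "9.6p1", "openssl": "3.0.13"}
proposed = {"openssh-server": "9.7p1", "openssl": "3.0.13"}

print(change_manifest(current, proposed))
# {'upgraded': {'openssh-server': ('9.6p1', '9.7p1')}, 'added': {}, 'removed': {}}
```

A manifest with one entry is easy to reason about; one with a hundred entries is a sign the change should be split.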
Impact radius means understanding the graph of dependencies. Once we know what is being changed, tracing its dependencies gives us more information on what is affected. Just as a code reviewer must look not only at the diff but also at the surrounding code, impact radius must consider the entire surface before narrowing down.

We should list our assumptions about the part of the system being changed and about the change being applied. For example, if you were patching sshd on a Linux host, what are some assumptions you could make? One is that we have another way of accessing the system if something goes wrong, such as an out-of-band access mechanism. Chase down that assumption and make sure it's working properly. Another is that the patch will not introduce new features or deprecate old ones. Ensure that's the case by reading the release notes. Document all assumptions and their substantiating evidence.

Slowing down when planning the change helps in identifying assumptions and their consequences. Think deeper and think outside the box. Listen twice as much as you speak.

Visibility means having continuous, deep knowledge of the behavior of the system, the change being applied, and all assumptions. It can be provided by logs, metrics, traces, synthetic tests, probes from different perspectives, and so on. A submarine has gauges and other means of visibility into its various systems.

Verifying that the change succeeded is of utmost importance. Equally important is verifying that no undesirable side effects occurred. Before applying the change, we must know what success looks like and how failure is defined.

Be prepared to roll back the change. This includes defining the impact of the rollback, documenting its assumptions, and verifying that the rollback succeeded. In other words, the rollback (or anti-change) must follow the exact same steps as the change. A sketch of what this verify-and-roll-back loop could look like closes out this post.

We must solve what's of immediate concern. The mission is important, but to complete it we have to deal with realities. Realities take priority over any prepared plans, processes, and procedures. Live the reality the system is in. We can wish that reality were different, more conducive to our goals and requirements, but that will not always be true. It's important to recognize this and react accordingly.

There will be times, hopefully far fewer, when things will not work out well. These can be mitigated with enough preparation and the right thinking. In _K-19_ we see many unfortunate incidents, sacrifices, and heroics. Thankfully, most of us don't deal with such harsh realities when working with information systems. But that doesn't mean we can't learn from them.
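As promised, here is a minimal sketch of the verify-and-roll-back discipline, using the sshd patch as the running example. Everything here is an assumption about one way it could be scripted: the service unit name varies by distribution (`ssh` on Debian/Ubuntu, `sshd` on RHEL), the downgrade command depends on your package manager, and a real runbook would check far more than two things.

```python
# Hypothetical sketch: success criteria defined before the change, reused by
# the rollback. Commands and service names vary by distribution.
import subprocess

def ok(cmd: list[str]) -> bool:
    """Run a command; True only if it exists and exits 0."""
    try:
        return subprocess.run(cmd, capture_output=True).returncode == 0
    except FileNotFoundError:
        return False

def verify() -> bool:
    checks = [
        ["sshd", "-t"],                      # config parses cleanly
        ["systemctl", "is-active", "sshd"],  # daemon is running
    ]
    return all(ok(check) for check in checks)

def rollback() -> None:
    # The anti-change mirrors the change: reinstall the previous package,
    # restart, and verify again with the same criteria.
    ok(["dnf", "downgrade", "-y", "openssh-server"])
    ok(["systemctl", "restart", "sshd"])
    assert verify(), "rollback failed: fall back to out-of-band access"

if not verify():
    rollback()
```

Note that the rollback ends with the same `verify()` the change used, and that its failure path is the out-of-band access assumption documented earlier. The anti-change gets no less rigor than the change.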