Chaos Engineering: Learning from Organized Chaos
December 05, 2019
Developing custom software can be a challenging task, even more so if it means going back to the drawing board to correct failures and issues. As far as software development goes, testing is a must.
Mistakes happen. Developers program machines, but they aren’t machines themselves. But what happens when a little mistake turns out to be the Godzilla of slips? The only solution is an all-out assault on the problem to reverse the situation. If you’re wondering how you can minimize the blast area when a similar issue goes undetected, this is where Chaos Engineering comes into the picture.
Chaos Engineering was first introduced by Netflix (yep, you read that right), one of the largest subscription streaming services today. It is a method for deliberately injuring a system. The approach took shape after a major database incident in 2008 caused a three-day crisis that prevented Netflix from shipping DVDs. In response, the company began migrating its monolithic on-premises stack to a cloud-based architecture on AWS to reduce the risk of future meltdowns. By 2011, Netflix had created Chaos Monkey, a tool that randomly causes failures in different parts of the system. This in turn allowed developers and engineers to quickly understand the main failures, how to reproduce them and, more importantly, how to build better and tougher software that could brush off a similar problem.
Chaos engineering works a lot like a flu shot. You may think it’s bonkers to deliberately inject something harmful in order to prevent damage, yet the method works, on people and on cloud-based systems alike. Chaos engineering disrupts the system with surgical precision so that it can be tested for weaknesses by analyzing how it deals with the infection. This, in turn, helps companies prepare for problems that could otherwise paralyze the business. Many kinds of failure are possible: application, network, infrastructure, and dependency failures, and the list goes on. After all, if your system goes haywire due to a lack of testing, you can go from easy peasy lemon squeezy to stressed depressed lemon zest in no time.
It goes without saying that chaos engineering is a carefully planned experiment. The goal isn’t purely to test the system and verify its weaknesses. As stated by Casey Rosenthal, former engineering manager on Netflix’s Chaos Team, current software systems are too complex to be completely understood. So experimenting isn’t just a way to test but rather an approach that allows engineers to generate new insights and gain valuable knowledge.
As a way to better understand and discover issues, chaos engineering follows four principles that can be defined by key steps:
Steady State: You must define and measure your system’s steady state. The goal here is to know whether the system is performing as it should. These metrics will tell you whether there is anything crucial to tackle and whether there are any major flaws. What would happen at this moment if your system failed?
Developing a hypothesis: To run an experiment, a hypothesis is needed. After all, you’re testing to determine whether the outcome you expect really does happen. In other words, will X equal Y or W? Remember, you’re testing your Steady State.
What could happen in the real world: This is a simple step. The objective is to reproduce scenarios that can disrupt your system, common events that can happen at any time, such as a database or virtual machine crash. Take a good look at your system, determine its weaknesses and ask yourself: if something went wrong, what would you do, and what would the immediate steps be?
Proving or disproving the hypothesis: This step focuses on comparing the steady-state metrics with those taken after the disturbance was added to the system. What you’re looking for is a difference between the two sets of measurements. If you find one, the hypothesis is disproved and the experiment has done its job by exposing a weakness. The next step is toughening up your system to avoid that issue in the future.
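To make the four steps more concrete, here is a minimal Python sketch of a single experiment. Everything in it is hypothetical and invented for illustration: ToyService is a stand-in for a real system with three replicas, kill_one_replica is the simulated real-world event, and the 1% tolerance is an arbitrary choice. It is not how Chaos Monkey or any particular tool works, only one way the loop could look.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Hypothetical stand-in for a real system: three replicas behind a balancer."""
    replicas: list = field(default_factory=lambda: [True, True, True])  # True = healthy
    retries: int = 0  # extra replicas a request may try if the first one is down

    def handle_request(self, rng):
        # Pick distinct replicas in random order; the request succeeds
        # if any replica it is allowed to try is healthy.
        attempts = min(1 + self.retries, len(self.replicas))
        order = rng.sample(range(len(self.replicas)), k=attempts)
        return any(self.replicas[i] for i in order)


def success_rate(service, rng, requests=1000):
    """The steady-state metric for this sketch: fraction of successful requests."""
    ok = sum(service.handle_request(rng) for _ in range(requests))
    return ok / requests


def kill_one_replica(service):
    """Simulated real-world event: one replica crashes."""
    service.replicas[0] = False


def run_experiment(service, inject_failure, tolerance=0.01, seed=42):
    rng = random.Random(seed)
    snapshot = list(service.replicas)
    # Step 1: measure the steady state.
    baseline = success_rate(service, rng)
    # Step 2: the hypothesis is that the success rate stays
    # within `tolerance` of the baseline after the disturbance.
    try:
        # Step 3: introduce the real-world event.
        inject_failure(service)
        observed = success_rate(service, rng)
    finally:
        # Always undo the damage so the blast area stays small.
        service.replicas = snapshot
    # Step 4: compare the measurements to prove or disprove the hypothesis.
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_holds": (baseline - observed) <= tolerance,
    }


# One service without retries and one that retries on another replica.
for retries in (0, 1):
    result = run_experiment(ToyService(retries=retries), kill_one_replica)
    print(f"retries={retries}: {result}")
```

In this toy setup the service without retries sends some requests to the dead replica, so the measurements differ and the hypothesis is disproved. Allowing one retry is the "toughening up" that makes the same experiment pass.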
DevOps revolves around continuous improvement, continuous delivery and constant releases. The chaos principles have become a great way to test for system failures and uncover potential flaws, making them a go-to testing choice in DevOps environments. Adding continuous chaos to the DevOps culture is all about embracing preparation and prevention, leading to more efficient and stronger applications.
At the end of the day, chaos engineering is a modern software development method that works towards uncovering needed improvements while, at the same time, gaining important knowledge that can be applied in the future. It’s all about discovering the “what-if” scenario.