Why Monitoring is being supplemented by Observability
The classic methods of monitoring computing platforms created a vast array of gauges and graphs that displayed how critical system parameters changed over time.
The thinking has always been that if you can measure key performance and capacity parameters, you can build up a picture of system behavior from which performance issues can be calculated and predicted.
The challenge is that knowing how each individual system is performing is no longer a viable way to measure the performance that users actually experience. As systems grow in complexity, not just in raw capacity and throughput but also in intelligence, they dynamically create the most effective pathways through which user requests are serviced. Machine learning and AI have made it even harder for operators to see how business flows move, because as the systems learn and improve, they make use of interconnections that operators had not predicted.
So when an issue occurs, discovering the root cause is more complex than ever. Today, for the first time in the history of computing, we are seeing the time to repair problems actually getting longer.
Ask your CIO about war room scenarios and watch the hairs on the back of their neck rise as they think of the painful, daylong conference calls they now experience with increasing frequency. These multi-department meetings consist of experts from every team spending hours explaining why the issue at hand is not actually their problem, and they progress from formal presentations to stressed senior executives making it clear that they don't care who caused the problem; they just need it resolved in a timely manner.
The dashboards of gauges and charts that show how each system is performing do not make this kind of problem go away. What is needed is a trace of the flow of the user request through every system, overlaid with relevant performance data and compared to traces of known successful flows, so that any deviation can be immediately identified.
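As an illustration, such a trace can be thought of as an ordered set of spans, one per system the request passes through, each carrying the performance data for that hop. The sketch below is a minimal, hypothetical model; the class names, services, and latency figures are assumptions for illustration, not a specific product API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Span:
    """One hop of a user request through a single system."""
    service: str        # e.g. "api-gateway", "order-queue", "payments-db"
    operation: str      # what the system did for this request
    latency_ms: float   # performance data overlaid on the hop
    error: bool = False

@dataclass
class Trace:
    """The end-to-end flow of one user request, in the order it was serviced."""
    trace_id: str
    spans: List[Span]

    def total_latency_ms(self) -> float:
        return sum(s.latency_ms for s in self.spans)

# A hypothetical order-placement flow captured as a trace.
order_flow = Trace(
    trace_id="req-42",
    spans=[
        Span("api-gateway", "POST /orders", 12.0),
        Span("order-service", "validate order", 35.0),
        Span("order-queue", "enqueue message", 8.0),
        Span("payment-service", "authorize card", 120.0),
        Span("inventory-db", "reserve stock", 45.0),
    ],
)
print(f"{order_flow.trace_id}: {order_flow.total_latency_ms():.0f} ms end to end")
```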
Such a trace will often show that there is no single point of failure, but instead many points that are each running a little hotter (or colder) than usual, and together they lead to an overall missed service level.
Being able to quickly view a trace of a business flow (or transaction) and compare it to similar historical flows (tracking) is critical if you want to eliminate war rooms and reduce the effort of solving complex issues.
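To make the "many points running a little hotter" idea concrete, the sketch below compares the per-hop latencies of a current flow against a historical baseline for the same flow. The hop names, numbers, threshold, and service level are illustrative assumptions, not measurements from any specific tool.

```python
# Per-hop latencies (ms) for the same business flow: a historical baseline
# built from known-good runs, and the flow being investigated now.
baseline_ms = {"api-gateway": 12, "order-service": 35, "order-queue": 8,
               "payment-service": 120, "inventory-db": 45}
current_ms  = {"api-gateway": 13, "order-service": 43, "order-queue": 9,
               "payment-service": 148, "inventory-db": 56}

SLA_MS = 250          # assumed end-to-end service level for this flow
HOT_THRESHOLD = 1.20  # flag a hop running 20% or more over its baseline

# No single hop has failed, but several are modestly hot.
for hop, base in baseline_ms.items():
    now = current_ms[hop]
    ratio = now / base
    flag = "HOT" if ratio >= HOT_THRESHOLD else "ok"
    print(f"{hop:16s} {base:4d} -> {now:4d} ms  ({ratio:.2f}x)  {flag}")

# The small per-hop increases add up to a missed service level.
total_base, total_now = sum(baseline_ms.values()), sum(current_ms.values())
print(f"end-to-end: {total_base} -> {total_now} ms (SLA {SLA_MS} ms)",
      "MISSED" if total_now > SLA_MS else "met")
```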
Being able to “observe” how users experience your entire business application isn’t a replacement for monitoring, but it builds context and an abstracted view of performance that takes operations to the next level.
Nastel capitalizes on our understanding of messaging middleware to abstract an additional dimension of knowledge about your application stack, and we combine this with our ability to monitor every type of machine data to deliver observability. Reusing the brainpower you have already invested in configuring how your application stack connects internally can save you significant future investment and provide a much better operational process to maximize performance, user experience, and availability.