Performance Monitoring in the Cloud Age
Business applications have continued to become more complex, and the tools we all rely on to monitor performance have not always kept up.
The basic model that APM (application performance monitoring) relies on starts by capturing performance metrics and then uses the knowledge developers and administrators have about how these metrics can be analyzed to describe the performance of specific business processes. This model works best when large numbers of business transactions follow exactly the same process. But with every level of automated decision making added to the process, many more pathways become possible.
This makes it harder to compare transactions against each other, so critical situations can be missed and non-critical situations can be incorrectly flagged as critical, creating an ever-increasing volume of false alerts. In turn, many more situations have to be analyzed, and you quickly reach a point where it becomes impractical to go through every potential use case and build a model for it. And of course, every model you do create has to be continually retested against every change in your environment.
Today’s APM-based performance monitoring systems suffer from what I like to call “complexity fatigue,” where the sheer complexity of the task outweighs the benefit.
A number of approaches have been proposed to help eliminate this complexity fatigue. One is to use machine learning to automate the identification of new pathways, the collection of the appropriate metrics, and the creation of algorithms that look for signals in the data indicating potential issues. But all of these models continue to suffer from the same underlying issues: the complexity of the environments still makes it too hard to spot real events in the sea of false positives, and without human-driven intelligence to identify fundamental business needs, the AI-driven process cannot derive business context. AIOps (as these processes have been badged by the marketeers) is definitely a powerful way of thinking, but it may not be enough to break the cycle of complexity fatigue.
There is another technique that has been helping many enterprises solve the complexity fatigue issue: use the configuration knowledge already embedded within the messaging middleware layer of the application stack to provide an abstracted view of each business flow (or transaction). Messages flowing between the subcomponents of the application stack already contain the knowledge needed to visualize how each user’s request is processed. If you overlay the performance metrics of each system that a transaction uses onto this visual map of the transaction, you get a map that allows each transaction, and its performance, to be compared with the historical record of similar transactions. With this model it doesn’t matter if each transaction took a different logical path, because every pathway can be compared without the need to generate new logic.
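The idea of comparing each transaction against history for its own pathway can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's implementation: the hop format (component name plus latency in milliseconds), the component names, and the PathwayHistory class are all hypothetical, and in practice the hops would be reconstructed from message traces in the middleware layer.

```python
from collections import defaultdict
from statistics import mean


def path_key(hops):
    """Identify a pathway by the ordered components a transaction visited."""
    return tuple(name for name, _ in hops)


class PathwayHistory:
    """Hypothetical store of per-hop latencies, grouped by pathway."""

    def __init__(self):
        # pathway -> list of past transactions, each a list of hop latencies
        self._history = defaultdict(list)

    def record(self, hops):
        """Add a completed transaction to the history for its pathway."""
        self._history[path_key(hops)].append([ms for _, ms in hops])

    def baseline(self, hops):
        """Mean latency per hop over past transactions on the same pathway."""
        past = self._history.get(path_key(hops))
        if not past:
            return None  # pathway never seen before
        return [mean(column) for column in zip(*past)]

    def compare(self, hops):
        """Return (component, current_ms - baseline_ms) for every hop."""
        base = self.baseline(hops)
        if base is None:
            return None
        return [(name, ms - b) for (name, ms), b in zip(hops, base)]


# Assumed example data: two pathways through made-up components.
history = PathwayHistory()
history.record([("gateway", 10.0), ("orders", 40.0), ("billing", 30.0)])
history.record([("gateway", 12.0), ("orders", 44.0), ("billing", 30.0)])
history.record([("gateway", 11.0), ("inventory", 20.0)])

deltas = history.compare(
    [("gateway", 11.0), ("orders", 90.0), ("billing", 30.0)]
)
```

Because the pathway itself is the grouping key, a transaction that takes a new route simply starts its own history; no per-route comparison logic has to be written by hand.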
Using an abstracted view of a business through messaging middleware intelligence also makes it simpler to apply machine learning to compare what is happening in real time with the historical record of previous transactions. Subtle variances in performance can then be used to quickly identify potential events before a user would notice them, in effect predicting performance issues early enough that automated changes can be implemented to avoid any impact on users.
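As a rough illustration of comparing a live measurement against the historical record, the sketch below uses a plain z-score check rather than a trained machine learning model. The function name, the threshold of three standard deviations, and the minimum sample count are assumptions chosen for the example, not values taken from any product.

```python
from statistics import mean, stdev


def is_anomalous(history_ms, current_ms, threshold=3.0, min_samples=5):
    """Flag a latency that deviates strongly from the pathway's history.

    history_ms: past latencies for the same pathway (assumed input).
    threshold:  number of standard deviations treated as unusual (assumed).
    """
    if len(history_ms) < min_samples:
        return False  # too little history to judge either way
    mu = mean(history_ms)
    sigma = stdev(history_ms)
    if sigma == 0:
        # Perfectly flat history: any change at all is a deviation.
        return current_ms != mu
    return abs(current_ms - mu) / sigma > threshold


# Assumed example data: past latencies in milliseconds for one pathway.
past = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0]
normal = is_anomalous(past, 101.0)
slow = is_anomalous(past, 110.0)
```

A real deployment would likely use a richer model and a rolling window, but the shape is the same: the history of a pathway defines what normal looks like, and the live value is scored against it before users feel the difference.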
Most companies use a number of different messaging middleware platforms, such as IBM’s MQ, Tibco’s EMS and Kafka, along with systems such as RabbitMQ or ActiveMQ. And these may run on a number of different cloud platforms as well as legacy datacenters and even mainframe environments. Any solution that uses messaging middleware as a source of intelligence should be able to run in all of these environments, especially if the goal is to reduce complexity fatigue.
There is only one company that provides such a system, and that is Nastel Technologies. It has taken 25 years of investment to get to the point where this system is a practical way to reduce complexity fatigue. The result is a system that can dramatically improve delivery times for new applications and changes to existing applications, and eliminate war rooms and performance events. Almost all of the Fortune 100 are now using Nastel Technologies solutions in their enterprises.
APM is not dead, AIOps is still interesting, but if your goal is to provide a better business outcome in the most efficient and effective manner, you should look at Nastel Technologies as part of your performance monitoring process.