Is Observability the New Monitoring?
What is observability, and how does it differ from monitoring? Since rejoining Nastel after a long period away in the cloud, I’ve been wondering why people now talk about ‘observability’ when they used to talk about ‘monitoring’. Why is there a new term, a new piece of technical jargon? I’ve been reading up on it, and it seems that each vendor describes it slightly differently and, amazingly, that difference is always the key feature that sets their product apart from everyone else’s. Some talk about Java bytecode instrumentation; some talk about the complete absence of instrumentation, inferring things from the outside; some talk about Artificial Intelligence, some about DevOps. Some think observability means seeing the end-to-end application, whereas monitoring means seeing the low-level IT.
Why is it not just ‘monitorability’? Why not “application monitorability” or “CPU observability”? To find the “true” answer, I went back to the Latin etymology:
“Monitor” comes from the Latin “monere”, meaning “to warn, remind or advise”.
“Observability” comes from the Latin “observare”, meaning “to watch over, attend to, guard”, from ob “in front of, before” + servare “to watch, keep safe”.
This highlights a significant distinction. Monitoring is about telling you when something has gone wrong. Observability is about making sure that something is working. That distinction is fundamental to the paradigm shift from operations monitoring to Site Reliability Engineering (SRE).
Consider what happens when your house catches fire. At some point the fire gets so bad that your neighbours can see the flames, so they call the emergency services. A switchboard operator passes the call to the fire station, which sends out a fire engine. The crew come as quickly as they can and put out the fire. So monitoring and alerting have handled and stopped the problem, but you’ve still lost half your house and all your photo albums, maybe someone was hurt, and it cost a lot of money.
Alternatively, with SRE, you constantly “watch over” the house to be “in front of” a problem “before” it happens. You do regular maintenance, have smoke detectors, heat detectors, and sprinklers. You watch the whole house and ensure that it is fine all the time so that it never gets to the point of catching fire.
The term “observability” comes from control theory, where it is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. And yet some people say that observability is a culture of instrumenting systems and applications to collect metrics and logs. Well, yes: the more key indicators you expose, the more completely you can observe the state of the system as a whole. But strictly speaking, observability doesn’t require instrumentation. You should be able to view the external outputs of the system and infer potential issues from them. There are key indicators to keep an eye on:
Latency – time to service a request
Traffic – demand placed on the system
Error rate – rate of failed requests
Saturation – which resources are most constrained
Watching the current state of these indicators, and how they change over time, will enable you to maintain the availability and reliability of the system rather than relying on alerts after a failure has happened.
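To make the four indicators concrete, here is a minimal Python sketch that summarises one observation window of requests. It is only an illustration: the Request record, the golden_signals function, and the resource_utilisation input are hypothetical names invented for this example, and the choice of the 95th percentile for latency is an assumption rather than anything prescribed above.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One serviced request (hypothetical record shape for this sketch)."""
    duration_ms: float  # time taken to service the request
    ok: bool            # False if the request failed


def golden_signals(requests, window_seconds, resource_utilisation):
    """Summarise one observation window into the four indicators.

    resource_utilisation maps a resource name to a 0..1 utilisation
    figure; where those figures come from is outside this sketch.
    """
    if not requests:
        raise ValueError("no requests observed in this window")

    # Latency: nearest-rank 95th percentile of request durations.
    durations = sorted(r.duration_ms for r in requests)
    rank = max(1, math.ceil(0.95 * len(durations)))
    latency_p95_ms = durations[rank - 1]

    # Traffic: requests per second over the window.
    traffic_rps = len(requests) / window_seconds

    # Error rate: fraction of requests that failed.
    error_rate = sum(1 for r in requests if not r.ok) / len(requests)

    # Saturation: the single most constrained resource.
    saturation = max(resource_utilisation.items(), key=lambda kv: kv[1])

    return {
        "latency_p95_ms": latency_p95_ms,
        "traffic_rps": traffic_rps,
        "error_rate": error_rate,
        "saturation": saturation,
    }


# Ten requests taking 10..100 ms, one of which failed, seen over 5 seconds.
window = [Request(duration_ms=10.0 * i, ok=(i != 7)) for i in range(1, 11)]
signals = golden_signals(window, 5.0, {"cpu": 0.4, "disk": 0.9})
print(signals)
```

Computing these figures for every window and keeping the history is what lets you see how they change over time, instead of only learning about a problem once an alert fires.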
Interestingly, over the last 25 years, Nastel’s product set has evolved as this approach to application and middleware availability has changed. We began with MQControl (now Navigator), which responded to events: it could raise an alert if a queue was full or a channel was down. Then we released AutoPilot, which continuously watched individual metrics such as current queue depth and did statistical analysis of trends to see whether a problem was coming. More recently, Nastel developed XRay. Nastel XRay dynamically builds a holistic view of the entire system: it captures data as the components of a transaction pass through different servers and middleware, stores it in a time series database cluster, and runs analytics in real time to identify trends so that issues can be fixed before they occur. It gives “observability”.
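The difference between reacting to an event and watching a trend can be sketched in a few lines of Python. This is a generic illustration only, not how any Nastel product is implemented: is_full and samples_until_full are hypothetical helpers, and the straight-line (least-squares) projection is an assumed, deliberately simple stand-in for real statistical analysis.

```python
from typing import Optional, Sequence


def is_full(depth: float, max_depth: float) -> bool:
    """Event-style check: true only once the queue is already full."""
    return depth >= max_depth


def samples_until_full(depths: Sequence[float], max_depth: float) -> Optional[float]:
    """Trend-style check on equally spaced queue depth samples.

    Fits a least-squares line through the samples and returns how many
    further sampling intervals pass before that line reaches max_depth,
    or None if the depth is not growing.
    """
    n = len(depths)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(depths) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(depths))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    if slope <= 0:
        return None  # flat or draining: no projected problem
    intercept = mean_y - slope * mean_x
    x_full = (max_depth - intercept) / slope  # sample index where the line hits the limit
    return max(0.0, x_full - (n - 1))         # measured from the latest sample


# A queue with room for 1000 messages, growing by 100 per sample.
history = [100, 200, 300, 400]
alert_now = is_full(history[-1], 1000)
remaining = samples_until_full(history, 1000)
print(alert_now, remaining)
```

The event check stays silent until the queue is full, whereas the trend check reports in advance that, at the current rate, only a handful of sampling intervals remain.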
So, does observability replace monitoring? Does XRay replace AutoPilot and Navigator? The answer is no. They’re complementary. Events, polling, and end-to-end tracking are all needed to identify trends and potential issues in the overall business application, then drill down to the technical root cause and fix it. This is why Nastel has now released Navigator X, a combined solution including the features of Navigator, AutoPilot, and XRay.