These days I hear the terms Observability, Monitoring, and APM, or Application Performance Management,
thrown around seemingly interchangeably, but these terms actually mean quite different things.
So let's dive in head first and look at an example of how exactly they differ. To start,
let's take a Java EE application, kind of old school, going back maybe a decade. And let's say
we've got some components in this Java EE app that actually power it. Something important to
remember here: although we might be using a SOA, or service oriented architecture, this is not
exactly microservices, so the components are not communicating over REST APIs. That gives you
some inherent advantages. For example, you can take advantage of the Java EE framework to output
log files, which will probably all end up in the same directory with timestamps that match up,
so things are good.
In addition, you could take advantage of something like an APM solution, which is kind of a
one-size-fits-all, set-and-forget tool: you install it and it gathers rich analytics, data, and
metrics about the running services within the application. So essentially what we've done is
make our system observable, so that our Ops teams were able to look into it, identify problems,
and figure out whether anything needed to be done. For the business objectives back then, this
was essentially good enough, but it tends to fall apart very quickly when you start to move to a
more cloud-native approach, where you have multiple runtimes and multiple layers to the
architecture. So let's say we have an example app here.
We'll start with Node as the front end. Let's say we also have a Java backend application, and
finally a Python app that's doing some data processing. Now let's see how these things work with
each other: the front-end app probably talks to the Java app, and also to the Python app for some
data processing. The Java app probably communicates with a database, and the Python app probably
talks to the Java app for CRUD operations. So this is my quick sketch, a dummy layout for a
microservices-based application. You can take it a step further and say that this is all running
within Kubernetes, so we've got these container-based applications running in a cluster.
So immediately, the first problem I can see here is that with multiple runtimes we now have to
think about multiple different agents, or multiple ways to collect data. Instead of just one APM
tool, we might have to start pulling in several, so how would we consolidate all that data?
That's a challenge. In addition, let's think about things like logging. Each of these runtimes is
probably outputting logs in a different place, and we have to figure out how to consolidate all
of those. Maybe we use a log streaming service.
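One common way to make that consolidation easier, sketched below under the assumption that each
service can write JSON to stdout, is to emit structured log lines with consistent fields and
timestamps so whatever streaming service picks them up can correlate them. The service name here
is made up.

```python
# Minimal sketch: emit structured JSON logs with consistent fields so a log
# streaming service can consolidate output from different runtimes.
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),  # one timestamp format everywhere
            "service": "node-frontend",   # hypothetical service name, for illustration only
            "level": record.levelname,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("frontend")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("handled /checkout request in 42 ms")
```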
Regardless, you can see the complexity starts to grow. And finally, as you add more services and
microservices components to this architecture, say a user comes in to try to access one of these
services and they run into an error: you need to trace that request through the multiple
services. Unless you have the right infrastructure in place, something like trace headers on
requests, and maybe a way to handle WebSockets, things are going to start to get messy, and you
can see how the technical complexity grows quite large.
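As a rough illustration of what "headers on requests" means here, the sketch below, with made-up
service and endpoint names, shows a Python service reusing and forwarding a trace ID header so
one user request can be followed across services. A real setup would typically use a standard
such as the W3C traceparent header rather than this simplified X-Trace-Id.

```python
# Minimal sketch: forward a trace ID header so one request can be followed
# across the Node, Java, and Python services. Endpoint names are made up.
import uuid
from flask import Flask, request
import requests

app = Flask(__name__)

@app.route("/process")
def process():
    # Reuse the caller's trace ID if present, otherwise start a new trace.
    trace_id = request.headers.get("X-Trace-Id", str(uuid.uuid4()))

    # Forward the same ID on the outgoing call so the downstream service's
    # logs can be correlated with this request.
    resp = requests.get(
        "http://java-backend:8080/records",   # hypothetical downstream service
        headers={"X-Trace-Id": trace_id},
        timeout=5,
    )
    return {"trace_id": trace_id, "status": resp.status_code}
```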
So here's where Observability comes in and actually differs from standard APM tools: it takes a
more holistic, cloud-native approach to things like logging and monitoring. I'll say there are
three major steps for any sort of Observability solution. We'll start with the first one, which
we'll call collect, because we need to collect data. Then we'll go to monitor, since this is,
after all, part of monitoring. And finally we'll end with analyze: actually doing something with
the data we have.
So with the collect step, the first thing is, let's say we've actually made our system
observable. The great thing is that with Kubernetes you get some CPU and memory data
automatically. So let's say we get some of that, we get logs from the application all streaming
to the same location, and let's say we even get some other things like availability numbers or
average latency, the things we want to be able to track and monitor.
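To make the collect step a bit more concrete, here's a minimal sketch, assuming a
Prometheus-style scrape setup (one common option, not the only one), of an application exposing
request counts and latency while Kubernetes keeps supplying CPU and memory data at the cluster
layer. The metric names are illustrative.

```python
# Minimal sketch: expose request latency and error counts so a collector
# (e.g. a Prometheus-style scraper) can pull them; Kubernetes already covers
# CPU and memory at the cluster layer.
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["status"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request():
    with LATENCY.time():                        # record how long the request took
        time.sleep(random.uniform(0.01, 0.2))   # stand-in for real work
    REQUESTS.labels(status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)   # metrics served at :8000/metrics for scraping
    while True:
        handle_request()
```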
That brings me to my next step. Once we have this data available, we need to be able to actually
do something with it, at least visualizing it, even if we're not solving problems yet. What do we
do with this data? Well, maybe we create some dashboards to monitor the health of our
application, and say we create multiple dashboards to track different services or different
business objectives: high availability versus latency, that kind of thing. Now, the final thing I
want to talk about here is what we do next. Say we found some bug in the application by looking
at our monitoring dashboards, and we need to dive in deeper and fix the problem with the Node
app. The great thing is that an Observability solution should allow you to do just that; it lets
you take things even a step further, because these days you're getting a lot of that information
from the Kubernetes layer. This is something I want to quickly pause and talk about. APM tools in
the past were really focused on resource constraints: CPU usage, memory usage, that kind of
thing. These days that's been offloaded to the Kubernetes layer, so Observability took APM and
evolved it to the next stage. It pulled things up a level and enables users to focus on SLOs and
SLIs, Service Level Objectives and Service Level Indicators. These let you focus on the things
that matter to your business, like making sure that latencies are low or that application uptime
is high.
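To show the SLI/SLO idea with some made-up numbers, the sketch below turns raw request counts
into an availability SLI and compares it against a "three nines" SLO target to see how much of
the error budget has been used.

```python
# Minimal sketch with made-up numbers: turn raw counts into an availability
# SLI and compare it to an SLO target.
total_requests = 1_000_000
failed_requests = 420

sli = (total_requests - failed_requests) / total_requests   # availability SLI
slo_target = 0.999                                          # "three nines" objective

error_budget = 1 - slo_target                               # fraction of requests allowed to fail
budget_used = (failed_requests / total_requests) / error_budget

print(f"Availability SLI:  {sli:.5f}")
print(f"SLO target:        {slo_target}")
print(f"Error budget used: {budget_used:.0%}")
```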
So I think those are the crucial three steps for any sort of Observability solution. Let's take a
step back again. These things can be hard to set up on your own with open source projects,
pulling all the different pieces together, so you might be looking at an Enterprise Observability
solution. When you're comparing vendors and looking at building out your enterprise observability
capability, I would look at three main things. Let's start with automation. Every step of the
way, we need to make sure automation is there to make things easier. So let's say our dev team
pushes out a new version of the Node app and goes from v1 to v2, and let's say they inadvertently
introduce a bug: instead of making a bulk API call, they now make individual API calls to the
Python app. In our monitoring dashboard, our Ops team says, hey, something's wrong, the database
is getting a lot of requests, what's going on? Well, you need to be able to automatically go back
and trace through the requests and identify what happened.
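As a rough sketch of the kind of regression described here, with hypothetical endpoints and
payloads, v1 sends one bulk call while v2 accidentally sends one call per item, which is exactly
the sort of request spike the Ops team would notice downstream.

```python
# Hypothetical endpoints, illustrating the v1 vs v2 behavior described above.
import requests

items = [{"id": i} for i in range(500)]

def sync_items_v1(items):
    # v1: one bulk call carrying all items.
    requests.post("http://python-app:5000/items/bulk", json=items, timeout=5)

def sync_items_v2(items):
    # v2 (the bug): one call per item, so the downstream services suddenly
    # see hundreds of requests where they used to see one.
    for item in items:
        requests.post("http://python-app:5000/items", json=item, timeout=5)
```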
That actually brings me to my second point as well, which is context. It's always important to
have that context. Automation matters here because when upgrading to the new version of the Node
app, you want to make sure the right agent is automatically installed and the instrumentation is
in place, so your dev team doesn't have to do that by hand, and as new services get added you
want your monitoring dashboards to be automatically updated as well.
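As one illustration of what that automatic instrumentation can look like, here's a minimal
sketch assuming an OpenTelemetry-style setup, which is my own assumption rather than any specific
vendor's agent: the instrumentation is attached once, and outgoing HTTP calls get traced without
the dev team writing any tracing code by hand.

```python
# Minimal sketch, assuming an OpenTelemetry-style setup: attach the
# instrumentation once and outgoing HTTP calls emit trace spans automatically.
import requests
from opentelemetry.instrumentation.requests import RequestsInstrumentor

RequestsInstrumentor().instrument()   # patches the requests library once

# Application code stays unchanged; no hand-written tracing needed.
requests.get("http://java-backend:8080/records", timeout=5)   # hypothetical service
```

In practice an agent or an auto-instrumentation wrapper would do this attachment for you at
deploy time, which is the whole point of the automation.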
And that context is extremely crucial, because in this example we needed to be able to trace that
request back to the source of the problem. Once we've traced it back to the source with the
context we have, we get to the third step, and I think probably one of the most important:
action. What do we actually do now? That brings me back to the analyze phase, which, remember, we
talked about as an evolution from traditional APM tools to the way Observability tools work
today. When you get to this step, you'll probably want to look at the SLIs within the Node app,
and maybe dive in deeper. Maybe you identify that you need to look at the application trace logs,
and in those trace logs you spot some problems and figure out what the fix is. You hand that to
your dev team, so maybe the last step here is fix, and then you rinse and repeat for any other
issues that come up in the future.
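As a small illustration of that analyze step, using made-up trace log records, the sketch below
filters for errors and groups them by service and endpoint so you can see where the fix belongs.

```python
# Made-up trace log records, illustrating the analyze step: find the errors
# and see which service and endpoint they cluster in.
from collections import Counter

trace_logs = [
    {"trace_id": "a1", "service": "node-frontend", "endpoint": "/checkout", "status": 200},
    {"trace_id": "a1", "service": "python-app", "endpoint": "/items", "status": 500},
    {"trace_id": "b2", "service": "python-app", "endpoint": "/items", "status": 500},
    {"trace_id": "c3", "service": "java-backend", "endpoint": "/records", "status": 200},
]

errors = [r for r in trace_logs if r["status"] >= 500]
by_source = Counter((r["service"], r["endpoint"]) for r in errors)

for (service, endpoint), count in by_source.most_common():
    print(f"{count} errors in {service} {endpoint}")   # points the dev team at the fix
```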
So I think Enterprise Observability is extremely crucial when we're looking at the bigger
picture, because it's not just about having the individual pieces, which, again, might be quite
hard to set up with purely open source approaches. You want automation, to make sure things are
set up seamlessly and to reduce the overhead on your side. You want context, to be able to see
how services work with each other, maybe even generating things like dependency graphs to see the
broader view, because you won't always have a lightboard like this to see the architecture so
cleanly. And finally, you want to be able to take action when you do find a problem: making sure
your Observability solution has a way to automatically pull together data from multiple sources
and multiple services, and then figure out what's valid and necessary for you to make that fix
happen. IBM is invested in making sure our clients can effectively set up Enterprise
Observability with the recent acquisition of Instana. To learn more about the acquisition, or to
see a showcase of the capabilities, be sure to check out the links in the description below. As
always, thanks for watching our videos. If you liked the video or have any questions or comments,
be sure to drop a like, question, or comment below. Be sure to subscribe and stay tuned for more
videos like this in the future. Thank you.