- Good morning. Thank you for turning up so early. I'm very excited and surprised to see that you all managed to get out of bed. I'm gonna start by admitting
a little secret to you. One of my very first jobs
I got early on in my career was as a junior verification
and modeling engineer. I had no idea what this was. I knew two things. I knew it was programming, and I knew they were gonna pay me. I knew absolutely nothing
else about this job. Would anyone else like to admit that they've done the same thing? Fantastic. (laughing) That makes me feel a lot better. So what this job was actually about was building silicon
chips, not processors. They were actually for networking. But what's really interesting
about this process is these designs start off
looking a lot like software. You write some syntax in some text files and through a very long
and drawn out process, they get manufactured
into these silicon wafers which then become your chip. This process is extremely expensive. A conservative estimate might
put it at $5 or $10 million, and that's really being
conservative about it. So this is where my job came in. My job, as it turned
out, was to write tests for these designs in C++
because you want to get the design right the first time. That rarely happens, but that's the idea. And so we would incorporate
simulations of these designs into C++ code and write tests for them. Because this process of making these chips is so expensive and so drawn out, when you get a bug in the chip, you really want to make
sure that you can fix it and work out what's going on without having to change the chip. And so you don't
have a lot of the niceties that you have when
you're building software. You don't have any
debugger in the same sense. There's no running a debug build. There's no just adding some extra logs. What you do have though, is a huge amount of
statistics in the chip. Huge amount. Little registers that you
can read numbers out of and these numbers are
accumulating various bits of information about
what's going on inside. And this is all you have to debug anything that might go wrong in the chip. And you have to make sure
it's in the chip design from the start. There's no adding it in afterwards. This is not possible. So let's compare this to how we build C++. We write a similar set of code and the equivalent to
manufacturing I guess, would be putting it into production, so compiling it, testing it,
deploying it, shipping it. But how often do we think about how much this actually costs? It could cost a very
small amount of money. Or is it actually quite expensive? How easy is it to actually change code that we've put into production? This is gonna depend heavily
on what domain you're in, but it's worth thinking about because I think even if it's really easy for you to update your code in production, you want to do it as little as possible. So what do we mean by production ready? We talk about this a lot,
especially in code reviews, for example. Is this code production ready? I wouldn't do this in production. Things like that. Maybe if it compiles,
it's production ready? I think we've all worked at
places where this is true. If it's tested, is it production ready? Well, one thing that production ready does not mean is bug free. You always have to account for the fact
that something might go wrong in production. So how about this. Observable. Think back to that chip. We made sure when we
were designing the chip that it was observable
so we could work out what goes wrong when it
has been manufactured. So this is what I want to talk about. How can we make our software observable so that when something
goes wrong in production, we can easily work out what went wrong and how we can fix it. So let's talk about instrumentation. It's in the title of the talk. Let's get a definition. Should always start with
a definition, right. Wikipedia says instrumentation
is a collective term for measuring instruments
used for indicating, measuring, and recording
physical quantities. Fantastic, that's quite useful to know. And even better, it says in the context of computer programming,
instrumentation refers to an ability to monitor or measure the level of a product's performance
to diagnose errors and write trace information. This is exactly what it is. But we should get a second opinion, so let's ask Urban Dictionary. Great source of all technical knowledge. Urban Dictionary says, has its
little example at the bottom, "Hey John, look at that
expensive instrumentation." And what Sue's actually talking about here is the fact that if
you add instrumentation to your software, it may have overheads, so it's actually expensive. So what I'm telling
you is Urban Dictionary is a great source for C++ programmers. So instrumentation allows us
to observe deployed software. We can monitor that it's working, we can measure the
performance of it as it runs, and we can diagnose errors when they occur and trace them back to what caused them. Specifically in this talk,
we're gonna talk about source instrumentation. So this is adding code to your source, adding features to your
code to instrument it. And this is built in to
your production releases. I really want to emphasize this. The instrumentation is
not something you just do in your debug build. It's something that's in your
product, in your software. Your software becomes instrumented
and runs in that form. And there are some
alternatives to doing this, but this is not what we're
gonna talk about today. The best form of instrumentation,
printf, I mean logging. Always gotta log. So this is what logging is for. At some point early in the morning, you will get a message
saying your software crashed. This will happen, prepare for it. And you will ask, what's in the log. And what they will tell
you is std::bad_alloc. And you then have to work out why. And you might then question why you became a software developer. Maybe you start looking for new jobs on Stack Overflow, but a few hours later you'll probably ask them
to reboot the machine and maybe it will fix everything. Okay. Let's get a bit technical now. So I have this little example program that we're gonna work through in the talk. Just a basic shell of a C++ program, does some processing and
any exception that occurs in the processing is printed out. We probably all wrote little snippets of code like this.
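For reference, a minimal sketch of what that shell might look like; the process_file name, the filename and the permission error are placeholders for illustration, not the talk's actual slide code:

```cpp
#include <cerrno>
#include <exception>
#include <iostream>
#include <string>
#include <system_error>

// Hypothetical stand-in for the talk's process-file function: it would open
// the file and process its contents; here it just fails in the way the
// example error suggests.
void process_file(const std::string& path) {
    throw std::system_error(EACCES, std::generic_category(), path);
}

int main() {
    try {
        process_file("data.txt");  // placeholder filename
    } catch (const std::exception& e) {
        // Any exception that occurs in the processing is printed out.
        std::cerr << "error: " << e.what() << '\n';
        return 1;
    }
    return 0;
}
```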
But what we're gonna focus on is this error that's coming out of the code, and we can assume that it's coming out of this process file function that we've written, which maybe reads a file and processes the contents of the file. So what we really need
is a lot more information about this error. This permission denied
message is not nearly enough information for us to
work out what's gone wrong, so we could add some context. And this is where we start
talking about logging. Maybe we use cout just to print
some extra information out, and in this case wouldn't it
be useful to know which file we were accessing where
the permission was denied. Maybe in addition, we want some context about why this operation
is even occurring, what user caused the operation, what connection the request came from. Of course, this will vary depending on the sort of software you're working on. And then on top of this we can
clean up this error message that we're sending to the
user through the exception because the user doesn't care that permission for
something has been denied. What they care about is
that their request failed and maybe we tell them a
little bit of information. We don't expose too much of the internals of our application, but
this is a good start. Now we have a lot more information and maybe we can work out what went wrong.
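As a rough sketch of what that might look like (the context fields, names and messages here are invented for illustration):

```cpp
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical request context; what you actually record will depend on
// your application.
struct RequestContext {
    std::string user;
    std::string connection;
};

void process_file(const std::string& path, const RequestContext& ctx) {
    std::ifstream file(path);
    if (!file) {
        // Log the details we need to debug this later...
        std::cerr << "failed to open " << path
                  << " user=" << ctx.user
                  << " connection=" << ctx.connection << '\n';
        // ...but give the caller, and ultimately the user, a cleaner message.
        throw std::runtime_error("request failed: could not read input");
    }
    // ... process the contents ...
}
```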
With these little bits of writing stuff to the screen, at some point we'll think, okay, we're definitely logging
now so we should use a logging library, be
professional about it. So maybe we use something
off the Internet, maybe we get something to
come up, package manager, maybe we write one ourselves, maybe the company has
their own logging library. Whatever it is, you know it's
gonna look a bit like this. Maybe you get some ability
to use format strings, you can specify severity levels, and have lots of useful
features for writing log files and maintaining them.
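Very roughly, the shape tends to be something like this; the Logger class, severity names and methods are made up for illustration rather than taken from any particular library:

```cpp
#include <iostream>
#include <string>

// A hypothetical logging interface, sketched just to show the usual shape:
// severity levels, a message, and central control over where it all goes.
// Real libraries add format strings, timestamps, file rotation and filtering.
enum class Severity { Debug, Info, Warning, Error };

class Logger {
public:
    explicit Logger(Severity min_level) : min_level_(min_level) {}

    void log(Severity level, const std::string& message) {
        if (level < min_level_) return;   // severity filtering
        std::clog << message << '\n';     // a real library would write to
    }                                     // managed, rotated log files

private:
    Severity min_level_;
};

void example(Logger& log, const std::string& path, const std::string& user) {
    log.log(Severity::Info, "reading file " + path + " for user " + user);
}
```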
But this really isn't what's interesting, I think, about logging. The problem with logging is us, the humans. We've made log files to be human readable so that we can read them
and understand quickly how to fix a problem. And this is fine if you've
got a little log file or a couple of files to read. The problem gets increasingly
worse when your software grows or more people are using your software. You gradually become less and less happy because we don't scale, we can't read thousands of log files. It's not possible. And so we started to
imagine this extra layer between us and our logs, which for
the purposes of this talk, we'll just call magic. And this processes all
this log data we've got and puts it into a nicer form
for us to be able to read. I'm not gonna go into too
much detail about this magic, but roughly speaking, what it refers to is a growing ecosystem of
software and services that let you process and
understand your logs better. And there's a huge amount of software that does this for you, and if you're running your
applications on the cloud, then you can probably
just pipe all of your logs into some service and it
will sort them out for you. So what this gives you typically, is some ability to
search through your logs for particular errors
or a particular time. It might do some reporting
for you, produce some metrics, tell you how many errors
occurred on a particular host, for example, and maybe
it'll give you some alerting so you can only be notified when really important things happen. You're not constantly
reading through these logs. So this is a very rough
overview of these systems. The problem comes if we
want to use these systems all we have to feed them
is this human readable text that we munged together
from all this information. So the first thing a
lot of these systems do is process this data
into a much nicer format, give some structure to it. Fill out, for example, the
username and the IP address so we can search on these
things and we can index all of our logs and all
of our errors by user and say which errors
occurred for this user. These are typically
implemented in various ways, but the most common is this
mess of regular expressions you end up writing to parse
out the data from your logs. But this is insane. We already had those bits of data in a nicely structured
form in our application, and we merged them into a text format, which we then parsed back
into a structured format. What we really want to do is just output the data that we had in a structured form. It's still human readable if we need to, but it makes it so much easier for machines to process it.
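For example, we might emit one JSON object per event instead of a free-form sentence; the event and field names below are purely illustrative:

```cpp
#include <iostream>
#include <string>

// A minimal sketch of structured logging: emit one JSON object per event so
// that downstream tooling can index fields directly instead of regex-parsing
// prose. Field names are illustrative only, and a real implementation would
// also escape the values properly.
void log_file_read_error(const std::string& path,
                         const std::string& user,
                         const std::string& ip) {
    std::clog << "{"
              << "\"event\":\"file_read_failed\","
              << "\"path\":\"" << path << "\","
              << "\"user\":\"" << user << "\","
              << "\"client_ip\":\"" << ip << "\""
              << "}\n";
}
```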
And this is what we want to do: we want to automate the processing of this
huge mess of log data that we've acquired. And this is becoming very
popular in other languages. It doesn't seem to have
caught on so much in C++ yet, but I think it's something
that is worth mentioning because it allows us to eliminate all of this unnecessary work. And this idea of structuring our data brings us on to the
more interesting topic, in my opinion, which is tracing. Tracing is basically logging, but it gives a bit more information about what your logs mean. So a trace is typically
something that has a start and something that has an end. So an operation that
takes some amount of time. And the questions we're
interested in asking and answering with tracing is what caused the error. We want to build up a history
of what happened in our system so that we can trace back
the source of the error. And the other thing
we're interested in doing is looking at performance, so how long did something take. A good example of this is
strace, brilliant little utility which instruments some
program and logs out the system calls that
are used by the program. So if we look at this example
of catting some file out, it will tell us that open was called because we need to read the file. And later on, we call another system call to read the data out of the file. The first thing this tells
us is what actually happened. The system call, the arguments
and even the error message, the error code, sorry, which
is very useful in itself, but we also learn the
time that this occurred and how long it took so
we can start looking for bottlenecks in the code. So the way this tends to evolve is you start with having some
logging in your application. Maybe you have this process file function that we touched on earlier. It opens a
file, and then reads the data and processes each line
of the file in some way. And we log it and we
log which file we read and we log which user did it. Useful. Wouldn't it be nice if
we knew when it ended because then we know how long it takes. We know when our code
is finished with a file and it closes it off. So eventually we realize,
okay, now we're doing tracing, we're not doing logging anymore, so we'll use a library for that. And typically you get
these little utilities that look a bit like your log code, but they'll produce some
instance of some trace object for you and then you can end it when your operation is finished; and with many of these tools, if you don't do it explicitly, the destructor will do it for you. So if your function finishes, then the end trace gets written. And you can add the same
information as we did with the log. Any sort of arbitrary information you think might be interesting, filename, user, IP address, you can add that as well.
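A sketch of what such a utility could look like; this is the general shape rather than any particular tracing library's API, with the destructor supplying the end event:

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// A hypothetical scoped trace: records when an operation starts, lets you
// attach arbitrary key/value metadata, and emits the end event from its
// destructor if you don't end it explicitly.
class TraceScope {
public:
    explicit TraceScope(std::string name)
        : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}

    void add_tag(std::string key, std::string value) {
        tags_.emplace_back(std::move(key), std::move(value));
    }

    void end() {
        if (ended_) return;
        ended_ = true;
        auto elapsed = std::chrono::steady_clock::now() - start_;
        std::clog << "trace end: " << name_ << " duration_us="
                  << std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
        for (const auto& [k, v] : tags_) std::clog << ' ' << k << '=' << v;
        std::clog << '\n';
    }

    ~TraceScope() { end(); }  // the end event is written when the scope exits

private:
    std::string name_;
    std::chrono::steady_clock::time_point start_;
    std::vector<std::pair<std::string, std::string>> tags_;
    bool ended_ = false;
};

void process_file(const std::string& path, const std::string& user) {
    TraceScope trace("process_file");
    trace.add_tag("file", path);
    trace.add_tag("user", user);
    // ... read and process the file; the destructor emits the end event ...
}
```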
So let's look at an example of how this might be output because this data doesn't
have to be output as logs, but it's quite common to do so. Maybe we get these file read
events with some metadata, and then we get the end. Maybe in our application, it opens up a couple of files and processes them. Well, it's common to do something like we looked at with the strace example, which is to only emit one event instead of these two, so we emit the event when the operation ends and we emit all of the metadata, and then instead of the start event, we just record the time that the event started. Alternatively, we could record the duration. You get the same information
regardless of what you put out. Where tracing becomes really interesting is how you build up
relationships between operations. So if we have some connect operation, some user starts a
connection to our service, we can trace that. Then inside that connection, maybe we have to read some files, so we have another operation
within an operation. And so what we can do is
actually link these together through some scheme or another,
some incrementing integer or some sort of you UUID, and
we can tag the inner trace with the ID of the outer trace. So now we get this relationship
between the traces. You typically don't
have to worry about this because if you're using
a sufficient library, then it will have support for
building these relationships. So if you create your trace
and you create your inner trace from the first trace, then they will get linked together through some identifier.
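As a simplified sketch of that linking (the ID scheme here is deliberately naive and all the names are invented):

```cpp
#include <atomic>
#include <cstdint>
#include <iostream>
#include <string>
#include <utility>

// A sketch of how parent/child traces might be linked: every trace gets an
// ID from a global counter, and a child trace records the ID of the trace it
// was created from. Real libraries usually hide this behind their API.
class Trace {
public:
    explicit Trace(std::string name, const Trace* parent = nullptr)
        : name_(std::move(name)),
          id_(next_id_.fetch_add(1, std::memory_order_relaxed)),
          parent_id_(parent ? parent->id_ : 0) {}

    Trace child(std::string name) const { return Trace(std::move(name), this); }

    ~Trace() {
        std::clog << "trace " << name_ << " id=" << id_
                  << " parent_id=" << parent_id_ << '\n';
    }

private:
    static std::atomic<std::uint64_t> next_id_;
    std::string name_;
    std::uint64_t id_;
    std::uint64_t parent_id_;  // 0 means "no parent" in this sketch
};

std::atomic<std::uint64_t> Trace::next_id_{1};

void handle_connection(const std::string& user) {
    Trace connection("connect");
    Trace file_read = connection.child("file_read");  // linked to the connection
    // ... read the file as part of handling the connection ...
}
```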
What's even more interesting is that you can often write logs to traces, so now
you can associate errors with particular operations,
and this cascades. So you have an error at the
very bottom of your application, you failed to write a file. You can then trace it all the way up to where the client invoked that operation that ultimately failed. But generally, this is where
the name tracing comes from 'cause you get to trace
the origin of your errors. Let's look a little bit
more at these relationships. We can trace errors from
client to root cause, but we can also see within an operation where bottlenecks are. Is one operation taking
longer than another. And this is actually essential for systems with any sort of concurrency in them. Because if we imagine two
concurrent streams of processing going on here, the interleaved bits of one happen, then
bits of the other happen, then bits of the next one happen. So by building these relationships, we actually know the correct flow and we don't have to infer it from just reading through the log file. This log file isn't ordered,
it's just a complete mess of all the things that are
concurrently happening. So, pretty pictures time. A really nice way to
visualize these traces is with little timelines. We can, for example,
display all of the traces for a particular user. Every time a user causes
a file to be read, we can display that,
but what we can also do is then display the
connection for that user and we can draw the links between them. So we start getting this sort
of, it looks a little bit like a call graph: the connection triggers the file read and
we can add the other data in the file alongside it
as well if we're interested in correlating the two together. This becomes even more
interesting when you think about software that is broken up
into multiple address spaces, for example multiple processes, or if you have a distributed application where parts of it are running on different nodes. You can then link operations which occur in completely different processes together, and you can say the cause
of this was actually something that happened somewhere else, not in my address space. So this has got a lot
of benefits, and all we've done is take our logs and add a little bit more structure to them. That's all we're doing. So in a C++ application, we
care a lot about performance, so we need to be very
careful about the overheads which we incur. One thing we're particularly interested in is being able to disable our tracing, and this is similar to what you might do for your logs. You want to be able to turn them off, and when you turn them off, you ideally want them to
have no overhead at all. You can't always get this low, but with some systems you can.
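One possible way to get close to zero cost when tracing is disabled is to make the check a compile-time constant so the compiler can drop the call entirely; this is just one approach, assuming a build-time flag:

```cpp
#include <iostream>
#include <string_view>

// One possible approach, assuming tracing is switched on or off at build
// time: when the flag is 0, the branch is a compile-time constant and the
// compiler can remove the tracing code from the hot path entirely.
#ifndef TRACING_ENABLED
#define TRACING_ENABLED 0
#endif

inline void emit_trace_event(std::string_view name) {
    std::clog << "trace: " << name << '\n';  // stand-in for real emission
}

inline void trace_event(std::string_view name) {
    if constexpr (TRACING_ENABLED) {
        emit_trace_event(name);
    }
}

void process_request() {
    trace_event("process_request");  // compiles away when tracing is disabled
    // ... actual work ...
}
```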
And then what we care about is, of course, the overhead when the tracing is turned on, because when it's turned on is when it's valuable. And there are a few things we can do. A very common technique
is to sample your traces, so only actually emit information
for every 10th operation or every 100th operation.
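A minimal sketch of that sampling idea, with a hard-coded sample rate standing in for whatever configuration a real system would use:

```cpp
#include <atomic>
#include <cstdint>

// A minimal sketch of trace sampling: only every Nth operation actually
// emits a trace. The rate here is a hard-coded assumption; real systems
// often make it configurable or adaptive.
constexpr std::uint64_t kSampleRate = 100;  // emit 1 in every 100 operations

inline bool should_sample() {
    static std::atomic<std::uint64_t> counter{0};
    return counter.fetch_add(1, std::memory_order_relaxed) % kSampleRate == 0;
}

void traced_operation() {
    const bool sampled = should_sample();
    if (sampled) {
        // ... start a trace and attach metadata ...
    }
    // ... do the real work ...
    if (sampled) {
        // ... end the trace / emit the event ...
    }
}
```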
We could aggregate our trace data, so instead of outputting all the details of every trace, we just output the count
of things that happen and we just output the total time that all of the operations took. This still has a lot of value
because we can then see, for example, the average
time that our operation took. And if you're thinking
this sounds a lot like what I get from my profiler,
then you'd be right. In a sense, a profiler that you might use for development time is
like a very specialized type of tracer for finding
performance bottlenecks. And then we can also build tracing systems that don't actually use logging but a much more efficient format than formatting and writing out huge blobs of text to files. We can imagine much better
optimized binary formats. If you think about the tool tcpdump, this is a tracing tool, fundamentally. It's tracing your network
activity in and out of a node and it has a very specific
format for storing the traces, the network packets, in a file. And if we go to look at the Linux kernel, they have a tracing file format as well. Linux has this very elaborate mechanism of adding trace points into the kernel so you can see what's happening. With this, the big problem
with any sort of tracing and any sort of logging
is that the overhead grows as your application does more things. If your application does 1,000 things, then you have to trace 1,000 times. If it does 10,000, you
have to trace 10,000 times, unless you're doing some
sampling, of course. So the overhead always
grows as your application starts to do more things or
you want to trace more parts of your application,
which leads us on to why we want to talk about metrics. What are metrics? They're just numbers, interesting numbers polled periodically. And a really good example of this is htop. Have you ever seen a screen like this? Little utility you can run and get it on most operating systems. This happens to be htop in particular, but fundamentally they're all the same. You get a lot of really useful information about your system. You get your CPU usage, you
get the amount of memory that you're using, number of
processes that are running, the uptime of your server. Extremely useful. But notice the wide array of
different types of numbers. We've got duration, we've got counters, we've got absolute values like memory, and we've got relative
values like CPU usage. This is really useful. And typically what we want a metric for is the history of it, so if we
have memory usage of a node, when our bad_alloc occurs
in our production system, we're able to look at this
memory metric and see, well, something happened here, we should probably take a look at that. And then we can build
alerts on top of this. So say, if our memory increases
over a certain threshold, then tell us and maybe
if we're clever enough, we can find out why it broke and everything goes back to normal. The typical workflow for
system metrics at least, is very commonplace. This has been a technique
that's been around a long, long, long time. You collect the metrics from your servers, you store them somewhere,
and then you analyze them. And there are a huge number of systems that will do this for you, and again, if you're using some sort
of cloud provider, well, they also have a metrics
collection system you can use. What we want to do when
we're developing software is hook into this. We want to expose our own
metrics and have them collected and have them analyzed and alert on them and we want to get all the
same benefits that we get when we're monitoring our
infrastructure and our service, our temperatures and CPUs. Let's look at what a metric is made of. We give it a name, temperature,
some sort of count, number of things happened, and we tag it if we have
multiple versions of that metric. If we have multiple hosts, each of which has some temperature sensor, then we can tag it and
say specifically which one this is a measurement for. And of course the value. And then the timestamp, which
we took the measurement at. This example happens to be OpenMetrics. It's an evolving open standard for passing metric data between systems. Let's get back to some code. That's why we're here. Same example as earlier,
little process file function. Well, with a metrics library, it will be best to use the library that your infrastructure recommends you use. So if you're using a particular
type of monitoring software for your infrastructure,
then they will probably have a C++ client that you can use
to expose your own metrics. What we can do is start
building metrics in, for example, maybe we
want to count something. Maybe we want to count the
number of times we read a file. So we can add a little
counter, we can add some tags and some metadata to do it,
and then we can increment it every time we read a file. The counter itself will
typically look like this. It will have some integer inside it, probably an atomic integer, some function to increment the count, and a function to obtain the count. Not very complicated.
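Roughly along these lines; a real metrics client adds registration, labels and exposition on top of this:

```cpp
#include <atomic>
#include <cstdint>

// A sketch of the counter described above: an atomic integer with an
// increment that stays as cheap as possible, plus a way to read the value
// when the metrics are collected.
class Counter {
public:
    void increment(std::uint64_t by = 1) {
        count_.fetch_add(by, std::memory_order_relaxed);
    }

    std::uint64_t value() const {
        return count_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> count_{0};
};

Counter files_read;  // hypothetical "files read" metric, tagged in a real library

void process_file(/* ... */) {
    files_read.increment();
    // ... open and process the file ...
}
```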
The idea behind this counter, though, is to keep that increment as lightweight as possible. We want it to do as little as possible so that when we add it into our code, every time we read a file,
the overhead we're adding in to our application is
negligible and a lot cheaper than if we were to put
a log in the same place. Typically, this will vary
depending on the library and the infrastructure you're using, but how you collect these
counters will be through some thread that's running
periodically and picking up each value from each counter. You have some registry
somewhere in the library of all the counters and all the metrics; you run through each of them, pull out the value and publish them.
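Sketching that collection side (the registry and the publishing step here are invented for illustration; real libraries expose values over HTTP, push them to a collector, and so on):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <thread>

// A sketch of the collection side: counters live in a registry (assumed to be
// populated at startup), and a background thread wakes up every few seconds,
// reads each value and publishes it. Here "publishing" is just printing.
struct Counter {
    std::atomic<std::uint64_t> count{0};
};

std::map<std::string, Counter>& registry() {
    static std::map<std::string, Counter> counters;
    return counters;
}

void collection_loop(std::chrono::seconds interval) {
    for (;;) {
        std::this_thread::sleep_for(interval);
        for (const auto& [name, counter] : registry()) {
            // The heavyweight work (formatting, I/O) happens here, once per
            // collection interval, not on the hot path.
            std::clog << name << ' '
                      << counter.count.load(std::memory_order_relaxed) << '\n';
        }
    }
}
// On the hot path you would look a counter up once, keep a reference to it,
// and only ever do the cheap fetch_add.
```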
This is quite heavyweight work. Maybe we're formatting the metrics into text, maybe we're sending them
over some network socket or writing them to a file,
but it doesn't matter because we were only doing it
every time we pull the data. We're only doing it every five
seconds or every 10 seconds. That increment, the thing
that we actually put in the critical path of our
code is still extremely cheap. So having these counters
doesn't incur too much overhead. I'm sure some of you are thinking though, I can definitely do better than this. I know you're thinking it. It sounds like a really
interesting problem to really optimize that increment, to get it as fast as possible. Well, yeah, other people
have thought about it too and they thought it was really interesting and they wrote papers on it and they tried to standardize it. So if you're really interested
in how you can write really efficient counters,
there's a good paper for you to go and read. Let's look a different example. What about if we wanted to
count the number of times we read a line from a file. This is a much more frequent occurrence than just reading a file in its entirety. So this loop is critical performance wise, but it's still fairly heavyweight. We're pulling out a line from a file, processing it some way,
parsing it, and so on and so on. But this still makes it a good candidate for adding a metric to, because the increment relative to what you're doing is fairly lightweight.
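A sketch of what that loop might look like with a counter added; parse_line and the counter name are placeholders:

```cpp
#include <atomic>
#include <cstdint>
#include <fstream>
#include <string>

std::atomic<std::uint64_t> lines_read{0};  // hypothetical "lines read" metric

inline void parse_line(const std::string&) { /* real parsing work here */ }

void process_file(const std::string& path) {
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line)) {
        // One relaxed atomic increment per line: cheap relative to the
        // parsing work, but still a nonzero cost in a hot loop.
        lines_read.fetch_add(1, std::memory_order_relaxed);
        parse_line(line);
    }
}
```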
However, it's still a nonzero overhead, and there will be situations where the cost of incrementing a counter
is still an overhead to your operation, so
we have to think about is the information we're
getting valuable enough to warrant slowing down the code. Let's look at the data that we might get out of this counter. Say we're pulling it every five seconds as our application
starts up, counter zero, nothing's happening, and then
something starts happening and we start processing data,
the numbers start going up. This is meaningless. You can look at that and
really infer nothing other than the number went up a bit. What we really want to do
is visualize it of course, and now we get to see some
really interesting things. We can see roughly when processing starts and when the processing finishes. These flat areas are
where nothing's going on. And we can see roughly the
number of lines that we've read through each file, when this leveling off occurs. What's even more interesting
is when we post-process the output from this
metric and for example, graph the rate that the
counter is increasing at. Now it becomes even more obvious when our operation is
starting and stopping. We can see the first one roughly takes 40 seconds and we can see very clearly that two operations
occurred and we can see the throughput, so we can
actually see the performance of our processing loop on the graph. And even better what we can do is we can see if the performance changes throughout the processing,
and this is something you wouldn't typically see
if all you do is collect the time your operation took and the number of lines you processed. You would get an average
over the whole operation. What you see with this counter is you see if the rate changes. So this excites me a lot. If we tag our metrics in a nice way, we can correlate what's
happening depending on different dimensions in our system. If we have multiple users, we
can see that one of the users in our system, when they begin requesting, affects the performance of the other one. And we can do other processing
to these numbers as well. We can graph the sum of
the rate for all the users in our system, and now
we learned even more. We learned that there's some sort of startup period with slower performance, and then after some time, we see the performance increase, so maybe this is an
effect of file caching. Once you've read the file
once, it gets stored in memory and so it's faster to
process it a second time, and you can see that
there's some sort of limit. So even though we have
two users running requests in parallel, there's some sort
of ceiling to our performance and all of this information
comes from adding just one counter to your loop that's doing your interesting processing. I think this is pretty cool. And we can go one step further with our metrics and our pretty graph. We can put this side by
side with our tracing data that we found earlier, and
now we start to fill in even more gaps in our knowledge. Just looking at the metric,
we don't necessarily know whether this is two distinct operations or whether it's one operation
that happened to dip in performance very drastically, but if we look at the
tracing data alongside it, we can verify that in fact, it
was two distinct operations, and from the metrics, we
can see the performance within the operations. I think this is a really valuable lesson to take away. Just very simple additions to your code can tell you so much information, but you do have to put a
bit of thought into it. It's not effortless. So, we're at the end. Nearly time for some coffee. I'm desperate for some coffee. What am I trying to say? Develop observable software. Think back to that chip I
talked about at the start. We had to make sure
that chip was observable so we could work out what
happens when it goes wrong. And I think there is a huge advantage in doing this with software as well. Try to debug in development
as you would in production so when something goes
wrong in production, you don't need to install your debugger on your production server. You don't need to install tracing tools or some other form of instrumentation. Your software is
monitorable and observable. Your software becomes its own debugger. But we have to take into account that while there is a lot of
information to be had there, there are overheads in doing it, but it doesn't have to
be expensive as long as you're mindful of where you
use different techniques. We can log and we can log errors because we have to and we should. We can trace things at a
very coarse granularity that happen infrequently, and
then in our more hot loops, we can think about adding
some metrics instead so that there is less overhead incurred. There are always trade offs, and the techniques here
are very complementary. You use them together, and this includes other
types of instrumentation. Just because you're adding
some instrumentation to your code, to your software,
doesn't mean you can't still use your debugger
or tools like strace or other instrumentation tools. That doesn't mean that you
can't use your compiler to add instrumentation as well. But with source
instrumentation you can choose what information to expose that is useful to your particular domain. So if you're writing some sort of video processing framework,
maybe you're interested in counting things like the
number of frames processed. If you're building a database,
maybe you're interested in the number of times a table is accessed or a row in a table is accessed. This is all possible when
you actually think about what instrumentation and what information you want out of your code. So with that, I'm gonna
thank you for coming and I hope you enjoy the
rest of your conference. (audience applauding) We have 10 minutes for questions
and there are microphones if anyone would like, or
I am around until Friday so please feel free to
come and talk to me. Yes. - [Man] So I've had to look
into tracing quite a bit myself and the availability of C++
libraries is poor I think. - I'm sorry, could you just speak a little bit closer? - [Man] The availability
of C++ libraries for tracing is poor at best, I think. And I'm not really aware of any metrics libraries like this. What's your experience? Have you got any suggestions?
there any specific examples of metrics libraries and
tracing libraries for C++. I left this out of the
presentation sort of on purpose because, as the gentleman says, while there is choice, there's no sort of de facto standard. And this is even
true for logging libraries. There are thousands of logging libraries. Every framework has its
own logging library. Every company I've worked in has their own internal logging library; there's no standard for it. But to directly answer your question, the infrastructure I've used the most is a piece of software called Prometheus, and it has a number of C++ clients which let you expose
metrics to Prometheus. So the problem is that these libraries aren't standard and they're often specific to the tool you're using,
which is unfortunate. And for tracing, there is an evolving open source project called OpenTracing, and that
links into a piece of software called Jaeger, which is a
distributed tracing system and that has C++ clients. So those are two things you could look at. But this is an interesting point. I think we could do a lot
better to try and evolve some libraries, maybe not
necessarily in a standard context, but at least as a community
where there are tools which become the de facto standard. So if we use a library from
here and the library from here, we write some code, we
can use a common library to introduce metrics and traces. Yeah. Yes sir. - [Man] Yeah, I guess all three
of us had the same question. Can you say a little more
about this Jaeger library? Does it work with just
other processes across C++ or does it work cross language
as well as cross process? So in my company, we use a scripting language in addition to JavaScript,
in addition to C++, and having visibility across
all three would be great. - So the question was specific to Jaeger, and I guess perhaps generally
with tracing clients, is there any cross language
clients we could use so you can collect tracing information from different parts of your stack. So the answer to that is yes,
Jaeger specifically supports pretty much every language. Jaeger is an implementation
of the OpenTracing standard, and the OpenTracing standard, this OpenTracing has lots of clients for all different languages,
so you could put some traces in your C++ code that
actually call operations in a different language
and then still link those traces together. So I think you mentioned JavaScript. I'm not 100% sure, but
I would be surprised if there wasn't the JavaScript client. There's definitely things
like Python and Ruby and anything like that. (audience member speaking unintelligibly) - [Man] Okay. Alright. I guess just as a comment, we
effectively had to hand roll our own instrumentation library. And for those who might use Intel's TBB, they do have an enumerable
thread-specific storage counter. It kind of does the atomic
thing you were talking about. - Yeah. - [Man] Yeah, so just as an FYI. - Yeah. - [Man] So thank you. - Any other questions? Yes sir. - [Man] Can you just spell Prometheus, how is it spelled so
that I can look it up? - The Prometheus, what is... - How do you spell Prometheus? P-R-O-M etheus. (laughing) It's the same as the film,
the alien film, Prometheus. It's a great metrics collection software. I quite like it myself. Yes sir. - [Man] Hey, I think I heard you say that you should be careful about
not putting instrumentation in everywhere. - Yeah. - [Man] And while it's
easy to agree with that, I think the recommendation
should be instrumentation is a key functionality of your software. You cannot operate the software
without instrumentation and you should put it
everywhere where it's needed, just like basic functionality
of the thing, right. If your thing is supposed
to compute a square root, you compute the square root. - Right. - [Man] If you're supposed
to run the software, you instrument it. - So the comment was
maybe the advice should be you instrument as much as you can and especially things that are important. Is that right? - [Man] Well, in a
former life I was an SRE, and the advice I would give any developer is you define the SLOs
and SLAs for the software that you're building and then
you build the instrumentation that you need to monitor
for those SLOs and SLAs. Period, right. And you want to instrument
a little bit more than that so you can troubleshoot and find out why you're in violation of your SLOs, but you start from what are
the properties of the software that you want to maintain or the system, and then you put all the
instrumentation you need to be able to maintain
those properties, right. If that means you have to buy an extra CPU in AWS or Google Compute Engine in my case, then go ahead and do it.
good point in that sometimes the instrumentation is critical or even a requirement of your software and so any overhead has to be acceptable and you just have to deal with-- - [Man] It's not overhead, right. Overhead is stuff that you don't need. In this case, instrumentation
is something you need or it's hard to have the
functionality of the system. - So this talk is actually
about half the size that it started out as, and I had
a section about exactly this. It went into much more detail about the trade-offs between different types of instrumentation, how much overhead is acceptable, and when
you might use different things. When you need to use instrumentation because it's a requirement
from a customer or you have, for example, an SLA, then
how can you make a trace as efficient as possible. Unfortunately, that turned
out to be far too much to cover in one go. Thank you for your question. Yes sir. - [Man] I was gonna
basically agree, but I think what's interesting with logging, at least in my experience, is that you put logging where you need it to solve a problem. You don't put it in for the future problems you're going to have. So it's usually too late. And I wonder whether tracing,
I suppose your comment is you put it everywhere
or for the SLA or however, but it's a very difficult problem for knowing exactly
beforehand where to put it. I think it's an interesting
topic for people to think about. So you could do it at kind of a business level. So you say, right, okay, we'll put logging, tracing and metrics on
business terms if you like and not do it any lower. And this also depends on
what you're using them for. Are you using them to monitor
or as a debugging tool? Logging is often more a debugging tool whereas tracing is more
monitoring and metrics, I think. - So the summary of that
comment I think is that logging is often more geared towards things that you absolutely need to record, like errors and tracing, and specifically, the gentlemen said, errors
are logs of the things that you know you need
to record whereas traces are typically more for things
you think you might need to record for future issues
that you don't know about. And yeah, I completely agree. I think the more you
start to look at tracing as something distinct from logging, you realize that logging
is really just for errors. It's about adding context to
errors, maybe warnings as well, but that's really all you
should be using logging for and if you have some customer
requirement that you produce a log file in the format, then of course, you need to do some logging. But I think if you generally
all this sort of info logging, the debug logging, we
should be thinking about in a more structured way, perhaps looking at using
some tracing instead. And one of the ideas
of this talk was to try and start a discussion because
I think in the C++ community, we overlook a lot of the problems of actually running C++ in production. It's quite hard work. And we worry too much about angle brackets and syntax and the newest
feature of lambdas. So with that, my time is
up, so thank you for coming. I'll see you around. (audience applauding)
As a sometimes hobby programmer, I found the logical progression very easy to follow. Very welcome. I have a much better idea of how to trace my execution now!
Does anyone know the names of the libraries suggested during the Q&A? I didn't understand the names.
Does anyone know if there is a slidedeck available for this talk?