- Good morning. Thank you for turning up so early. I'm very excited and surprised to see that you all managed to get out of bed. I'm gonna start by admitting
a little secret to you. One of my very first jobs
I got early on in my career was as a junior verification
and modeling engineer. I had no idea what this was. I knew two things. I knew it was programming, and I knew they were gonna pay me. I knew absolutely nothing
else about this job. Would anyone else like to admit that they've done the same thing? Fantastic. (laughing) That makes me feel a lot better. So what this job was actually about was building silicon
chips, not processors. They were actually for networking. But what's really interesting
about this process is these designs start off
looking a lot like software. You write some syntax in some text files and through a very long
and drawn out process, they get manufactured
into these silicon wafers which then become your chip. This process is extremely expensive. A conservative estimate might
put it at $5 or $10 million, and that's really being
conservative about it. So this is where my job came in. My job, as it turned
out, was to write tests for these designs in C++
because you want to get the design right the first time. That rarely happens, but that's the idea. And so we would incorporate
simulations of these designs into C++ code and write tests for them. Because this process of making these chips is so expensive and so drawn out, when you get a bug in the chip, you really want to make
sure that you can fix it and work out what's going on without having to change the chip. And so you don't
have a lot of the niceties that you have when
you're building software. You don't have any
debugger in the same sense. There's no running a debug build. There's no just adding some extra logs. What you do have though, is a huge amount of
statistics in the chip. Huge amount. Little registers that you
can read numbers out of and these numbers are
accumulating various bits of information about
what's going on inside. And this is all you have to debug anything that might go wrong in the chip. And you have to make sure
it's in the chip design from the start. There's no adding it in afterwards. This is not possible. So let's compare this to how we build C++. We write a similar set of code and the equivalent to
manufacturing I guess, would be putting it into production, so compiling it, testing it,
deploying it, shipping it. But how often do we think about how much this actually costs? It could cost a very
small amount of money. Or is it actually quite expensive? How easy is it to actually change code that we've put into production? This is gonna depend heavily
on what domain you're in, but it's worth thinking about because I think even if it's really easy for you to update your code in production, you want to do it as little as possible. So what do we mean by production ready? We talk about this a lot,
especially in code reviews, for example. Is this code production ready? I wouldn't do this in production. Things like that. Maybe if it compiles,
it's production ready? I think we've all worked at
places where this is true. If it's tested, is it production ready? Well, one thing that production ready does not mean is bug free. You always have to account for the fact
that something might go wrong in production. So how about this. Observable. Think back to that chip. We made sure when we
were designing the chip that it was observable
so we could work out what goes wrong when it
has been manufactured. So this is what I want to talk about. How can we make our software observable so that when something
goes wrong in production, we can easily work out what went wrong and how we can fix it. So let's talk about instrumentation. It's in the title of the talk. Let's get a definition. Should always start with
a definition, right. Wikipedia says instrumentation
is a collective term for measuring instruments
used for indicating, measuring, and recording
physical quantities. Fantastic, that's quite useful to know. And even better, it says in the context of computer programming,
instrumentation refers to an ability to monitor or measure the level of a product's performance
to diagnose errors and write trace information. This is exactly what it is. But we should get a second opinion, so let's ask Urban Dictionary. Great source of all technical knowledge. Urban Dictionary says, has its
little example at the bottom, "Hey John, look at that
expensive instrumentation." And what Sue's actually talking about here is the fact that if
you add instrumentation to your software, it may have overheads, so it's actually expensive. So what I'm telling
you is Urban Dictionary is a great source for C++ programmers. So instrumentation allows us
to observe deployed software. We can monitor that it's working, we can measure the
performance of it as it runs, and we can diagnose errors when they occur and trace them back to what caused them. Specifically in this talk,
we're gonna talk about source instrumentation. So this is adding code to your source, adding features to your
code to instrument it. And this is built in to
your production releases. I really want to emphasize this. The instrumentation is
not something you just do in your debug build. It's something that's in your
product, in your software. Your software becomes instrumented
and runs in that form. And there are some
alternatives to doing this, but this is not what we're
gonna talk about today. The best form of instrumentation,
printf, I mean logging. Always gotta log. So this is what logging is for. At some point early in the morning, you will get a message
saying your software crashed. This will happen, prepare for it. And you will ask, what's in the log. And what they will tell
you is std::bad_alloc. And you then have to work out why. And you might then question why you became a software developer. Maybe you start looking for new jobs on Stack Overflow, but a few hours later you'll probably ask them
to reboot the machine and maybe it will fix everything. Okay. Let's get a bit technical now. So I have this little example program that we're gonna work through in the talk. Just a basic shell of a C++ program, does some processing and
any exception that occurs in the processing is printed out. We probably all wrote little snippets of code like this.
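For reference, a minimal sketch of what that shell might look like; the process_file name, the filename and the permission error are placeholders for illustration, not the talk's actual slide code:

```cpp
#include <cerrno>
#include <exception>
#include <iostream>
#include <string>
#include <system_error>

// Hypothetical stand-in for the talk's process-file function: it would open
// the file and process its contents; here it just fails in the way the
// example error suggests.
void process_file(const std::string& path) {
    throw std::system_error(EACCES, std::generic_category(), path);
}

int main() {
    try {
        process_file("data.txt");  // placeholder filename
    } catch (const std::exception& e) {
        // Any exception that occurs in the processing is printed out.
        std::cerr << "error: " << e.what() << '\n';
        return 1;
    }
    return 0;
}
```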
But what we're gonna focus on is this error that's coming out of the code, and we can assume that it's coming out of this process file function that we've written, which maybe reads a file and processes the contents of the file. So what we really need
is a lot more information about this error. This permission denied
message is not nearly enough information for us to
work out what's gone wrong, so we could add some context. And this is where we start
talking about logging. Maybe we use cout just to print
some extra information out, and in this case wouldn't it
be useful to know which file we were accessing where
the permission was denied. Maybe in addition, we want some context about why this operation
is even occurring, what user caused the operation, what connection the request came from. Of course, this will vary depending on the sort of software you're working on. And then on top of this we can
clean up this error message that we're sending to the
user through the exception because the user doesn't care that permission for
something has been denied. What they care about is
that their request failed and maybe we tell them a
little bit of information. We don't expose too much of the internals of our application, but
this is a good start. Now we have a lot more information and maybe we can work out what went wrong.
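As a rough sketch of what that might look like (the context fields, names and messages here are invented for illustration):

```cpp
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical request context; what you actually record will depend on
// your application.
struct RequestContext {
    std::string user;
    std::string connection;
};

void process_file(const std::string& path, const RequestContext& ctx) {
    std::ifstream file(path);
    if (!file) {
        // Log the details we need to debug this later...
        std::cerr << "failed to open " << path
                  << " user=" << ctx.user
                  << " connection=" << ctx.connection << '\n';
        // ...but give the caller, and ultimately the user, a cleaner message.
        throw std::runtime_error("request failed: could not read input");
    }
    // ... process the contents ...
}
```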
With these little bits of writing stuff to the screen, at some point we'll think, okay, we're definitely logging
now so we should use a logging library, be
professional about it. So maybe we use something
off the Internet, maybe we get something to
come up, package manager, maybe we write one ourselves, maybe the company has
their own logging library. Whatever it is, you know it's
gonna look a bit like this. Maybe you get some ability
to use format strings, you can specify severity levels, and have lots of useful
features for writing log files and maintaining them.
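Very roughly, the shape tends to be something like this; the Logger class, severity names and methods are made up for illustration rather than taken from any particular library:

```cpp
#include <iostream>
#include <string>

// A hypothetical logging interface, sketched just to show the usual shape:
// severity levels, a message, and central control over where it all goes.
// Real libraries add format strings, timestamps, file rotation and filtering.
enum class Severity { Debug, Info, Warning, Error };

class Logger {
public:
    explicit Logger(Severity min_level) : min_level_(min_level) {}

    void log(Severity level, const std::string& message) {
        if (level < min_level_) return;   // severity filtering
        std::clog << message << '\n';     // a real library would write to
    }                                     // managed, rotated log files

private:
    Severity min_level_;
};

void example(Logger& log, const std::string& path, const std::string& user) {
    log.log(Severity::Info, "reading file " + path + " for user " + user);
}
```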
But this really isn't what's interesting, I think, about logging. The problem with logging is us, the humans. We've made log files to be human readable so that we can read them
and understand quickly how to fix a problem. And this is fine if you've
got a little log file or a couple of files to read. The problem gets increasingly
worse when your software grows or more people are using your software. You gradually become less and less happy because we don't scale, we can't read thousands of log files. It's not possible. And so we started to
imagine this extra layer between us and our logs, which for
the purposes of this talk, we'll just call magic. And this processes all
this log data we've got and puts it into a nicer form
for us to be able to read. I'm not gonna go into too
much detail about this magic, but roughly speaking, what it refers to is a growing ecosystem of
software and services that let you process and
understand your logs better. And there's a huge amount of software that does this for you, and if you're running your
applications on the cloud, then you can probably
just pipe all of your logs into some service and it
will sort them out for you. So what this gives you typically, is some ability to
search through your logs for particular errors
or a particular time. It might do some reporting
for you, produce some metrics, tell you how many errors
occurred on a particular host, for example, and maybe
it'll give you some alerting so you can only be notified when really important things happen. You're not constantly
reading through these logs. So this is a very rough
overview of these systems. The problem comes if we
want to use these systems all we have to feed them
is this human readable text that we munged together
from all this information. So the first thing a
lot of these systems do is process this data
into a much nicer format, give some structure to it. Fill out, for example, the
username and the IP address so we can search on these
things and we can index all of our logs and all
of our errors by user and say which errors
occurred for this user. These are typically
implemented in various ways, but the most common is this
mess of regular expressions you end up writing to parse
out the data from your logs. But this is insane. We already had those bits of data in a nicely structured
form in our application, and we merged them into a text format, which we then parsed back
into a structured format. What we really want to do is just output the data that we had in a structured form. It's still human readable if we need to, but it makes it so much easier for machines to process it.
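For example, we might emit one JSON object per event instead of a free-form sentence; the event and field names below are purely illustrative:

```cpp
#include <iostream>
#include <string>

// A minimal sketch of structured logging: emit one JSON object per event so
// that downstream tooling can index fields directly instead of regex-parsing
// prose. Field names are illustrative only, and a real implementation would
// also escape the values properly.
void log_file_read_error(const std::string& path,
                         const std::string& user,
                         const std::string& ip) {
    std::clog << "{"
              << "\"event\":\"file_read_failed\","
              << "\"path\":\"" << path << "\","
              << "\"user\":\"" << user << "\","
              << "\"client_ip\":\"" << ip << "\""
              << "}\n";
}
```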
And this is what we want to do: we want to automate the processing of this
huge mess of log data that we've acquired. And this is becoming very
popular in other languages. It doesn't seem to have
caught on so much in C++ yet, but I think it's something
that is worth mentioning because it allows us to eliminate all of this unnecessary work. And this idea of structuring our data brings us on to the
more interesting topic, in my opinion, which is tracing. Tracing is basically logging, but it gives a bit more information about what your logs mean. So a trace is typically
something that has a start and something that has an end. So an operation that
takes some amount of time. And the questions we're
interested in asking and answering with tracing is what caused the error. We want to build up a history
of what happened in our system so that we can trace back
the source of the error. And the other thing
we're interested in doing is looking at performance, so how long did something take. A good example of this is
strace, brilliant little utility which instruments some
program and logs out the system calls that
are used by the program. So if we look at this example
of catting some file out, it will tell us that open was called because we need to read the file. And later on, we call another system call to read the data out of the file. The first thing this tells
us is what actually happened. The system call, the arguments
and even the error message, the error code, sorry, which
is very useful in itself, but we also learn the
time that this occurred and how long it took so
we can start looking for bottlenecks in the code. So the way this tends to evolve is you start with having some
logging in your application. Maybe you have this process file function that we touched on earlier. It opens a
file, and then reads the data and processes each line
of the file in some way. And we log it and we
log which file we read and we log which user did it. Useful. Wouldn't it be nice if
we knew when it ended because then we know how long it takes. We know when our code
is finished with a file and it closes it off. So eventually we realize,
okay, now we're doing tracing, we're not doing logging anymore, so we'll use a library for that. And typically you get
these little utilities that look a bit like your log code, but they'll produce some
instance of some trace object for you and then you can end it when your operation is finished; and with many of these tools, if you don't do it explicitly, the destructor will do it for you. So if your function finishes, then the end trace gets written. And you can add the same
information as we did with the log. Any sort of arbitrary information you think might be interesting, filename, user, IP address, you can add that as well.
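A sketch of what such a utility could look like; this is the general shape rather than any particular tracing library's API, with the destructor supplying the end event:

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// A hypothetical scoped trace: records when an operation starts, lets you
// attach arbitrary key/value metadata, and emits the end event from its
// destructor if you don't end it explicitly.
class TraceScope {
public:
    explicit TraceScope(std::string name)
        : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}

    void add_tag(std::string key, std::string value) {
        tags_.emplace_back(std::move(key), std::move(value));
    }

    void end() {
        if (ended_) return;
        ended_ = true;
        auto elapsed = std::chrono::steady_clock::now() - start_;
        std::clog << "trace end: " << name_ << " duration_us="
                  << std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
        for (const auto& [k, v] : tags_) std::clog << ' ' << k << '=' << v;
        std::clog << '\n';
    }

    ~TraceScope() { end(); }  // the end event is written when the scope exits

private:
    std::string name_;
    std::chrono::steady_clock::time_point start_;
    std::vector<std::pair<std::string, std::string>> tags_;
    bool ended_ = false;
};

void process_file(const std::string& path, const std::string& user) {
    TraceScope trace("process_file");
    trace.add_tag("file", path);
    trace.add_tag("user", user);
    // ... read and process the file; the destructor emits the end event ...
}
```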
So let's look at an example of how this might be output because this data doesn't
have to be output as logs, but it's quite common to do so. Maybe we get these file read
events with some metadata, and then we get the end. Maybe in our application, it opens up a couple of files and processes them. Well, it's common to do something like we looked at with the strace example, which is to only emit one event instead of these two, so we emit the event when the operation ends and we emit all of the metadata, and then instead of the start event, we just record the time that the event started. Alternatively, we could record the duration. You get the same information
regardless of what you put out. Where tracing becomes really interesting is how you build up
relationships between operations. So if we have some connect operation, some user starts a
connection to our service, we can trace that. Then inside that connection, maybe we have to read some files, so we have another operation
within an operation. And so what we can do is
actually link these together through some scheme or another,
some incrementing integer or some sort of you UUID, and
we can tag the inner trace with the ID of the outer trace. So now we get this relationship
between the traces. You typically don't
have to worry about this because if you're using
a sufficient library, then it will have support for
building these relationships. So if you create your trace
and you create your inner trace from the first trace, then they will get linked together through some identifier.
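As a simplified sketch of that linking (the ID scheme here is deliberately naive and all the names are invented):

```cpp
#include <atomic>
#include <cstdint>
#include <iostream>
#include <string>
#include <utility>

// A sketch of how parent/child traces might be linked: every trace gets an
// ID from a global counter, and a child trace records the ID of the trace it
// was created from. Real libraries usually hide this behind their API.
class Trace {
public:
    explicit Trace(std::string name, const Trace* parent = nullptr)
        : name_(std::move(name)),
          id_(next_id_.fetch_add(1, std::memory_order_relaxed)),
          parent_id_(parent ? parent->id_ : 0) {}

    Trace child(std::string name) const { return Trace(std::move(name), this); }

    ~Trace() {
        std::clog << "trace " << name_ << " id=" << id_
                  << " parent_id=" << parent_id_ << '\n';
    }

private:
    static std::atomic<std::uint64_t> next_id_;
    std::string name_;
    std::uint64_t id_;
    std::uint64_t parent_id_;  // 0 means "no parent" in this sketch
};

std::atomic<std::uint64_t> Trace::next_id_{1};

void handle_connection(const std::string& user) {
    Trace connection("connect");
    Trace file_read = connection.child("file_read");  // linked to the connection
    // ... read the file as part of handling the connection ...
}
```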
What's even more interesting is that you can often write logs to traces, so now
you can associate errors with particular operations,
and this cascades. So you have an error at the
very bottom of your application, you failed to write a file. You can then trace it all the way up to where the client invoked that operation that ultimately failed. But generally, this is where
the name tracing comes from 'cause you get to trace
the origin of your errors. Let's look a little bit
more at these relationships. We can trace errors from
client to root cause, but we can also see within an operation where bottlenecks are. Is one operation taking
longer than another. And this is actually essential for systems with any sort of concurrency in them. Because if we imagine two
concurrent streams of processing going on here, the interleaved bits of one happen, then
bits of the other happen, then bits of the next one happen. So by building these relationships, we actually know the correct flow and we don't have to infer it from just reading through the log file. This log file isn't ordered,
it's just a complete mess of all the things that are
concurrently happening. So, pretty pictures time. A really nice way to
visualize these traces is with little timelines. We can, for example,
display all of the traces for a particular user. Every time a user causes
a file to be read, we can display that,
but what we can also do is then display the
connection for that user and we can draw the links between them. So we start getting this sort
of, it looks a little bit like a call graph: the connection triggers the file read and
we can add the other data in the file alongside it
as well if we're interested in correlating the two together. This becomes even more
interesting when you think about software that is broken up
into multiple address spaces, for example multiple processes, or if you have a distributed application where parts of it are running on different nodes. You can then link operations which occur in completely different processes together, and you can say the cause
of this was actually something that happened somewhere else, not in my address space. So this has got a lot
of benefits, and all we've done is take our logs and add a little bit more structure to them. That's all we're doing. So in a C++ application, we
care a lot about performance, so we need to be very
careful about the overheads which we incur. One thing we're particularly interested in is being able to disable our tracing, and this is similar to what you might do for your logs. You want to be able to turn them off, and when you turn them off, you ideally want them to
have no overhead at all. You can't always get this low, but with some systems you can.
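One possible way to get close to zero cost when tracing is disabled is to make the check a compile-time constant so the compiler can drop the call entirely; this is just one approach, assuming a build-time flag:

```cpp
#include <iostream>
#include <string_view>

// One possible approach, assuming tracing is switched on or off at build
// time: when the flag is 0, the branch is a compile-time constant and the
// compiler can remove the tracing code from the hot path entirely.
#ifndef TRACING_ENABLED
#define TRACING_ENABLED 0
#endif

inline void emit_trace_event(std::string_view name) {
    std::clog << "trace: " << name << '\n';  // stand-in for real emission
}

inline void trace_event(std::string_view name) {
    if constexpr (TRACING_ENABLED) {
        emit_trace_event(name);
    }
}

void process_request() {
    trace_event("process_request");  // compiles away when tracing is disabled
    // ... actual work ...
}
```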
And then what we care about is, of course, the overhead when the tracing is turned on, because when it's turned on is when it's valuable. And there are a few things we can do. A very common technique
is to sample your traces, so only actually emit information
for every 10th operation or every 100th operation.
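A minimal sketch of that sampling idea, with a hard-coded sample rate standing in for whatever configuration a real system would use:

```cpp
#include <atomic>
#include <cstdint>

// A minimal sketch of trace sampling: only every Nth operation actually
// emits a trace. The rate here is a hard-coded assumption; real systems
// often make it configurable or adaptive.
constexpr std::uint64_t kSampleRate = 100;  // emit 1 in every 100 operations

inline bool should_sample() {
    static std::atomic<std::uint64_t> counter{0};
    return counter.fetch_add(1, std::memory_order_relaxed) % kSampleRate == 0;
}

void traced_operation() {
    const bool sampled = should_sample();
    if (sampled) {
        // ... start a trace and attach metadata ...
    }
    // ... do the real work ...
    if (sampled) {
        // ... end the trace / emit the event ...
    }
}
```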
We could aggregate our trace data, so instead of outputting all the details of every trace, we just output the count
of things that happen and we just output the total time that all of the operations took. This still has a lot of value
because we can then see, for example, the average
time that our operation took. And if you're thinking
this sounds a lot like what I get from my profiler,
then you'd be right. In a sense, a profiler that you might use for development time is
like a very specialized type of tracer for finding
performance bottlenecks. And then we can also build tracing systems that don't actually use logging but a much more efficient format than formatting and writing out huge blobs of text to files. We can imagine much better
optimized binary formats. If you think about the tool tcpdump, this is a tracing tool, fundamentally. It's tracing your network
activity in and out of a node and it has a very specific
format for storing the traces, the network packets, in a file. And if we go to look at the Linux kernel, they have a tracing file format as well. Linux has this very elaborate mechanism of adding trace points into the kernel so you can see what's happening. With this, the big problem
with any sort of tracing and any sort of logging
is that the overhead grows as your application does more things. If your application does 1,000 things, then you have to trace 1,000 times. If it does 10,000, you
have to trace 10,000 times, unless you're doing some
sampling, of course. So the overhead always
grows as your application starts to do more things or
you want to trace more parts of your application,
which leads us on to why we want to talk about metrics. What are metrics? They're just numbers, interesting numbers polled periodically. And a really good example of this is htop. Have you ever seen a screen like this? Little utility you can run and get it on most operating systems. This happens to be htop in particular, but fundamentally they're all the same. You get a lot of really useful information about your system. You get your CPU usage, you
get the amount of memory that you're using, number of
processes that are running, the uptime of your server. Extremely useful. But notice the wide array of
different types of numbers. We've got duration, we've got counters, we've got absolute values like memory, and we've got relative
values like CPU usage. This is really useful. And typically what we want a metric for is the history of it, so if we
have memory usage of a node, when our bad_alloc occurs
in our production system, we're able to look at this
memory metric and see, well, something happened here, we should probably take a look at that. And then we can build
alerts on top of this. So say, if our memory increases
over a certain threshold, then tell us and maybe
if we're clever enough, we can find out why it broke and everything goes back to normal. The typical workflow for
system metrics at least, is very commonplace. This has been a technique
that's been around a long, long, long time. You collect the metrics from your servers, you store them somewhere,
and then you analyze them. And there are a huge number of systems that will do this for you, and again, if you're using some sort
of cloud provider, well, they also have a metrics
collection system you can use. What we want to do when
we're developing software is hook into this. We want to expose our own
metrics and have them collected and have them analyzed and alert on them and we want to get all the
same benefits that we get when we're monitoring our
infrastructure and our service, our temperatures and CPUs. Let's look at what a metric is made of. We give it a name, temperature,
some sort of count, number of things happened, and we tag it if we have
multiple versions of that metric. If we have multiple hosts, each of which has some temperature sensor, then we can tag it and
say specifically which one this is a measurement for. And of course the value. And then the timestamp, which
we took the measurement at. This example happens to be OpenMetrics. It's an evolving open standard for passing metric data between systems. Let's get back to some code. That's why we're here. Same example as earlier,
little process file function. Well, with a metrics library, it will be best to use the library that your infrastructure recommends you use. So if you're using a particular
type of monitoring software for your infrastructure,
then they will probably have a C++ client that you can use
to expose your own metrics. What we can do is start
building metrics in, for example, maybe we
want to count something. Maybe we want to count the
number of times we read a file. So we can add a little
counter, we can add some tags and some metadata to do it,
and then we can increment it every time we read a file. The counter itself will
typically look like this. It will have some integer inside it, probably an atomic integer, some function to increment the count, and a function to obtain the count. Not very complicated.
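Roughly along these lines; a real metrics client adds registration, labels and exposition on top of this:

```cpp
#include <atomic>
#include <cstdint>

// A sketch of the counter described above: an atomic integer with an
// increment that stays as cheap as possible, plus a way to read the value
// when the metrics are collected.
class Counter {
public:
    void increment(std::uint64_t by = 1) {
        count_.fetch_add(by, std::memory_order_relaxed);
    }

    std::uint64_t value() const {
        return count_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> count_{0};
};

Counter files_read;  // hypothetical "files read" metric, tagged in a real library

void process_file(/* ... */) {
    files_read.increment();
    // ... open and process the file ...
}
```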
The idea behind this counter, though, is to keep that increment as lightweight as possible. We want it to do as little as possible so that when we add it into our code, every time we read a file,
the overhead we're adding in to our application is
negligible and a lot cheaper than if we were to put
a log in the same place. Typically, this will vary
depending on the library and the infrastructure you're using, but how you collect these
counters will be through some thread that's running
periodically and picking up each value from each counter. You have some registry
somewhere in the library of all the counters and all the metrics; you run through each of them, pull out the value and publish them.
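Sketching that collection side (the registry and the publishing step here are invented for illustration; real libraries expose values over HTTP, push them to a collector, and so on):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <thread>

// A sketch of the collection side: counters live in a registry (assumed to be
// populated at startup), and a background thread wakes up every few seconds,
// reads each value and publishes it. Here "publishing" is just printing.
struct Counter {
    std::atomic<std::uint64_t> count{0};
};

std::map<std::string, Counter>& registry() {
    static std::map<std::string, Counter> counters;
    return counters;
}

void collection_loop(std::chrono::seconds interval) {
    for (;;) {
        std::this_thread::sleep_for(interval);
        for (const auto& [name, counter] : registry()) {
            // The heavyweight work (formatting, I/O) happens here, once per
            // collection interval, not on the hot path.
            std::clog << name << ' '
                      << counter.count.load(std::memory_order_relaxed) << '\n';
        }
    }
}
// On the hot path you would look a counter up once, keep a reference to it,
// and only ever do the cheap fetch_add.
```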
This is quite heavyweight work. Maybe we're formatting the metrics into text, maybe we're sending them
over some network socket or writing them to a file,
but it doesn't matter because we were only doing it
every time we pull the data. We're only doing it every five
seconds or every 10 seconds. That increment, the thing
that we actually put in the critical path of our
code is still extremely cheap. So having these counters
doesn't incur too much overhead. I'm sure some of you are thinking though, I can definitely do better than this. I know you're thinking it. It sounds like a really
interesting problem to really optimize that increment, to get it as fast as possible. Well, yeah, other people
have thought about it too and they thought it was really interesting and they wrote papers on it and they tried to standardize it. So if you're really interested
in how you can write really efficient counters,
there's a good paper for you to go and read. Let's look a different example. What about if we wanted to
count the number of times we read a line from a file. This is a much more frequent occurrence than just reading a file in its entirety. So this loop is critical performance wise, but it's still fairly heavyweight. We're pulling out a line from a file, processing it some way,
parsing it, and so on and so on. But this still makes it a good candidate for adding a metric to, because the increment relative to what you're doing is fairly lightweight.
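A sketch of what that loop might look like with a counter added; parse_line and the counter name are placeholders:

```cpp
#include <atomic>
#include <cstdint>
#include <fstream>
#include <string>

std::atomic<std::uint64_t> lines_read{0};  // hypothetical "lines read" metric

inline void parse_line(const std::string&) { /* real parsing work here */ }

void process_file(const std::string& path) {
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line)) {
        // One relaxed atomic increment per line: cheap relative to the
        // parsing work, but still a nonzero cost in a hot loop.
        lines_read.fetch_add(1, std::memory_order_relaxed);
        parse_line(line);
    }
}
```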
However, it's still a nonzero overhead, and there will be situations where the cost of incrementing a counter
is still an overhead to your operation, so
we have to think about is the information we're
getting valuable enough to warrant slowing down the code. Let's look at the data that we might get out of this counter. Say we're pulling it every five seconds as our application
starts up, counter zero, nothing's happening, and then
something starts happening and we start processing data,
the numbers start going up. This is meaningless. You can look at that and
really infer nothing other than the number went up a bit. What we really want to do
is visualize it of course, and now we get to see some
really interesting things. We can see roughly when processing starts and when the processing finishes. These flat areas are
where nothing's going on. And we can see roughly the
number of lines that we've read through each file, when this leveling off occurs. What's even more interesting
is when we post-process the output from this
metric and for example, graph the rate that the
counter is increasing at. Now it becomes even more obvious when our operation is
starting and stopping. We can see the first one roughly takes 40 seconds and we can see very clearly that two operations
occurred and we can see the throughput, so we can
actually see the performance of our processing loop on the graph. And even better what we can do is we can see if the performance changes throughout the processing,
and this is something you wouldn't typically see
if all you do is collect the time your operation took and the number of lines you processed. You would get an average
over the whole operation. What you see with this counter is you see if the rate changes. So this excites me a lot. If we tag our metrics in a nice way, we can correlate what's
happening depending on different dimensions in our system. If we have multiple users, we
can see that one of the users in our system, when they begin requesting, affects the performance of the other one. And we can do other processing
to these numbers as well. We can graph the sum of
the rate for all the users in our system, and now
we learned even more. We learned that there's some sort of startup period with slower performance, and then after some time, we see the performance increase, so maybe this is an
effect of file caching. Once you've read the file
once, it gets stored in memory and so it's faster to
process it a second time, and you can see that
there's some sort of limit. So even though we have
two users running requests in parallel, there's some sort
of ceiling to our performance and all of this information
comes from adding just one counter to your loop that's doing your interesting processing. I think this is pretty cool. And we can go one step further with our metrics and our pretty graph. We can put this side by
side with our tracing data that we found earlier, and
now we start to fill in even more gaps in our knowledge. Just looking at the metric,
we don't necessarily know whether this is two distinct operations or whether it's one operation
that happened to dip in performance very drastically, but if we look at the
tracing data alongside it, we can verify that in fact, it
was two distinct operations, and from the metrics, we
can see the performance within the operations. I think this is a really valuable lesson to take away. Just very simple additions to your code can tell you so much information, but you do have to put a
bit of thought into it. It's not effortless. So, we're at the end. Nearly time for some coffee. I'm desperate for some coffee. What am I trying to say? Develop observable software. Think back to that chip I
talked about at the start. We had to make sure
that chip was observable so we could work out what
happens when it goes wrong. And I think there is a huge advantage in doing this with software as well. Try to debug in development
as you would in production so when something goes
wrong in production, you don't need to install your debugger on your production server. You don't need to install tracing tools or some other form of instrumentation. Your software is
monitorable and observable. Your software becomes its own debugger. But we have to take into account that while there is a lot of
information to be had there, there are overheads in doing it, but it doesn't have to
be expensive as long as you're mindful of where you
use different techniques. We can log and we can log errors because we have to and we should. We can trace things at a
very coarse granularity that happen infrequently, and
then in our more hot loops, we can think about adding
some metrics instead so that there is less overhead incurred. There are always trade offs, and the techniques here
are very complementary. You use them together, and this includes other
types of instrumentation. Just because you're adding
some instrumentation to your code, to your software,
doesn't mean you can't still use your debugger
or tools like strace or other instrumentation tools. That doesn't mean that you
can't use your compiler to add instrumentation as well. But with source
instrumentation you can choose what information to expose that is useful to your particular domain. So if you're writing some sort of video processing framework,
maybe you're interested in counting things like the
number of frames processed. If you're building a database,
maybe you're interested in the number of times a table is accessed or a row in a table is accessed. This is all possible when
you actually think about what instrumentation and what information you want out of your code. So with that, I'm gonna
thank you for coming and I hope you enjoy the
rest of your conference. (audience applauding) We have 10 minutes for questions
and there are microphones if anyone would like, or
I am around until Friday so please feel free to
come and talk to me. Yes. - [Man] So I've had to look
into tracing quite a bit myself and the availability of C++
libraries is poor I think. - I'm sorry, could you just speak a little bit closer? - [Man] The availability
of C++ libraries for tracing is poor at best, I think. And I'm not really aware of any metrics libraries like this. What's your experience? Have you got any suggestions?
there any specific examples of metrics libraries and
tracing libraries for C++. I left this out of the
presentation sort of on purpose because, as the gentleman says, while there is choice, there's no sort of de facto standard. And this is even
true for logging libraries. There are thousands of logging libraries. Every framework has its
own logging library. Every company I've worked in has their own internal logging library; there's no standard for it. But to directly answer your question, the infrastructure I've used the most is a piece of software called Prometheus, and it has a number of C++ clients which let you expose
metrics to Prometheus. So the problem is that these libraries aren't standard and they're often specific to the tool you're using,
which is unfortunate. And for tracing, there is an evolving open source project called OpenTracing, and that
links into a piece of software called Jaeger, which is a
distributed tracing system and that has C++ clients. So those are two things you could look at. But this is an interesting point. I think we could do a lot
better to try and evolve some libraries, maybe not
necessarily in a standard context, but at least as a community
where there are tools which become the de facto standard. So if we use a library from
here and the library from here, we write some code, we
can use a common library to introduce metrics and traces. Yeah. Yes sir. - [Man] Yeah, I guess all three
of us had the same question. Can you say a little more
about this Jaeger library? Does it work with just
other processes across C++ or does it work cross language
as well as cross process? So in my company, we use a scripting language in addition to JavaScript,
in addition to C++, and having visibility across
all three would be great. - So the question was specific to Jaeger, and I guess perhaps generally
with tracing clients, is there any cross language
clients we could use so you can collect tracing information from different parts of your stack. So the answer to that is yes,
Jaeger specifically supports pretty much every language. Jaeger is an implementation
of the OpenTracing standard, and the OpenTracing standard, this OpenTracing has lots of clients for all different languages,
so you could put some traces in your C++ code that
actually call operations in a different language
and then still link those traces together. So I think you mentioned JavaScript. I'm not 100% sure, but
I would be surprised if there wasn't the JavaScript client. There's definitely things
like Python and Ruby and anything like that. (audience member speaking unintelligibly) - [Man] Okay. Alright. I guess just as a comment, we
effectively had to hand roll our own instrumentation library. And for those who might use Intel's TBB, they do have an enumerable
thread-specific storage counter. It kind of does the atomic
thing you were talking about. - Yeah. - [Man] Yeah, so just as an FYI. - Yeah. - [Man] So thank you. - Any other questions? Yes sir. - [Man] Can you just spell Prometheus, how is it spelled so
that I can look it up? - The Prometheus, what is... - How do you spell Prometheus? P-R-O-M etheus. (laughing) It's the same as the film,
the alien film, Prometheus. It's a great metrics collection software. I quite like it myself. Yes sir. - [Man] Hey, I think I heard you say that you should be careful about
not putting instrumentation in everywhere. - Yeah. - [Man] And while it's
easy to agree with that, I think the recommendation
should be instrumentation is a key functionality of your software. You cannot operate the software
without instrumentation and you should put it
everywhere where it's needed, just like basic functionality
of the thing, right. If your thing is supposed
to compute a square root, you compute the square root. - Right. - [Man] If you're supposed
to run the software, you instrument it. - So the comment was
maybe the advice should be you instrument as much as you can and especially things that are important. Is that right? - [Man] Well, in a
former life I was an SRE, and the advice I would give any developer is you define the SLOs
and SLAs for the software that you're building and then
you build the instrumentation that you need to monitor
for those SLOs and SLAs. Period, right. And you want to instrument
a little bit more than that so you can troubleshoot and find out why you're in violation of your SLOs, but you start from what are
the properties of the software that you want to maintain or the system, and then you put all the
instrumentation you need to be able to maintain
those properties, right. If that means you have to buy an extra CPU in AWS or Google Compute Engine in my case, then go ahead and do it.
good point in that sometimes the instrumentation is critical or even a requirement of your software and so any overhead has to be acceptable and you just have to deal with-- - [Man] It's not overhead, right. Overhead is stuff that you don't need. In this case, instrumentation
is something you need or it's hard to have the
functionality of the system. - So this talk is actually
about half the size that it started out as, and I had
a section about exactly this. It went into much more detail about the trade-offs between different types of instrumentation, how much overhead is acceptable, and when
you might use different things. When you need to use instrumentation because it's a requirement
from a customer or you have, for example, an SLA, then
how can you make a trace as efficient as possible. Unfortunately, that turned
out to be far too much to cover in one go. Thank you for your question. Yes sir. - [Man] I was gonna
basically agree, but I think what's interesting with logging, at least in my experience, is that you put logging where you need it to solve a problem. You don't put it in for the future problems you're going to have. So it's usually too late. And I wonder whether tracing,
I suppose your comment is you put it everywhere
or for the SLA or however, but it's a very difficult problem for knowing exactly
beforehand where to put it. I think it's an interesting
topic for people to think about. So you could do it at kind of a business level. So you say, right, okay, we'll put logging, tracing and metrics on
business terms if you like and not do it any lower. And this also depends on
what you're using them for. Are you using them to monitor
or as a debugging tool? Logging is often more a debugging tool whereas tracing is more
monitoring and metrics, I think. - So the summary of that
comment I think is that logging is often more geared towards things that you absolutely need to record, like errors and tracing, and specifically, the gentlemen said, errors
are logs of the things that you know you need
to record whereas traces are typically more for things
you think you might need to record for future issues
that you don't know about. And yeah, I completely agree. I think the more you
start to look at tracing as something distinct from logging, you realize that logging
is really just for errors. It's about adding context to
errors, maybe warnings as well, but that's really all you
should be using logging for and if you have some customer
requirement that you produce a log file in the format, then of course, you need to do some logging. But I think if you generally
all this sort of info logging, the debug logging, we
should be thinking about in a more structured way, perhaps looking at using
some tracing instead. And one of the ideas
of this talk was to try and start a discussion because
I think in the C++ community, we overlook a lot of the problems of actually running C++ in production. It's quite hard work. And we worry too much about angle brackets and syntax and the newest
feature of lambdas. So with that, my time is
up, so thank you for coming. I'll see you around. (audience applauding)
As a sometimes hobby programmer, I found the logical progression very easy to follow. Very welcome. I have a much better idea of how to trace my execution now!
Does anyone know the names of the libraries suggested during the Q&A? I didn't understand the names.
Does anyone know if there is a slidedeck available for this talk?