Getting Started with OpenTelemetry - Ted Young, Lightstep

Captions
Hi, my name is Ted Young, and my pandemic haircut is a hat. Today we're getting started with OpenTelemetry.

Okay, but what even is OpenTelemetry? Come to think of it, what's telemetry? The Cambridge Dictionary defines telemetry as "the science or process of collecting information about objects that are far away and sending the information somewhere electronically." OpenTelemetry is an observability platform: a set of well-factored components that can be used together or a la carte. OpenTelemetry collects up a variety of observations, with distributed tracing, metrics, and system resources being the most important. Rather than treating these as separate signals, OpenTelemetry braids them together and provides the indexing and context that allow you to aggregate and cross-index all of these signals on the back end. In addition to data collection, OpenTelemetry provides a data processing and pipelining facility that lets you change data formats, manipulate your data, scrub it, and so on: all the tools you need to build a robust telemetry pipeline in a modern system.

Let's do a quick overview of the major software components that make up OpenTelemetry. Inside every service in your deployment you install the OpenTelemetry client; we refer to the client as the SDK. The SDK in turn has an API, and your applications, frameworks, and libraries use this instrumentation API to describe the work they are doing. The SDK then exports the collected observations to a data pipelining service called the Collector. OpenTelemetry comes with its own data protocol, OTLP, but the Collector can translate OTLP into a variety of formats, including Zipkin, Jaeger, and Prometheus. Notably, OpenTelemetry does not provide its own back end or analysis tool. That's because at the heart of OpenTelemetry is a standardization effort: the goal is to come up with a universal language for describing the operations of computers in a cloud environment. The goal is not to standardize how we analyze that data, because standardizing how you analyze things isn't really a thing. Instead, we hope OpenTelemetry will push the world of observability forward by allowing new analysis tools to get started quickly, without having to rebuild this entire ecosystem of telemetry software.

Speaking of software ecosystems, how does OpenTelemetry keep track of all of this code? In order to ensure that different implementations remain consistent with each other and continue to interoperate, OpenTelemetry is designed via a specification. The specification is a language-neutral document that describes everything you would need to build OpenTelemetry.

Before we dive into the details, I do want to point out that it's very easy to install OpenTelemetry. At the time of this recording I recommend four languages as production-ready betas: Java, JavaScript, Python, and Go. In all four of these languages there's an easy quick start guide that I've written over at opentelemetry.lightstep.com. We have easy installers; it's usually just a line or two of code, or in some cases even a command-line configuration, to get everything installed and going.

Okay, let's show how easy it is to get started by trying it in Java. I've got my Pet Clinic app here (this is just a Java sample application), and I'm going to boot it up so you can see what it looks like: java -jar, pointing at the target directory. That starts the application up; it's Java, so it might take a bit. There we go. If I go here and take a look, there it is, there's Pet Clinic. Click around, very nice. Okay, let's kill that app. Now, I've already downloaded the OpenTelemetry Java agent. It's wrapped in a little launcher that makes it easier to work with Lightstep, but it's basically just the Java agent. If we attach it to our previous command, the application should boot up again, but this time running OpenTelemetry.
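(For reference, the attach step is roughly the command below. The jar file names are illustrative placeholders, not the exact files from the demo; substitute whatever agent or launcher jar you downloaded and your own application jar.)

```
# Same java -jar command as before, now with the OpenTelemetry Java agent
# (here, the Lightstep launcher flavor of it) attached via -javaagent.
# Jar names are illustrative; use the files you actually have.
java -javaagent:./lightstep-opentelemetry-javaagent.jar \
     -jar target/spring-petclinic.jar
```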
When it starts up, we can see some OpenTelemetry messages flying by, so we know it's been installed. It's a little bit noisy because it's still in beta. Now that we've started up, have a look: the app is still working. Find some owners, run some errors, make some more errors happen, add some people. Cool. And this exception you can see because I actually triggered an exception in the app. Since all of this is hooked up to my account in Lightstep, if I go over to Lightstep and have a look at our Explore page, boom, I can see all this stuff coming in from our example Pet Clinic app. Okay, so that's the basics; that's how you do it in Java.

In this talk we're going to cover OpenTelemetry basics, starting with what it is we're actually trying to observe, the fundamentals of how OpenTelemetry approaches observation, and the basics of getting all of this stuff set up and deployed in your actual system.

Okay, let's look at a quick example so we understand the kind of transactions we're talking about. Imagine you have a mobile client that wants to upload a photo and a caption. This client will make a request to a "server," but that server of course will be a lot of servers. There will be a reverse proxy sitting in front of the app that will shunt the request off to an authentication server, then take the photo and upload it to scratch disk, and once all that's done it sends the request on to your application. The application then uploads the file, after processing it or whatever it's supposed to do, to cloud storage: S3 or some equivalent. We then store the URL to the image, along with the caption, user data, and anything else important, in our data service. That data service is another web application, which in turn sits in front of a Redis cache and a SQL store of some kind. Why is it built like that? Who knows; someone said build it this way and someone else said okay. This is our app. But look at this thing: this is the most basic app ever. I feel like I've been looking at LAMP stacks and other things that more or less look like this application for at least 20 years, and these applications are almost as annoying to observe today as they were back then.

That view is more of a service-map approach to looking at this transaction. Let's change gears and look at the same transaction from the perspective of a call graph. In this graphic, each line represents an operation, the length of the line represents how long that operation took, and we can see how these operations are connected to each other via network calls. Now, as an operator, there are a number of things we care about when we're observing our system. First of all, we care about latency. "Why is it slow?" is such a cranky question to answer, so when we look at our transactions we want to know where the time went: specifically, which sub-service was actually spending the time, and where we spent time waiting. Next, of course, we care about errors. When a transaction is failing, we want to know which operation, which component, and which service actually had the problem. Now, to root-cause your latency and your errors, you're going to need some additional information.
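(The demo above uses the auto-instrumentation agent, so no application code had to change. For reference, here is a rough sketch of what describing an operation, its timing, and an error looks like with the OpenTelemetry Java tracing API; UploadHandler and storePhoto are hypothetical application code, not part of OpenTelemetry.)

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class UploadHandler {
  private static final Tracer tracer =
      GlobalOpenTelemetry.getTracer("petclinic.manual-demo");

  void handleUpload(byte[] photo) {
    // Each unit of work becomes a span: it records when the operation started,
    // how long it took, and which operation a failure actually belongs to.
    Span span = tracer.spanBuilder("upload-photo").startSpan();
    try (Scope scope = span.makeCurrent()) {
      storePhoto(photo); // hypothetical application logic
    } catch (RuntimeException e) {
      // The error is attached to the specific operation that failed.
      span.recordException(e);
      span.setStatus(StatusCode.ERROR, "photo upload failed");
      throw e;
    } finally {
      span.end();
    }
  }

  private void storePhoto(byte[] photo) { /* ... */ }
}
```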
The first kind of information is fairly obvious: you need to look at events. These are called logs in other systems; we call them events, and it's the same thing. What is the sequence of operations that were involved in this problem, or in this success? Besides looking at latency, errors, and events within the context of a single transaction, we also want to compare them across transactions. Knowing that a critical error correlates highly with a particular host, a particular project ID, or a particular route is critical information when you're trying to form a hypothesis and root-cause your system.

The real issue, of course, is scale. As your system grows and grows, the number of logs grows and grows, and the percentage of your logs that are relevant to any particular transaction or issue shrinks and shrinks. After your system reaches a certain size, it becomes impossible to paw through your logs by hand. The ability to contextualize and index all of these events is OpenTelemetry's killer feature. And how does OpenTelemetry achieve all this awesome indexing? That's right: context propagation.

Context propagation is the core concept behind OpenTelemetry's architecture. If you can understand context propagation, then everything else going on in OpenTelemetry will fall into place. So how does context propagation work? Imagine we have two servers connected to each other via a network request. All of OpenTelemetry's indices and other transactional data are stored in an object called the context. This context object follows the flow of execution through the program. When a transaction moves from one service to the next via a network call, all of those key-value pairs must come along as well. Sending along the contents of the context object as metadata on the network request is called propagation. On the client side, the contents of the context object are injected into the HTTP request as HTTP headers. Then, on the server side, the same values are extracted from the HTTP headers and deserialized into a new context object, which continues to follow the transaction through the new server.

So what do these context propagation headers look like? There's a variety of them out there in the wild, but the two that OpenTelemetry is focused on are being created through the W3C tracing working group, so these are official HTTP headers for distributed tracing and context propagation. The first header is called Trace Context, and it has two header fields: the first is traceparent, the second is tracestate. Traceparent contains several specific IDs that are important to tracing. First of all, it contains the trace ID. The trace ID is the transaction ID: this is the ID that's going to be stapled to every event and operation in your system. The next ID is the span ID. This ID represents the parent operation of your current operation; every event occurs not only within a transaction but also within an operation, and those operations have unique IDs. There's also a sampling flag to check whether sampling has been enabled. Then there's the second header, tracestate, which honestly you don't really need to worry about; it's just for internal details that tracing systems share with each other. The really important part is that by standardizing these headers through the W3C, we can get everyone to agree on which headers we're actually using and which value types we're actually using, and that's going to be really important for something that depends on interoperation as much as distributed tracing does. Okay, so those are the tracing-specific headers.
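(With auto-instrumentation you rarely touch this directly, but here is a small sketch of the inject side of propagation using the OpenTelemetry Java API, just to make the mechanics concrete. It assumes the W3C Trace Context propagator is the one configured globally; the Map carrier stands in for an outgoing HTTP request's headers.)

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapSetter;
import java.util.HashMap;
import java.util.Map;

public class PropagationExample {
  // Writes each propagation field into the outgoing "request", here a plain map of headers.
  private static final TextMapSetter<Map<String, String>> SETTER = Map::put;

  static Map<String, String> outgoingHeaders() {
    Map<String, String> headers = new HashMap<>();
    // Serializes the current context into the carrier. With the W3C propagator this adds
    // a header shaped like:
    //   traceparent: 00-<trace id, 32 hex chars>-<parent span id, 16 hex chars>-<flags, 01 = sampled>
    GlobalOpenTelemetry.getPropagators()
        .getTextMapPropagator()
        .inject(Context.current(), headers, SETTER);
    return headers;
  }
}
```

On the receiving side, the matching extract call reads those same headers back into a fresh context object, which is how the trace keeps flowing through the next server.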
There's another set of headers called Baggage, and baggage is for arbitrary context propagation: literally, baggage headers can carry arbitrary key-value pairs. The entire point of baggage is just to propagate context from service to service. A good example of a piece of baggage you might like is a project ID. Perhaps early on in the transaction you gain access to a project ID, and while later services in the transaction may not directly have access to that project ID, you may still want to index some of their operations or metrics by it. In that case you can throw the project ID into baggage, pull it back out of baggage later, and attach it to your operations and metrics. And that is the power of OpenTelemetry.
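(A small sketch of that project ID pattern with the OpenTelemetry Java baggage API. The "project.id" key and the surrounding methods are hypothetical application code; for the value to actually cross service boundaries, the baggage propagator has to be enabled and the outgoing calls instrumented, which the agent normally takes care of.)

```java
import io.opentelemetry.api.baggage.Baggage;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.context.Scope;

public class BaggageExample {
  void earlyInTheTransaction(String projectId) {
    // Stash the project ID in baggage so it rides along with the transaction.
    Baggage baggage = Baggage.current().toBuilder()
        .put("project.id", projectId)
        .build();
    try (Scope scope = baggage.makeCurrent()) {
      callNextService(); // hypothetical: instrumented clients propagate the baggage header
    }
  }

  void muchLaterInAnotherService() {
    // Pull the value back out and use it to index the current operation.
    String projectId = Baggage.current().getEntryValue("project.id");
    if (projectId != null) {
      Span.current().setAttribute("project.id", projectId);
    }
  }

  private void callNextService() { /* ... */ }
}
```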
So we're getting close to the end; let's do a recap. If you want to get started with OpenTelemetry, the first thing to do is audit your system and cross-check whether the languages you're using are ready to go in OpenTelemetry. As I mentioned before, the four most production-ready languages are Go, Python, Java, and JavaScript. Erlang is also getting ready to go, and there's a whole bunch of other languages waiting in the wings, but it's best to go say hi to the implementation working group, or otherwise kick the tires, before rolling this stuff out into production, because it really is still in beta. If you are interested in Go, Java, JavaScript, or Python, check out opentelemetry.lightstep.com. That's where I'm putting together all of the guides, getting-started material, helper functions, bootstrappers, everything to get started quickly with OpenTelemetry that isn't currently baked into core (though I would kind of like it to be). opentelemetry.lightstep.com is just sort of where I'm hacking, trying to push the edges of understanding how OpenTelemetry works today and how it should work tomorrow, so there'll be a lot of great content getting added there soon. If you want to know when new content gets posted, you can follow me on Twitter; I'm @tedsuo. Otherwise, come check out OpenTelemetry. We have a lot of meetings; every day of the week there's an OpenTelemetry meeting, and you can find them all on our calendar. Check out our Gitter if you want to come in, chat, and say hi; we're on Gitter all the time. Otherwise, please kick the tires, give it a shot, and give us feedback. We hope you get involved. Thank you.

Hello hello, my name is Ted Young; you may remember this shirt from the video you just watched. Cool, so this is my first time on Hopin. It's not entirely clear to me whether people can use voice; it looks like people can only use chat for feedback. Hi David, good to see ya. So, first of all, any questions about OpenTelemetry in general? There are no dumb questions; they can be super basic or super specific, technical questions, and I'm happy to answer any OpenTelemetry-related questions at this time. And if we don't have a lot of questions, I actually have questions. I'm trying out a new presentation style, and this is my first time trying to put it together, so it's definitely a bit of a rough draft; I'm curious how it went over. If there are aspects of the presentation that you liked, things that you didn't like, or anything that was confusing, tell me here or DM me. I'd love to get feedback.

So Jonathan Molina asks: the Java library I spoke about, is that specific to the Lightstep service, or can it be used with something like Jaeger or Stackdriver? It can actually be used with Jaeger or Stackdriver. To explain what the OpenTelemetry launchers are: they're just an OpenTelemetry distro. In other words, if you use OpenTelemetry today, the setup code in particular has a fair amount of boilerplate; it hasn't really been packaged up into something that looks nice. So all the launchers are is a pre-packaged version of OpenTelemetry. In particular, there were some settings we needed to touch, like gRPC headers and such, so it was just a little ugly to set up without wrapping it up. That's what the launchers are. You can crack open OpenTelemetry and add any configuration you want, so there's no reason you can't use the launchers with Zipkin or Jaeger or any other system, and if there's a feature you think they're missing, we'd love to get pull requests. Eventually I hope to get something like the launchers baked back into OpenTelemetry, so have a look at those APIs and let me know what you think.

Cool. So David mentions that there used to be some competing standards, like OpenTracing, and asks if I can help clarify the difference between OpenTelemetry tracing and those competing standards. Okay, so when we talk standards, there are a couple of different pieces people may be talking about. One piece is context propagation. Like we were talking about in the video, there is a protocol for sending context from one system to the next as metadata on a network call, and for HTTP you're doing that as HTTP headers. Now, there are several headers out there in the wild. The most popular is probably the Zipkin B3 headers; in terms of what's in operation today, those Zipkin headers are probably the most common. And of course every single tracing system has its own custom headers, so it's a bespoke landscape. That's where this W3C project originated from: so that we could have an official standard within the HTTP spec rather than a de facto standard. Besides those HTTP headers, the other place where maybe there are competing standards is that there are certainly competing implementations of distributed tracing, and I think that's totally fine. Two projects that were maybe a little too similar were OpenTracing and OpenCensus. That was basically everyone in the distributed tracing community who really wanted to work on a particular project in a particular way, which was standardizing the language we use to describe systems. And since we had these two projects that were nearly the same, both trying to create a standard, it was starting to turn into that xkcd comic where you end up creating 17 standards. So we ended up merging those two projects, and that's actually where OpenTelemetry comes from: it's sort of the version 2.0 of OpenTracing and OpenCensus.

David asks whether OpenTelemetry supports OpenTracing. That's correct: OpenTelemetry is 100% backwards compatible with OpenTracing. We really care a lot about backwards compatibility, not just between OpenTelemetry and OpenTracing but also with any future version of the OpenTelemetry APIs. Part of the reason the APIs are separated from the SDK is to limit the surface area there, so we understand specifically what backwards compatibility looks like for instrumentation code. It's really a big priority for the project.
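(The OpenTracing compatibility mentioned above is delivered through a shim. A minimal sketch, assuming the opentelemetry-opentracing-shim artifact is on the classpath; the exact factory method has shifted between releases, so treat this as illustrative rather than exact.)

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.opentracingshim.OpenTracingShim;
import io.opentracing.Tracer;

public class ShimExample {
  static Tracer openTracingTracer(OpenTelemetry openTelemetry) {
    // Existing instrumentation keeps calling the io.opentracing API,
    // but the spans it creates are recorded and exported by the OpenTelemetry SDK.
    return OpenTracingShim.createTracerShim(openTelemetry);
  }
}
```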
Yeah, no problem, David. I really wish they let other people speak; it's kind of weird. Anyway, I suppose I'm going to hang out here for another seven minutes, or until they kick me out, so if you have any further questions please add them to the chat. Out of curiosity, is there anyone in the room who's currently using OpenTelemetry, or who's tried it?

Ah, okay, David's got a good question: how does this all tie in with log monitoring, analytics, APM, and so on? Is OpenTelemetry sort of the future consolidation of all these older disciplines and capabilities into a single pane of glass? Sort of. It really is about having the right context and the right indices; that's, to my mind, the primary difference between OpenTelemetry and pre-existing logging and APM solutions. Obviously all of the data within OpenTelemetry is structured, so it's better than unstructured logs. But compared to, say, a structured logging system, the distributed tracing in OpenTelemetry isn't really any different, except that you have all these amazing indices. You have a transaction ID, and without some form of context propagation there's really no way to get a full transaction ID. You can get request IDs that cover one hop to the next, but if your system really starts to sprawl, you find you end up having to chain together a whole set of IDs just to see what your transaction looks like. Having access to those extra indices is really what pulls distributed tracing ahead of run-of-the-mill structured logging, in my opinion. Likewise for metrics: the OpenTelemetry metrics API, I think, is actually really interesting. I think it's a step above what we've seen in the past, and it's definitely scratching some itches, but the key thing is that you can index your metrics with a set of correlations that are much more interesting than what you could do in other systems, where you didn't have context propagation or access to these different contexts. So that's the main difference. For example, an APM system without distributed tracing can give you a lot of information about one service and what that service is doing, but it becomes hard to piece together a complete picture of your system once your transactions start to involve two, three, four services.

David asks: so is it the glue that allows all these sides of the prism, all these ways of observing the operation of services, to produce better insights? Yes, exactly. It's really about improving data quality; that's why OpenTelemetry is really a telemetry system, with no analysis portion. The idea is just: can we cross-index all of these different streams of data and produce a canonical data format that has all of it in it, so you can consume it as a fire hose? And that's kind of novel. In the past, systems tended to be built around one of these pillars, so there wasn't much point to a combined data format that had logs and metrics and everything in it. But now that systems are growing to be more like actual full-on observability platforms, where they can ingest a wide variety of data and find interesting correlations and insights across these different data streams, I think that's what OpenTelemetry is enabling. And not just for existing vendors: I think it's going to make it really easy for people to experiment and build one-off forms of analysis, because one of the big barriers to entry is having to build this whole honking telemetry system, and all of these integrations and instrumentations, just so you can experiment with a particular form of analysis. Not having to rebuild any of that is really one of the things that I think will accelerate development of analysis in the observability space. That's actually what I'm excited to see in the next couple of years: all the crazy things people start building on top of OpenTelemetry.
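(To make the "index your metrics with correlations" point concrete, here is a small sketch using the current OpenTelemetry Java metrics API, which has changed since the beta era shown in the talk. The instrument name, the "project.id" key, and the fallback value are hypothetical; the idea is simply that a value carried in baggage from earlier in the transaction can become an attribute on a metric.)

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.baggage.Baggage;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class MetricsExample {
  private static final Meter meter = GlobalOpenTelemetry.getMeter("demo.metrics");
  private static final LongCounter uploads =
      meter.counterBuilder("photos.uploaded").build();

  void recordUpload() {
    // Index the measurement by a value that rode in on baggage from an earlier service.
    String projectId = Baggage.current().getEntryValue("project.id");
    uploads.add(1, Attributes.of(
        AttributeKey.stringKey("project.id"),
        projectId == null ? "unknown" : projectId));
  }
}
```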
I hope that answers your question. Cool beans. We're at 12:30, which I believe is the end of this session, so I'm going to log out of the video chat now. I think we're done. Thank you, everyone; it was super awesome. Please DM me on Twitter if you want to continue the conversation. All right, have a good one, y'all.
Info
Channel: Continuous Delivery Foundation
Views: 2,411
Id: 1vMu7iskQaY
Length: 28min 16sec (1696 seconds)
Published: Mon Oct 19 2020