OpenTelemetry Deep Dive: Golang

Captions
All right, let's do this. Welcome to getting started with OpenTelemetry. Let me get rid of these questions, and then let's go.

So, OpenTelemetry, super high-level overview: why did we even call the project OpenTelemetry? Telemetry is an existing word, and what telemetry means is generating and transmitting instrumentation signals about a remote object or process, and sending them somewhere they can be analyzed. That really describes the boundaries of the OpenTelemetry project: it's about the generation and transmission of these telemetry signals. We believe this is the part of observability that can be standardized, and that everyone in the field would benefit from working together to standardize how programs describe themselves and how programs are observable in a way that's somewhat language-neutral. I want to be able to look at my distributed system, which is a polyglot system made of all different kinds of components, and see some kind of coherent story for what it's doing. That's really the bread and butter of OpenTelemetry.

Let's go straight into the big pieces of how the project is laid out. Say you have a process, represented by this green circle here, and you're going to install OpenTelemetry in it. The first thing you install is the SDK. The OpenTelemetry SDK is the implementation part of OpenTelemetry, the framework part. When you start your program up, the SDK is the part you configure; this is where you install plugins and lifecycle hooks. You only want to do this during program setup, though, because all of these details are under-the-hood details. We want OpenTelemetry to be flexible and to work in a lot of different ways, so there are details we don't want to expose to instrumentation or to the application: what kind of headers you're using, what exporter format, and so on. Because all of those things are flexible, we don't want people to depend on OpenTelemetry working in one particular way. The SDK is where you set all of that up, but once you're done with program initialization, generally speaking you should not be touching the SDK anymore.

In your actual application code, where you're doing instrumentation, you want to touch the API. The API layer is kind of the opposite of the SDK: it contains no implementation. It's almost entirely made up of interfaces, plus some constants and other tools to help with data standards. In fact, it can support multiple implementations: besides using the SDK as a framework for plugging in different components, you could toss out the entire SDK and plug an entirely different SDK in under the API. It's truly taking the concept of loose coupling between large components and using it to get a clean observability API, and that's really important for something like observability, because observability is a cross-cutting concern. All of this observability code gets mixed in with all of your other code, and that's a hallmark of where loose coupling is helpful: you don't want to tie all of that code to the specifics of one implementation.
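To make that split concrete, here's a minimal sketch in Go of how the two layers are used. It assumes the current opentelemetry-go module layout (package names and options have shifted a bit since the beta shown in this talk), and the instrumentation name and work function are placeholders.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// SDK side: configured once at startup. Exporters, samplers, and other
// under-the-hood choices live here and nowhere else.
func initTelemetry() *sdktrace.TracerProvider {
	tp := sdktrace.NewTracerProvider( /* exporters, samplers, resources... */ )
	otel.SetTracerProvider(tp) // register the SDK as the implementation behind the API
	return tp
}

// API side: application and library code only touches the otel API, so it
// keeps working no matter which SDK or exporter was wired up above.
func doWork(ctx context.Context) {
	tracer := otel.Tracer("example.com/walkthrough") // hypothetical instrumentation name
	ctx, span := tracer.Start(ctx, "do-work")
	defer span.End()
	_ = ctx // application logic would go here
}

func main() {
	tp := initTelemetry()
	defer tp.Shutdown(context.Background())
	doWork(context.Background())
}
```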
By the way, here's an example of multiple implementations and why you might want this. A classic example I like to use is a C++ implementation, or these days maybe even a Rust implementation: in some languages, especially dynamic scripting languages, rather than using a native implementation it may be a lot more efficient to have an SDK that is entirely foreign-function calls into the C++ OpenTelemetry implementation. Having the API and SDK separated like this allows that kind of flexibility going forward, to come up with better, more efficient ways of doing this stuff.

Okay, so the next major component is actually not a component of OpenTelemetry; it's all of the code in your system. We'll get into this, but there's an aspect of distributed tracing which requires everything to be instrumented, because you're trying to have everything contextualized properly. With logs, or with just metrics, you can emit a standalone log line or a standalone metric; there's really no such thing as standalone tracing. So when you go to set up your program, you really need to look at all of the libraries and frameworks you use to manage network connections and control flow. The most common one is your framework, usually a web framework that all of your application code plugs into; that framework holds a huge amount of information and does a lot of work on your behalf, so getting it instrumented gives you a large amount of coverage. You also want to look at frameworks that literally control control flow, for example Akka actors or any coroutine implementation: anything that swaps threads, moves work around, or changes which work is active needs to be integrated with OpenTelemetry in order for it to work. Likewise for network connections: if you have an HTTP client or a database client or anything like that, those are critical libraries to have instrumented, again not just to get information out of your system but to propagate the context. OpenTelemetry comes with a number of plugins for popular frameworks and libraries. Luckily, for the most part we share a lot of this code; people don't tend to write their own HTTP client just to make a website. People don't always write their own framework either, though they sometimes do. This is where the auto-instrumentation aspect comes in: we want instrumentation for all of these common components ready for people to use out of the box.

Okay, so that's everything happening inside your process: you've got your SDK and your API, talking to a bunch of instrumentation code, some of it in your application but most of it in your libraries and frameworks. What happens next is that data gets exported off of your system. The SDK has a plugin model, and it comes with a number of standard exporters. OTLP is OpenTelemetry's default protocol, the protocol we invented for OpenTelemetry, and it contains all of the data. Most existing protocols tend to be separated into verticals, so they carry tracing data or metrics data but not necessarily both; OTLP carries all of it, so it's the primary protocol for exporting data out of your system. But you can install an SDK-level exporter if you don't want to use OTLP: Zipkin, Prometheus, Jaeger, there are a number of standard formats built in. And as Ariel asked earlier, is it possible to create custom exporters? The answer is yes. This is all a framework system, so you can totally write your own exporter if you have your own data format that you want to use.
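As a rough illustration of that plugin model, here's what a toy custom exporter can look like against recent versions of the Go SDK. The exporter interface was still changing during the beta this talk uses, so treat the exact method signatures as an assumption; this one just prints span names instead of sending them anywhere.

```go
package main

import (
	"context"
	"fmt"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// consoleExporter is a toy SpanExporter: a real one would translate the
// spans into some wire format and ship them to a backend.
type consoleExporter struct{}

func (e *consoleExporter) ExportSpans(ctx context.Context, spans []sdktrace.ReadOnlySpan) error {
	for _, s := range spans {
		fmt.Printf("span %q trace=%s\n", s.Name(), s.SpanContext().TraceID())
	}
	return nil
}

func (e *consoleExporter) Shutdown(ctx context.Context) error { return nil }

func main() {
	// The SDK is where exporters get installed, via its plugin model.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(&consoleExporter{}))
	defer tp.Shutdown(context.Background())
}
```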
All of that data then gets sent to a separate process called the Collector. The Collector is the other major implementation component inside OpenTelemetry besides the language SDKs. It can run as a sidecar or as a gateway, and it's essentially a data processor: it's where you translate between data formats and massage your data in various ways. It's a really useful component. So that's the transmission part of the telemetry system: the API and SDK are the generation part, and the Collector is the transmission part.

One thing you will not find in OpenTelemetry, though, is analysis. The idea is that OpenTelemetry connects to a bunch of different analysis systems, but we don't want to replace them, so OpenTelemetry doesn't come with its own analysis system. By analysis system or analysis tool I mean something like Lightstep, Datadog, Jaeger, or the Zipkin backend: anything that stores all of this data and does something useful with it. The reason that's not included is that OpenTelemetry is a standards project. We want to standardize how we send this data, but there's not really such a thing as standardizing how we analyze it; analysis is where a lot of innovation is going on right now. The hope is that by getting everyone to standardize on what the data is and how we transmit it, we'll create a real boost in the realm of new analysis tools: if you want to build an interesting form of analysis for this data, you can just go do that now. You don't have to write all of these other components. Back in the day, if you were writing a distributed tracing system, you had to write the collector, the SDK, and all of the instrumentation, which isn't complicated code but is a huge amount of code. By standardizing all of that, we're really hoping to see analysis take off and something like a golden era of observability arrive. That's the high-level goal for the project.

Now, how we make all of this work in all of these languages, with some kind of unification, is something we call the spec. I'll walk through the actual GitHub organization so you can see it, but basically we have a language-neutral spec where we define all of these things, and from that spec we create implementations in all the different languages.

I see there's a question here: does the Collector equal the Lightstep Satellite? Good question, and no, they are not equivalent. For people who don't know Lightstep, we have something that sort of looks like a collector, called a Satellite. The difference is that the Satellite is actually part of the analysis tool. If you're running Lightstep with OpenTelemetry, you're still going to want to run something like a Satellite, or a hosted version that we run for you; one way or another, there's something like a Satellite involved when you're talking to Lightstep.
That's because the Satellite is part of Lightstep's database architecture: it's the live, in-memory, fast layer of our database, where you have access to all of your live traces, basically your complete live system. The Collector is not any of those things. The Collector is purely data transmission: you use it to send data from point A to point B, to buffer it, and to deal with network flakiness and components coming and going. It's also where you do data massaging, scrubbing sensitive data out of your telemetry, and so on. You don't do any of those things with the Satellite; the Satellite is really a database component. Hopefully that answers your question.

Okay, let's switch gears and do a GitHub walkthrough, just to give you a sense of the project. But first, any questions about what we just went over? I'll go back to the last slide so you can see it. (And GitHub has a capital H, my bad. I wrote Microsoft with a capital S in the middle the other day and got a look, so I'm always screwing that up.) It doesn't seem like there are many questions right now, but feel free to throw them out as we go.

So let's look at this project on GitHub. If you go to opentelemetry.io you'll find our website, but all the good stuff is actually on GitHub for the time being. If I pop up to the top level of the OpenTelemetry organization, there are a couple of pinned repositories that are super useful if you're just trying to get an understanding of the project. A really important place to start is community. The community repo is where we keep track of how we do things: if you want to understand who does what and how the project is set up, we have a governance committee, a technical committee, and various roles in the different projects like maintainers and approvers, and you can read about how all of that works there. We do a lot of meetings; we talk a lot on GitHub, but we also like to have every working group meet once a week, so you can find the OpenTelemetry calendar there. I highly recommend that calendar; it's really great to show up to these meetings if you're interested. We also have a set of mailing lists that we basically don't use, since we're really not very email-focused, but they're there if you want them. Mostly we hang out on Gitter when we want to chat, and there are a couple of Gitter rooms listed there. The easiest way to get involved in the conversation and ask any kind of question is to hop onto Gitter, find the room for the project you're interested in, and ask there. We're on Gitter all day, so that's really the best place to go.

Okay, so that's how you learn to get involved with the community. How does the actual spec work? That's in the specification repo. It can be useful to do a read-through of at some point; I mean, if you're trying to put yourself to sleep it's good for that, but it really does go over a lot of the details. If you're trying to understand what some of these terms mean, or what the point of some of this is, the API docs at this time don't tend to cover that, but you can go to the spec, look at the API specification for tracing, and it will tell you in a lot of detail what the point of this stuff is.
It covers not just the overall structure but also the reasoning behind some of the parameters you have to pass and what they mean. A lot of this isn't necessarily well documented in the API docs yet because the project is still in beta, but you can find the information in the specification, so it's a useful resource even if you're not doing an implementation and just want to wrap your head around how OpenTelemetry works.

In particular, I recommend the semantic conventions. This is where we're defining conventions for how to describe common operations like HTTP calls and database calls, and it's good to get an understanding of how these data semantics work. For example, if you want to describe an HTTP request, the convention says you call the attribute http.method, it's a string, the value is in all caps, and it looks like that. We're trying to standardize this as much as we can so that analysis tools can be presumptive and have some semantic understanding of what these traces are. It's one thing for a system to see a trace or a span or an event; it's another thing for an analysis tool to see an HTTP call. These semantic conventions are how we get those higher-level concepts into the data, so there's actual meaning behind it that analysis tools can make use of.
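For a flavor of what those conventions look like from instrumentation code, here's a small sketch in Go; the helper and its parameters are made up for illustration, and the key names follow the conventions roughly as they stood around the time of this talk (they have continued to evolve in the spec).

```go
package semconvexample

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// describeHTTPCall sets standard HTTP semantic-convention attributes on a
// span, so an analysis tool can recognize it as an HTTP operation rather
// than just a generic span.
func describeHTTPCall(span trace.Span, method, route string, status int) {
	span.SetAttributes(
		attribute.String("http.method", method), // e.g. "GET", upper-case per the convention
		attribute.String("http.route", route),   // e.g. "/hello"
		attribute.Int("http.status_code", status),
	)
}
```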
If you're interested in how changes get made to the spec, we have an RFC process: OTEPs, OpenTelemetry Enhancement Proposals, is what we call them. You probably won't be writing one of these unless you're working on the project, but just so you know, this is how change gets made. We care a lot about backwards compatibility, and with so many implementations we want something like an RFC process that gives everyone a chance to weigh in and work things out before we bake them into the spec. There's a proposal process you can look through, and you can see we've already had a substantial number of OTEPs. So that's how the project works: the community repo, the spec repo, and OTEPs are the core of defining the project and making changes, and then the proto repo is where we define the OTLP data protocol. Okay, that's the project overview. Any questions about that part? Doesn't look like it; it's pretty straightforward.

Let's move on a bit and cover some of the core concepts behind distributed tracing and OpenTelemetry in particular, starting with a reminder of the kind of transactions we're trying to capture, because OpenTelemetry is all about distributed transactions. There are a lot of different kinds of workloads you can look at in this world, but distributed transactions are the bread and butter of cloud-based computing. For clarity, what I mean by a distributed transaction is this: say you have a client, a mobile client for a service that lets you upload a photo along with a caption. The client makes a network call to a service, and of course it's not going to be one service, it's going to be a whole pile of services. It might hit a proxy first; the proxy talks to an auth server, maybe writes the photo down to some scratch disks, then calls your application; your application processes the photo and uploads it to some kind of cloud storage; and once it's uploaded, it records what it did in a database by calling a data service that keeps some information cached in Redis and writes the rest to SQL. I wouldn't even call this microservices; I feel like I've been looking at LAMP-stack apps that are basically this for twenty years. And this is the thing I always point out: even back then we had this annoying problem that there's a whole bunch of services involved, and if I want to reconstruct one of these transactions it's kind of tricky to do, because there's not a lot of context.

To give a sense of that, here's another way of looking at the same transaction: a trace view, which is more like a call-graph description of it. When you're looking at traces, you tend to look at a display where each line is an operation, the length of the line represents how much time that operation took, the connecting lines are network calls, and the colors match the services. The client is open for the entire transaction; it talks to the proxy, the proxy talks to the auth server, the proxy uploads the file, talks to your app server, and so on.

What can we find when we look at this kind of contextualized data? First and foremost, latency: we can see not just how long things take, but where the time is being spent, and also where time is spent waiting. The client takes a long time, but it's only doing work for a very short part of this transaction; for the most part it's sitting idle, waiting for the transaction to return, while the work happens somewhere else. Figuring out where the work is actually being done is what we refer to as the critical path, and that's what lets you home in on what to actually optimize, because there's no point optimizing something that's just waiting, or something that won't affect the overall latency of the transaction. The next thing we care about is errors: where did the error happen, in this component or that one? Then events. Events are just logs. For some reason people make a big deal about distributed tracing and logging being different; I don't subscribe to that view. I think distributed tracing is just logging, with all the context you actually want in order to analyze the data; that's the only real difference. You're going to want these events, the log lines that tell you all the little steps that went on in that transaction. And last but not least, correlations. This is the big win of OpenTelemetry: having all of this context means you can correlate these events really easily, not just within a single transaction but across transactions. This error you're seeing, does it correlate with a particular project ID? With a particular host or region? Is it a particular endpoint? What are the commonalities? Those kinds of correlations are really important for root-causing your issues, and if your data isn't contextualized properly, that kind of correlating is really time-consuming and labor-intensive; at least, that's what I've found.
Another way of thinking about this is the logs you have versus the logs you want. You have all these logs across all these different services, and if what you're trying to do is find just the logs that were part of one transaction, and you don't have some kind of ID that indexes all of those logs by transaction, it can be really hard to piece together even a single transaction. And if you start trying to get that transaction ID into every log, that really just takes you down the road to distributed tracing and OpenTelemetry.

That leads us to the core concept: context propagation. This is the thing I really want people to understand about the OpenTelemetry model; I don't think you can debug or reason about distributed tracing or OpenTelemetry unless you understand the basics of it. Say you have two services connected by a transaction: some operations in service A, a network call, and then some operations in service B. Within a service, you have context. These operations form a call graph, and each operation is contextualized: there's some environment or context that's specific to the transaction, maybe specific to the operation, and it follows the path of execution, so the context is always available while you're executing code. Making sure that context follows your code and is always available is one part of OpenTelemetry. If you're a Go programmer this is super normal: Go literally has a context object, one of the few languages where it's an explicit, first-class citizen. For better or worse you do it all manually, but I do think Go programmers understand context. This is actually harder to explain to programmers in other languages; unless they've been mucking about with thread-locals or something, it's just not common. So you know what I mean when I say context.

It's really the propagation part that is new, even for Go programmers. With propagation, you take that context, or some portion of it, and serialize it as metadata. If this is HTTP, the metadata is HTTP headers: right when you make the network call on the client side, you inject the context into the HTTP headers, and on the server side you extract the context from those headers and resuscitate it as a new context object that keeps on going. The core concept behind distributed tracing and OpenTelemetry is basically: imagine you had Go's context object and made it distributed, so anything you put in it in one service might be available to another service down the line. That's the core of what OpenTelemetry does, and all the observability it provides is built on this principle of storing useful indices and data structures in a context object and letting you manipulate it behind the scenes.
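Here's a minimal sketch of what inject and extract look like with the current opentelemetry-go propagation API (the package layout has changed since the beta used in this talk, so take the exact names as assumptions). In practice the HTTP instrumentation calls these for you; this just shows what happens under the hood.

```go
package main

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func main() {
	// Register the wire format for context; here, W3C trace context.
	otel.SetTextMapPropagator(propagation.TraceContext{})
}

// Client side: serialize the active context into the outgoing headers.
func injectContext(ctx context.Context, req *http.Request) {
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
}

// Server side: turn those headers back into a Go context that downstream
// spans (and baggage lookups) will inherit.
func extractContext(req *http.Request) context.Context {
	return otel.GetTextMapPropagator().Extract(req.Context(), propagation.HeaderCarrier(req.Header))
}
```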
Speaking of propagation, I want to let you know there is some other standardization going on outside of OpenTelemetry, inside the W3C, the group that standardizes HTTP and HTML and all the web standards. A number of us, mostly people who have been working on the OpenTelemetry project, including me, have been working on getting actual standard headers for all of this added to the HTTP specs. One header that's already well along is called Trace Context. It has two fields, traceparent and tracestate (if you use B3 it's very similar). The traceparent header is the one with the main goods: it carries the trace ID, which is basically that transaction ID I was talking about, and a span ID, which is an operation-level identifier. Those are your core contextualizing identifiers. There's also tracestate, which is kind of internal to tracing systems; you don't really need to worry about it. In addition to traceparent, another set of headers is coming along called baggage. Where traceparent is tracing-specific and its IDs have actual meaning, baggage is totally arbitrary key-value pairs; it's the generic context propagation I was talking about. By standardizing on this collection of headers, and getting all the software in the world to accept them, propagate them, and interoperate, we take a big pain point out of distributed tracing, which is connecting all of this up without any official way to do it.
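For reference, here's roughly what those headers look like on the wire; the traceparent value is the example from the W3C spec, and the baggage entries are made-up values.

```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
baggage: projectID=456,region=us-east-1
```

The traceparent field is version, trace ID, parent span ID, and flags, separated by dashes; baggage is a comma-separated list of arbitrary key=value pairs.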
Okay, getting into SDK setup: we do recommend Go, Python, JavaScript, and Java at this time. I should warn you everything is still in beta, so there are breaking changes still happening, but I do consider this code production-ready; you just may have some API changes to contend with in your application code. As part of doing the setup, and we'll walk through this together, you want to make sure you're actually verifying things as you go; that really belongs with the rollout material later, so I'm not going to go into it now. Let's get some hands on keyboards soon, but before we do that, are there any questions? I see I've got some things in the Q&A already.

One question coming in: in your experience, has it ever happened that baggage slows things down in the application, and how do you identify that? Yes. The thing to mention with baggage is that it adds bloat to your HTTP requests. Obviously, the more HTTP headers you add, the bigger each request gets, so that will eventually cause some slowdown if you add too much baggage. That's the qualifier with baggage: you do have to think about it, it's not free, and you will steadily make your payloads larger the more baggage you add. Besides that, I don't think baggage in particular adds a lot of latency in my experience; OpenTelemetry in general does add some latency, but not baggage specifically. The best way to identify whether baggage is causing problems is to have a look at the size of your HTTP headers. If you notice you're sending tiny payloads with a huge amount of header information to do the tracing, I could see that being a place where baggage actually causes latency issues, so that's how I would look into it.

Okay, next: do I recommend doing both tracing and logging separately inside the app? I have seen cases where tracing doesn't give a clear picture if I want to focus on a sequence of events a particular service is doing. That's interesting; I'd love to dig into that more. Personally, I see tracing and logging, as I mentioned before, as the same thing. The only difference is that traditionally people have tended to use sampling in their tracing system and not in their logging system, and that's part of why people sometimes get the impression tracing isn't great for root cause analysis: if you're sampling everything away, it's not going to be an effective auditing tool. But I do think that form of preemptive sampling is getting phased out. Lightstep doesn't work like that at all; we take 100% of the data, so there's no sampling going on, and in that case it's no different than logging. I think you're going to see a lot of other systems moving to something similar. So personally, I just see tracing as a form of logging with better identifiers. If you do have a logging system you want to keep using, what I suggest is making an OpenTelemetry plugin to attach all of those tracing identifiers to your logs. If you start stapling the trace ID and span ID onto your log statements, you're going to improve your logging system: it's now effectively something like a tracing system, just piggybacking on another system that does the heavy lifting of propagating all of this. So I don't think it's necessary, but as a transition, yes, you can totally use logging and tracing together.
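Here's a small sketch of that idea against the current Go API; this is a hypothetical helper, not an official plugin, and it simply staples the active trace and span IDs onto an ordinary log line so the log can be joined to traces.

```go
package tracelog

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/trace"
)

// LogWithTrace writes a normal log line, prefixed with the tracing
// identifiers from whatever span is active in ctx.
func LogWithTrace(ctx context.Context, msg string) {
	sc := trace.SpanContextFromContext(ctx)
	log.Printf("trace_id=%s span_id=%s %s", sc.TraceID(), sc.SpanID(), msg)
}
```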
Okay, next question: when do we expect a GA for Golang and/or Ruby? Oh man, good question. We are so terrible at predicting deadlines in OpenTelemetry; every time we predict a deadline, we're just ridiculously wrong. The project has shaped up and become a lot more organized, though. If you really care about how fast things are going, you can look at the GitHub project boards: there's a Go release candidate project you can follow, which will probably be the most accurate thing as far as Go getting out the door, and you can also look at the GA spec burn-down. In order to GA any of these implementations, the spec has to get finalized, and that project follows all of the GA work. You'll notice we have a lot of tagging going on: issues are marked with their priority, and with whether they're required for GA or not, so you can actually get a fair amount of information from those backlogs if you're interested. Ballpark, we're hoping to have the tracing spec frozen very soon, in a week or two, and I'm hoping to have at least a stable release candidate for the tracing systems by end of year; we're hoping by end of November, but I struggle to really push an OpenTelemetry deadline when we've blown through them all in the past. Hopefully that's helpful.

Do all systems have to use the OTel W3C context, or is there a way to migrate? Yes, you can migrate; there's no requirement that you use the W3C headers, all of that is totally configurable, and we'll show you how to configure it. You do have to make sure all of your systems are using the same headers to propagate, otherwise propagation isn't going to work. One way to do a migration is to install multiple propagators; there's a thing called a stacked propagator. If you'd like to receive, say, the new W3C header plus an old header, a stacked propagator says: I'll go in order, and if I see a W3C header I'll take that; if I don't, I'll look for B3 headers and take those if I see them; and then I'll look for, say, OpenTracing headers or something like that. Likewise, when you go to inject and propagate those headers, you have a choice of propagating one header type or even multiple header types. If you want to propagate both W3C and a legacy header at the same time, you can do that. Obviously that's going to bloat the payload of your requests, but it's one way to have a zero-downtime transition: propagate both kinds of headers until all systems can accept the W3C headers, and then remove the old header type. That's how I recommend taking that approach.
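Here's roughly what that looks like with the current Go API, assuming the B3 propagator from the contrib repo. Note that Go's composite propagator injects with every propagator in the list and runs every extractor in order, which is a slightly blunter behavior than the strict first-match fallback described above.

```go
package main

import (
	"go.opentelemetry.io/contrib/propagators/b3"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func main() {
	// A composite ("stacked") propagator: accept whichever header set is
	// present on extract, and emit all of them on inject during migration.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, // W3C traceparent/tracestate
		propagation.Baggage{},      // W3C baggage
		b3.New(),                   // legacy Zipkin B3 headers
	))
}
```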
Okay, next: can it be controlled dynamically, say sampling 10% of traffic and increasing it as needed? Yes, sampling can be controlled, but I should point out that sampling is analysis-tool dependent. This is actually one of the only foot guns in OpenTelemetry: you can install sampling plugins and things like that, but if the tools you're using to analyze the data aren't aware of the sampling you're doing, and aren't compatible with that kind of sampling, you're going to screw up your data. Those tools expect a certain kind of sampling, or no sampling, so if you do layer something on top, be warned it may not play well with your analysis tool. Really, the way sampling should be approached is to do whatever your analysis tool wants you to do, and then possibly layer something on top of that as an alternate system if you want two ways of doing it at the same time. That's where having a framework is helpful: you can install multiple span processors, multiple exporters, and so on.

More questions: can we share attributes between parent and child spans when using a logging exporter? I'm not totally sure, Ariel, what you mean by sharing attributes between parent and child spans, so maybe if you could clarify I'd be happy to answer. Ah, your parent has a username tag; yeah, I think I know where you're going with this. Yes, this is definitely an issue, and again it's somewhat analysis-tool dependent. The way the data shows up, the attributes are only on one span: if you put an attribute like username on one span, it's not automatically put onto the child spans. However, some analysis tools take all the attributes and do trace-level indexing, so you can say, find me any trace that has this attribute. But that's actually a pain point right now: a lot of systems, I think including Lightstep, don't have a great facility for looking for attributes across spans. Find me the trace where span A has attribute B and span C has attribute D; depending on your analysis tool, that can be tricky to do. I'll get into details later, but I believe the solution is to favor coarse-grained spans. There's not a lot of advantage to having lots and lots of tiny spans unless you're really trying to measure some specific latency problem; favor coarse-grained spans and get all of your attributes pulled together, which creates a better indexing scenario, in my opinion.

One more question: stacked propagators are only available in the OTel SDK, right, not in the Lightstep SDK? Oh no, some of the old Lightstep stuff does have stacked propagators. It depends on which Lightstep client you're talking about, because those are a little lumpy and not all the same, but that's a concept we actually came up with and added to OpenTelemetry, so they're definitely around. My memory of our Go implementation is a little fuzzy; I believe we've got stacked propagators in Go, but I don't totally remember. Okay, that was a good question session. Having a look at the agenda, we look like we're doing well on time, which is surprising for me: I'm usually either totally digressing and going really slow, or speaking a mile a minute and twice as far as we should be. And no, Ariel, you're not derailing the presentation; I really want to answer all these questions. I actually think these workshops are more helpful with questions; if one person has a question, probably a number of people have it, so it's worth answering them. I'm a little sad we can't do live talking on this, which is really how I prefer to do it, but we have time for a couple more questions, or we can move on to the code walkthrough. Let's do the code walkthrough, take a break after that, and then come back for more high-level discussion about how to roll this out and adopt it.

Okay, code walkthrough. Live code, always a fun time. I'm going to make a new directory called walkthrough, run go mod init, and start my code editor. We're going to make a simple client-server example, starting with the server, your basic Go server. Let me make the code a little larger; if you're having trouble reading it, please speak up and let me know. By the way, you can also follow along at tedsuo/otel-go-basics; I've got this all typed out, so you can just play around with that Go code as well, and hopefully you've got a Lightstep account made at this point.

So, how does this work? Let's make a quick example server, the world's dumbest thing: a hello handler func that writes out "hello world", then http.Handle to register the route, and ListenAndServe on port 9000. (I did something silly and forgot to add the route; the editor's still mad at me, undefined handler, is that really true? There we go. Alrighty.) That's your basic server, so let's try running it: go run, it installs some stuff, and it's probably running, so if you go to localhost:9000/hello, there we are. The world's most basic thing.

All right, so how do we add OpenTelemetry to this? Something my team and I have been putting together at Lightstep are some nice, convenient wrappers for SDK setup. Right now SDK setup works very well and it's well factored, but it's not packaged up, and when you're just trying to get started with the basics, it's helpful to have some packaging. In fact, I think OpenTelemetry is going to move toward a distro model, where you'll have a number of OpenTelemetry distributions with pre-installed plugins and exporters and so on; something like that will make it a lot easier to connect OpenTelemetry to different backends. Anyway, to get this connected up to Lightstep, you want to check out the Lightstep launchers; they just make it really easy to get started. You go get the launcher, and then it's easy: the launcher is going to have you configure OpenTelemetry with the two components that are required if you're talking to Lightstep. With the service name, you really need to name your services, so let's call this one hello-server; and if you're connecting to Lightstep, you need an access token. Let me show you: if I go into lightstep.com, into my account (you can see some old runs of all this stuff hanging out here), you go down under Settings, find Access Tokens, copy that access token, and paste it in. That will get OpenTelemetry installed, but it's not going to do anything yet, because we haven't added any instrumentation.
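For reference, the launcher setup from the walkthrough looks roughly like this; the option names are from the otel-launcher-go README around this time and may have shifted since, and the token is obviously a placeholder.

```go
package main

import "github.com/lightstep/otel-launcher-go/launcher"

func main() {
	// Configure the OpenTelemetry SDK and the export pipeline to Lightstep.
	// Service name and access token are the two required settings.
	ls := launcher.ConfigureOpentelemetry(
		launcher.WithServiceName("hello-server"),
		launcher.WithAccessToken("YOUR_ACCESS_TOKEN"), // placeholder
	)
	defer ls.Shutdown() // flush and stop the SDK on exit
}
```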
Now, in some languages that allow some amount of dynamic programming, we do try to auto-install all of this instrumentation: detect what's in your dependency graph, match those dependencies with instrumentation libraries we have, and install them. For better or worse, Golang does not take that dynamic approach. Anyone who has programmed Go for a while knows the language is really against magic and against spooky things happening in the background, so there aren't really facilities for that kind of automation in Go, because it's just not the Go way. The Go way is that you take the extra time to write it down, and then it's all written down and obvious, so other people can follow it. That's fine, but it does put an extra burden on people trying to do distributed tracing in Go, because you really need to make sure you install the proper set of libraries and plugins. (By the way, a quick pitch for my docs: if you go to opentelemetry.lightstep.com, I have a walkthrough that covers a lot of this as well; I highly recommend using it after this workshop as your go-to for getting started. I say that because I wrote it.) In terms of library and framework support, you can find it in opentelemetry-go-contrib; that's where all of our current instrumentation lives. There's a pile of popular stuff, but it's not everything, so the first step is to make sure the stuff you want is actually represented there, and if it's not yet, getting it written is probably the next step: you could contact the OpenTelemetry crew and maybe even make a contribution on that front. Adding instrumentation is definitely a place where we're looking for help, and it's an easy place to plug in.

In this case, though, we're just using vanilla net/http, so the otelhttp package is how you connect this up and wrap all of your HTTP stuff. Rather than passing the handler func straight in, we wrap it with otelhttp.NewHandler and call the operation "ted-server" just so we can find it in the backend. And that's basically it: we've now installed OpenTelemetry and added our HTTP instrumentation. Let's see what this looks like: I start the server up, go to localhost /hello, hit refresh a whole bunch of times, and then go into Lightstep, and here we go, ted-server. If we have a look, you can see the world's simplest trace: a single span covering this HTTP endpoint. But if you look inside it, you can see we've already got a huge, rich set of attributes and data that this instrumentation has added to the span: all of the semantic conventions around HTTP and networking are here, plus which instrumentation package actually created it. There's a lot of good stuff. That's the level of instrumentation you get out of the box for an HTTP endpoint, which is really quite good.
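Pulled together, the instrumented server from the walkthrough looks roughly like this (it assumes the launcher configuration above has already run in main); "ted-server" is just the operation label used in the demo.

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func hello(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "hello world")
}

func main() {
	// Wrapping the handler makes every request start a server span with the
	// standard HTTP semantic-convention attributes attached.
	wrapped := otelhttp.NewHandler(http.HandlerFunc(hello), "ted-server")
	http.Handle("/hello", wrapped)
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```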
But let's have a look at decorating this with some more information. To do that, we need to create a tracer. When you create a tracer in Go, you want to create it at the package level, and you want to name it after the package; in this case we'll name it "main". The reason you name it this way is that this is where the instrumentation-name attribute you saw comes from. Every tracer is named, and by the way, you're not creating a new tracer instance when you do this, you're literally just naming it, and that name should match the package that generated the data. Anything done by this particular tracer gets attributed to main, so if you're trying to hunt down where some instrumentation came from, that's your hint.

Let's give it some work. One thing you're going to need, of course, is a context object; the request's Context() is where we pull it from. Then, to get the currently active span, you get the span from that context. Again, you're seeing the importance of context to OpenTelemetry: everything is context-based, and you'll see that almost all the functions take a context as an argument. The current span represents the operation we're currently in, and it's the span that was created by that handler wrapper, the span you've already seen in the backend. Once you have that span, you can do whatever you want with it. Let's set attributes: a string attribute with the key projectID and the value 123. Give that a shot: go to localhost, refresh a whole bunch, have a look in Lightstep, grab one of the latest traces, and you can see projectID showing up right there. It really works. Honestly, this right here is enough setup to play around with the API and learn a lot about it.

Another important thing is recording errors. span.RecordError is the call (in this beta it takes a context). If we record an error, restart, and refresh a whole bunch, you can see an error has been logged onto the span, but notice it hasn't marked the span as an error. To do that, you have to change the status code. Recording an error doesn't automatically mean the entire span, the whole operation, has failed or is in an error state; there are many other reasons an error might happen. But if you decide this really does mean there's a problem, then you set the status, which is just a code and a message. OpenTelemetry has status codes, and the only one you really care about is Error, and then you can send a message. So: set the status to Error, run the server again (it's a lot of fun watching me go through all these steps), refresh a whole bunch, look in Lightstep, and you should see this marked as an error. I'm not sure why it's not doing that; it's actually something I noticed yesterday. There does seem to be something slightly funky right now with recording errors, and we expect it to be resolved in a day or two. Just so people know why that's happening: we just changed how status codes work in OpenTelemetry. OpenTelemetry used to have a lot of status codes, adopted wholesale from a set internal to Google, and people didn't really like them because they weren't sure how to apply them. We did an about-face at the last minute and decided that rather than having lots of status codes, we'd have just two, Error and Ok. By default the span's status is Unset; you don't need to set the status at all unless you think it's an error, in which case you set it to Error. The only time you would set the status to Ok is probably inside a collector, or somewhere else where you're trying to explicitly suppress some kind of error reporting. Ok is a way to send a message to your analysis tool that even if this would count as an error, even if you were going to raise an alert on it, please don't, I'm signing off that this is okay; it's a way to do error suppression. I think as a side effect of that change we're not currently picking up on those errors, so my apologies, but normally that works: you record the error, and then you set the status.
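Here's what those calls look like against the current Go API. The signatures have shifted since the beta in this walkthrough (for example, RecordError no longer takes a context), so treat the details as assumptions.

```go
package main

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

func handle(ctx context.Context) {
	// Get the span started by the surrounding instrumentation (e.g. otelhttp).
	span := trace.SpanFromContext(ctx)

	// Attributes are indexed key/value pairs describing the whole operation.
	span.SetAttributes(attribute.String("projectID", "123"))

	// Record an exception-style event, then explicitly mark the operation
	// as failed; recording alone does not change the span's status.
	err := errors.New("something went wrong")
	span.RecordError(err)
	span.SetStatus(codes.Error, err.Error())
}

func main() { handle(context.Background()) }
```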
So we've looked at setting attributes and recording errors; let's look at events. You can just call AddEvent on the span, same deal, it takes the context. To prove it works: go here, refresh a few times, look in the Explorer for some new things coming in, and you can see the event showing up as well. (I would love to improve Lightstep's display for events, by the way; if you have opinions on that, please let us know.)

So that's it; those are the basics. Pretty much everything you're going to do on a span can be summed up right here: you set attributes, which are indices the entire operation can be looked up by; you record errors and set statuses as needed; and you add events, basically as a replacement for logging. You'll notice some of this is pretty verbose; it can be kind of verbose to write, so I do recommend wrapping a lot of it in some kind of convenience layer that you use internally. I find that saves a lot of typing. We'll probably add more convenience code to OpenTelemetry in the future; we just wanted to get the stuff we knew worked out the door before getting all crazy adding sugar to everything. For the Go programmers on the call who are interested in ways to make this API less verbose, with a better hand-feel for application code, say as a helper layer on top of it: I'm personally interested in that too, so feel free to reach out to me directly if you have ideas.

Those are the basic span operations; now let's go straight to child spans. We've set all of these attributes and events on an existing span, but what if we want to create a child span and put them on that instead? To do that, you grab the tracer and call Start, passing the current context and an operation name; say "another-operation" for a sub-span. That gives back a new context and a new span, and then you always want to close the span. This is the number one gotcha in OpenTelemetry: hanging spans, forgetting to close them. It's one of the few places where there's state you have to track in OpenTelemetry, but luckily, if you're using functional closures and the like, defer is your friend. With this new span added, when we restart we'll see all of those attributes and events we made put on a tiny child span. (Oops, didn't wait long enough; here we go. Oh, and we're finally getting our errors too, sweet.) So here you can see we have our initial hello span, and then this child span we added, and on that child span are all of the things we added: you can see this instrumentation comes from main, as opposed to the parent's instrumentation, which comes from the plugin, and you can see the attributes and the events we added. That's how you create a child span; it's quite simple, and that's really all there is to it.
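The child-span and event calls from this part of the walkthrough look roughly like this in current Go API terms (again, minor signature differences from the beta are likely).

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

// Package-level tracer, named after the package doing the instrumenting.
var tracer = otel.Tracer("main")

func handleRequest(ctx context.Context) {
	// Start a child span under whatever span is already active in ctx.
	ctx, span := tracer.Start(ctx, "another-operation")
	defer span.End() // gotcha #1: always end spans, or they hang open

	// Events are timestamped, log-like records attached to the span.
	span.AddEvent("doing the sub-operation")
	_ = ctx // pass ctx on to further work so it nests under this span
}

func main() { handleRequest(context.Background()) }
```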
going to go through these questions now if people have other questions now would be a good time to ask them so ryan asks uh what are some of the biggest consumers of otel in terms of collecting spans for apps that process millions of events over minutes and how do they configure the hotel client to handle that kind of load i see so the biggest consumers of otel i mean i assume this is um maybe like what analysis tools have the highest throughput rate in terms of consuming uh information i know uh i can't speak for other systems i work at light steps so i can speak for light step i mean we're you know we're processing like terabytes of data so we we find that that there really isn't a problem uh when it comes to to processing those events from a light step standpoint um as far as uh getting those events two light steps that's where the the collector comes in um uh open telemetry is efficient enough that uh the overhead of generating those spans isn't super crazy and as far as the the work of processing and transmitting them the the bigger your system is the more of a kind of tiered collector architecture you're probably to need so i recommend the first tier is like a side car a collector running as a side car and that's just to get the spans off of the process out of your application process and into another process so if anything weird or bad happens to your process um it won't be affected this also prevents you know memory and overhead from growing because your process is having to hold on to all these spans so i recommend a minimum shoveling all your data out of your process into a collector and then you may have to have a tiered set of collectors operating as gateways as a way to provide some amount of buffering between the system consuming the data and the system producing the data you may need some amount of buffering in the middle just to deal with the operational mechanics of all that and so that's where the collectors really shine because they can provide that buffering that you might need to do in the middle to do things like deploys and handle you know outages and network rollovers and stuff like that so hopefully that answers your question ryan um next question is so basically we'll have to either use the hotel libraries which are pre-instrument or do manual instrumentation yes this is the sadness of go i am so sorry uh in other languages uh it is possible to to automate this stuff a lot more and also use things like annotations and whatnot um in some languages like java you know a lot of it can be truly dynamic but on a fundamental level gold just doesn't work that way go is a very explicit manual oriented language that believes that readability is paramount and if you do a whole bunch of this magic in the background then it becomes harder to read the payoff the or the price of that readability is of course the time it spends writing all this stuff so i i do feel you um this this is the one language i would say of all the languages out there go is definitely the most manual implementation heavy version of open telemetry so apologies um can logging be done in json format uh yes yes there is a json version of otlp in the works um and you can create an exporter really for any format that you want so so there's no reason you can't export in json and then someone else mr ben here uh says is it possible send or export prometheus metrics to lightstep uh it is definitely possible to start sending some basic machine metrics to lightstep uh actually these um uh the go launcher will do 
You'll notice it was sending me a warning that metrics are disabled by configuration, no endpoint set. If you go in here and add a metrics endpoint to the launcher configuration, it will start sending basic machine metrics like CPU and RAM. But we don't have a fully fledged metrics product at this time, so we don't really process Prometheus metrics in that respect. Cool beans. All right, those were some good questions; let's switch back to instrumenting the client.

So let's add a client to this thing. We create a new Go file, same deal: package main, and we're just going to do something super basic, make a request to that endpoint. So we create a client and make a GET request to localhost:9000/. (We're automating OpenTelemetry installation; can we automate error handling while we're at it?) Then we say client.Do with our request. (Am I typing this again? "Not enough arguments," it says, but there's so much Zoom chrome I can't read what it's trying to tell me. Oh yeah, add that nil. There we go.) So this is going to make a request to that same endpoint. We go back here, run our server, and run our client, and that should show us basically the same thing we've been seeing. Sure enough, here we go: we're still seeing just the server spans. The client hasn't been instrumented yet, but you can see the client works.

Okay, so let's add the configuration. No need to retype this stuff, just bring it over; we've now got the OpenTelemetry launcher installed. Then, same deal, we have to add some instrumentation. In this case otelhttp gives us a Transport, which is the way to hook in where we need to be in order to do the injection and extraction work, and that should be the only thing we need to add. So let's give it a shot: restart the server, run the client a bunch of times, and here we go. Now when we look at spans, we see a connection of spans across services. We have our client span here and our server span here, and because our server does literally nothing but print hello world, most of the overhead is in the client; this is where you see the setup of the network connection and all of that, because basically nothing is happening in the server.

We could change that, though. If we go back to the server and add some time.Sleep calls, then when we talk to it we can see a little more of where the time is going. We said "another operation" was going to sleep for 30 milliseconds, and we see it clocking in at 31.7 milliseconds right here. So that's maybe a slightly more reasonable-looking trace.

Okay, so that's it. That's really all you have to do, because all of this instrumentation is doing the context propagation for you under the hood.
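For reference, here is a rough sketch of what that instrumented client ends up looking like, assuming the Lightstep launcher and the otelhttp contrib package shown in the demo; the service name is a placeholder, and localhost:9000 is the endpoint typed in the walkthrough.

```go
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/lightstep/otel-launcher-go/launcher"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	// Configure the SDK via the launcher, same as on the server side.
	ls := launcher.ConfigureOpentelemetry(
		launcher.WithServiceName("client"),
	)
	defer ls.Shutdown()

	// otelhttp.NewTransport wraps the default transport so each outgoing
	// request gets a client span and has the trace context injected into
	// its headers.
	client := http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

	req, err := http.NewRequestWithContext(context.Background(), "GET", "http://localhost:9000/", nil)
	if err != nil {
		log.Fatal(err)
	}
	res, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	res.Body.Close()
}
```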
To give an example of using this with baggage, which I think is another good way to look at this, let's try propagating a baggage value down the pipe. Let's say this project id attribute is not something we have access to in the server for some reason; the client knows the project id, and we don't want to make some kind of database call to figure it out here. What we can do on the client is add it to the context. We don't have a context right now, so we take context.Background and wrap it with otel.ContextWithBaggageValues, setting the value to "456". We now have a context that has our baggage values in it, and then we need to make sure that gets into our HTTP request, so we switch to the context versions of these calls so we can pass the context in. That should cause us to start sending this project id along.

So let's go back over here. Let me look at my notes real quick. Yes, okay: here in the client I'm putting this baggage into the context, and here in the server I'm pulling that baggage out of the context. Then, if I want to actually use it, I set it as a label. (Oh, because it's a value, we have to say label.KeyValue. There we go.) And I've changed the value to 456 so we can actually tell this is working. We run our server, start running our client, and go back into Lightstep to have a look. It should work... but it didn't. Why did I not see it here? That's interesting.

So how do we debug things? This is a great example of "how do I debug something that isn't working." I'm curious whether this context is actually being propagated, so let's figure it out. In the client, I'd love to know whether these baggage values are actually getting added to the request, so the first step is an fmt.Printf of the headers. If I run the client... nothing in there. Well, maybe that's the wrong place to look; let's look on the server side. On the server we have a request, so let's have a look at its headers. I can see the three B3 headers I'm using for span and trace, and I can see that the baggage header is not actually there. So let's see what I did wrong. Going back into the client: we've got our context, we're adding it to our request with the context-aware constructor, and it really should be there. That's interesting. Huh. Well, that's a mystery I'm not going to try to unravel right now, because that should actually be working, and it was working yesterday, so what are you going to do. You can see how it would work; apologies for not actually going through with it. I'm not going to make you sit here while I muck with it too much; maybe later we can get back to it.

But that's the basics, as far as a walkthrough goes, for getting started with spans. So, do we have any questions at this time around this span API? "Are there any examples of this I can get into and show people?" Yeah. And Ariel, I'm not sure if grabbing the baggage out of there would really make a difference; I grabbed those earlier just to have a look at them. So, a little funky.
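Here is roughly what that baggage round trip looks like, written against the current go.opentelemetry.io/otel/baggage API; the names used in the recording, such as otel.ContextWithBaggageValues and the label package, have since been renamed, and the member key "projectID" is my own placeholder.

```go
package demo

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/baggage"
	"go.opentelemetry.io/otel/trace"
)

// Client side: put the project id into baggage so it rides along with the
// propagated context instead of being passed as an explicit parameter.
// Errors from NewMember/New are ignored here for brevity.
func newRequestWithBaggage(ctx context.Context) (*http.Request, error) {
	member, _ := baggage.NewMember("projectID", "456")
	bag, _ := baggage.New(member)
	ctx = baggage.ContextWithBaggage(ctx, bag)
	return http.NewRequestWithContext(ctx, "GET", "http://localhost:9000/", nil)
}

// Server side: pull the value back out of the request context and index the
// current span with it.
func annotateSpan(ctx context.Context) {
	projectID := baggage.FromContext(ctx).Member("projectID").Value()
	trace.SpanFromContext(ctx).SetAttributes(attribute.String("projectID", projectID))
}
```

Note that, as the walkthrough goes on to show, this only works end to end once a baggage-capable propagator is actually installed on both sides.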
Okay, moving on. One thing I do want to mention is that we haven't really talked about resources. We've been talking about spans and attributes and errors, but you also want to use resources to describe how your service actually runs. You can think of resources as basically attributes, but for your service, and you can add them here with the launcher's resource attributes option. It just takes a map of string to string, and anything you put in there goes onto your app: my resource key, my resource value, and there we go.

Oh, and we're getting called to break time. Yes, we are over break, my bad. Why don't we go on break, and when we come back, if I've solved the baggage problem I'll update everyone on that and answer more questions. Let's take a 15 minute break and come back at 11. Does that work for people? See everyone back here at 11.

Hello, hello. It's 11am and we're back; hopefully you all are back too. We can go through your questions, but first: I figured out what I did wrong, and it's really silly. My issue with baggage was that I had not installed the propagators. The OTel launcher only installs the B3 propagator by default, and it has a funky way of letting you add extra propagators: you have to add "cc". This is really silly, but baggage used to be called correlation context, and it got its name changed back to baggage, so this really should say "baggage", but right now it's called correlation context. That's a weird little gotcha. I actually have an action item now: I really think these launchers should come with the baggage propagator installed automatically, because having no baggage propagator is kind of unexpected. But this is how you configure them: with the launcher, you configure this using WithPropagators, and if you add this line, the instrumentation will actually do the propagation.

So if I run that now... oh, am I not sharing my screen? I'm an idiot. Okay, try this again, from the top. My mistake was that when configuring the launcher you need to add WithPropagators and include the correlation context propagator. Sorry, this is a little silly; this is where the "we're in beta" shows a little bit. I would like the baggage propagator to come with the launcher by default, because I think that's more appropriate, and we're going to get that fixed. But if you add this propagator to the server and to the client, so the propagators match, then everything else just works. If you start the server, run the client a couple of times, and go over to Lightstep to look at the latest data coming in, we can see that our project id is set to 456, which is the value we set in the client. This is a piece of baggage that started as a value we wanted to do some indexing with; it was available in one service, so we added it to baggage and then used it to create an index later, on a later span. That's the primary use case for baggage right there.

You can potentially do other, application-level things with baggage, like feature flags and so on, but right now we recommend only using baggage for observability and not making your system's behavior depend on it.
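Putting the two configuration pieces from this part of the walkthrough together, the launcher setup ends up looking roughly like this. The option names and the "cc" propagator string are assumptions based on what's said in the talk and on the Lightstep otel-launcher-go of that era, so treat this as a sketch rather than the definitive API.

```go
package main

import (
	"github.com/lightstep/otel-launcher-go/launcher"
)

func main() {
	ls := launcher.ConfigureOpentelemetry(
		launcher.WithServiceName("server"),
		// Resources: attributes that describe the service itself.
		launcher.WithResourceAttributes(map[string]string{
			"my.resource.key": "my resource value",
		}),
		// Propagators: b3 for the trace headers plus correlation context
		// (baggage); without the second entry, baggage headers never get
		// injected, which was the bug in the walkthrough.
		launcher.WithPropagators([]string{"b3", "cc"}),
	)
	defer ls.Shutdown()
}
```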
The reason is that it's easy for something like context propagation to be broken right now, because HTTP services don't do context propagation by default, so I don't think you should depend on baggage for application-critical features at this point; you should only use it for observability. That's my current take on baggage. I do think it can be expanded into other really interesting domains, like A/B testing, security, and authentication, but that's the next version; that's not where we're at yet. In any case, I did want to show you all that baggage does in fact work. Okay, so that was the actual end of that walkthrough.

You all had some good questions in the Q&A, so I'm going to go over those now, and then we'll move on to more best practices: how do I set this up, how do I roll it out.

First question: "RecordError generates a log event, not a set of attributes?" That's correct. Errors are events in the sense that they have a timestamp associated with them, and events can also have attributes. The real difference between an event and an attribute is that an event has a timestamp and can carry multiple attributes. That's why an error is an event. The span-level piece is the span status; it's not actually an attribute, it's a field on the span, and setting that status to error is the span-level aspect of recording an error. But if you're just trying to record that an error occurred, then RecordError is basically a convenience function for recording an error event. Hopefully that explains that bit.

"Does OTel define a follows-from relationship between spans, as in OpenTracing?" This is actually an interesting question. To give some backstory: the kind of parent-child relationship we've been showing here is entirely synchronous; we've only looked at synchronous transactions so far, not anything asynchronous. For example, say you run a transaction like this, and as part of it some background processing gets kicked off; you want to show that there's a link between that background processing and the original transaction. In OpenTracing the way we did that was as a single trace that used a different kind of parent-child relationship called follows-from; it was just an indicator in the graph that the relationship between spans was asynchronous. In OpenTelemetry this works a little differently, and it's actually an area where I'd like to do a lot of work going forward, because it relates to observing message queues and a lot of other workload patterns that aren't simple transactions. What OpenTelemetry has instead is a concept called links. It doesn't have a follows-from relationship; it just has regular parent-child relationships within a trace, and if you want to connect multiple traces together you do that using the links feature of the OpenTelemetry span. The idea is that asynchronous workflows are actually separate traces: your background job is its own trace, but the link says these traces are related to each other. That comes more out of the OpenCensus model, which worked a bit like that, and there are aspects of how links handle scatter-gather patterns that looked attractive to us, so that's why we ended up going with that model.
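As a concrete illustration, here is a minimal sketch of linking a background job's own trace back to the transaction that kicked it off, using the span links API in go.opentelemetry.io/otel/trace; the tracer name and span names are placeholders of mine.

```go
package demo

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// enqueueBackgroundJob starts asynchronous work as its own trace, linked back
// to the span context of the originating transaction.
func enqueueBackgroundJob(ctx context.Context) {
	// Capture the span context of the transaction we are currently inside.
	origin := trace.SpanContextFromContext(ctx)

	go func() {
		tracer := otel.Tracer("example.com/worker")
		// The background work is a root span in its own trace; the link
		// records the relationship to the transaction that enqueued it.
		_, span := tracer.Start(context.Background(), "background job",
			trace.WithLinks(trace.Link{SpanContext: origin}))
		defer span.End()
		// ... do the asynchronous work ...
	}()
}
```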
So hopefully that answers your question: you use links to do follows-from.

Okay, Ben asks: how can a span be manually propagated, specifically when using a model other than HTTP server and client, such as a message queue or some other form of pub/sub? Great question. What you need to do in that case is create a propagator for that particular transport. You're not going to be able to use the HTTP propagators, but you can write your own propagator that injects and extracts from some other kind of carrier. That model works when it's still the same shape of problem: I've got this context and I want to inject it into this metadata, only the metadata is, say, Kafka metadata on a Kafka message. Writing your own propagator is the way to handle that, and you can look at existing propagators to get a sense of how. To point you at it in the docs: here's the propagators package, with some examples of what you'll find there. Basically you need to implement Extract, Inject, and Fields; that's the basic structure, and you can study the existing implementations. Then it's really just calling Inject and Extract on the carrier you want to use. "Carrier" is our term for the thing you're injecting into or extracting from. So you take whatever object you're injecting into, let's say Kafka metadata, and put it behind a text map carrier interface, assuming it's ASCII text; if it's binary, you need the binary propagator, which I think is still in flight at this time.

So hopefully that answers it: you basically need to make your own propagator. And if you're looking at doing this for something common, like Kafka or AMQP, I'm really interested in modeling that properly, and we definitely want to get it into OpenTelemetry, so that's a great thing to come get involved with in the community if it interests you. You're going to discover that there's more difficulty in tracing message queues and these pub/sub patterns than just inject and extract. For example, you often care not only about how much time was spent in each operation for a particular message, but about the gaps: what happened between those times while the message was sitting in one of these systems. A lot of these systems also have batch modes that require some thought, because you can't use the easy closure approach, where the transaction is the unit of work, if you're instead trying to process 50 messages in a batch.
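To make the inject-and-extract part concrete, here is a sketch of adapting message headers to the TextMapCarrier interface so the globally configured propagator can write into and read from them. The Header type is a stand-in assumption, not a real Kafka client type; adapt it to whatever client library you actually use.

```go
package demo

import (
	"context"

	"go.opentelemetry.io/otel"
)

// Header is a stand-in for a message header (key/value pair) on, say, a
// Kafka message.
type Header struct {
	Key   string
	Value []byte
}

// headerCarrier adapts a slice of headers to propagation.TextMapCarrier.
type headerCarrier struct {
	headers *[]Header
}

func (c headerCarrier) Get(key string) string {
	for _, h := range *c.headers {
		if h.Key == key {
			return string(h.Value)
		}
	}
	return ""
}

func (c headerCarrier) Set(key, value string) {
	*c.headers = append(*c.headers, Header{Key: key, Value: []byte(value)})
}

func (c headerCarrier) Keys() []string {
	keys := make([]string, 0, len(*c.headers))
	for _, h := range *c.headers {
		keys = append(keys, h.Key)
	}
	return keys
}

// Producer side: inject the current context into the message headers.
func injectHeaders(ctx context.Context, headers *[]Header) {
	otel.GetTextMapPropagator().Inject(ctx, headerCarrier{headers})
}

// Consumer side: extract a context from the message headers before starting
// the consumer span.
func extractHeaders(ctx context.Context, headers *[]Header) context.Context {
	return otel.GetTextMapPropagator().Extract(ctx, headerCarrier{headers})
}
```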
You have to think about whether each message becomes its own trace, or whether one trace covers the processing of all 50 messages. So unlike HTTP and the more ordinary RPC patterns people are used to, message queues come with a number of topology questions that I think need specific thought; you can't just inject and extract and have it be totally sufficient, though that is a great way to get those ideas moving. A bit of a long answer, but hopefully that covered it; feel free to follow up, because it's an important question.

Okay, Renato asks: would sampling be disabled for traces that have errors on any span, and if not, is there a way to disable sampling in such cases? So, sampling. Sampling really depends on your back-end system and how it works. There are some systems, Zipkin for example, that only do upfront sampling: they basically flip a coin and decide up front whether a trace will be sampled. Other systems do what's called tail-based sampling, and Lightstep is one of them. The way we work is that we don't do any upfront sampling at all; we collect all the data into our satellite layer, which we referenced earlier, and based on what we see there we choose what to sample. That lets us do things like heavily sample errors or identify outliers. We want to weight the exemplars we capture toward rare, weird things: not just errors, but also, say, a request in the p99 of latency. We prefer capturing these strange, unusual events, and the best way to do that is to capture all the data first and then only choose to store some of it. Unfortunately, you can't really mix and match; there isn't a way to do this without storing the data somewhere. We might be able to bake a bit of this into the collectors at some point, but probably not soon, because a collector only knows about the data that's been sent to it; it has no way to see the spans from other processes that participated in the transaction but sent their data to a different collector. So for now I'd say that kind of nuanced sampling is an analysis tool feature. Hopefully that answers it.

Ariel asks: when should we prefer baggage over request parameters? That's a good question. I would say you should prefer request parameters; let me put it that way. You should prefer writing normal code the way normal people write code, and not get grabby with this observability machinery or the unobtrusive way it passes things around. It's basically the same rule you see for the Go context: the way the Go core community talks about when to put something in a context versus a function parameter. Same idea. The point of baggage is that you may need these values way down in your system, and adding all the code to bucket-brigade a value from one request parameter to the next, just to get it to a service three layers deep, is a huge amount of work. The purpose of baggage is to let you flow these indices and correlations through your system without doing that work.
So basically, at this point you should use baggage just for flowing correlations; I don't recommend using it for other things. And obviously, if you don't need to put something in baggage, if it's easy enough to get as a request parameter and it makes sense as one, you should put it there. Hopefully that helps.

Okay, the next question: is there a common pattern you recommend for instrumenting a code base of more than a million lines of code? Yes, the whole next section of the talk is about that, so we'll get to it in just a second.

Then: are you aware of any studies on the overhead OpenTelemetry introduces to a service, in terms of CPU and memory footprint? We do not currently have performance benchmarks for OpenTelemetry outside of the collector; the collector has performance benchmarks. As part of going GA, though, we want those kinds of benchmarks run on every release so we can catch regressions. So I don't have a number for you at this point, but the idea is that what OpenTelemetry runs out of the box should be low-overhead enough to run in production; the out-of-the-box instrumentation tries to thread the needle of giving you enough information to be useful without being so noisy that you wish it wasn't there.

Cool. And then: "I see, thanks." No problem. Okey dokey, let's get into phase two: how do you roll this stuff out?

There are some questions you have to ask when you're trying to roll this out, and this especially goes for someone with a giant code base. You have a bajillion lines of code, and not only that, you probably have a lot of teams; it's not all your code, there are other services, it's a distributed system. So there's a combination of doing all that typing to get this stuff in everywhere and getting buy-in from other people in your engineering organization, because you can't really take an ad hoc approach to tracing very well. And it is true that the bulk of the coding work is going to be in the instrumentation. If you look at all the code in OpenTelemetry, the SDKs, the clients, the back ends, setting up and installing that stuff really isn't a lot of code; almost all the code is instrumentation, whether that's the instrumentation libraries we're trying to get added or instrumentation code within your application. That's where the bulk of the work will be. So if you have a very large code base, you do want to be thoughtful about how you approach this and try to avoid boiling the ocean.

A good way to do that, and to get buy-in, is to find a real problem, rather than trying to get everything instrumented everywhere, because that might fail. Identify some transaction you care a lot about. Ideally it's something that already has a known latency or performance issue, or something wrong with it that you'd like to get to the bottom of but have been having trouble with. If you don't have that, just start with something high value: pick a particular transaction path that matters and try to get that instrumented. And when you do it, we tend to say breadth first, not depth first, by which we mean: try to get basic
instrumentation added to each service. You don't need really detailed instrumentation to get started; you just need the minimal amount that covers your framework and your network requests and gets context propagation working. It's a much better approach to get those basics in and get a complete trace of something you care about reported to the system you're using to observe it. With that kind of focus, you get to the point where you're seeing value before you've done all the work, instead of doing all the work and then turning around to see what was valuable about it. If you have a large code base, this is the number one thing to look at.

I also really have to emphasize that you don't need to put in a bajillion logs for this stuff to be useful. The standard attributes you get are going to be useful, the basic coarse-grained latency and error reporting you get is going to be useful, and all of that will feel better than the way you were poking through logs before. So you really do want to get there.

The other thing that's really important for large systems is to centralize observability in some way, and usually that requires other aspects of your distributed system, your deployment and operations, to be centralized in some way too. You may have a giant code base with lots of services, but hopefully those services share a lot of common frameworks and components. That centralization point is where you want to go first: see if you can get those shared components instrumented.

Then, of course, once you've got that first trace going, you can expand from there and start looking at other services. There's a thing we jokingly call blame-driven development (really, I'm joking): people trying to figure out where latency or an error is coming from will trace their own services, rule them out because they can clearly see the problem isn't there, and conclude the issue has to be in some service downstream of the ones they're looking at. You can take that tracing data to the downstream team and say: look, we're seeing these latency numbers, we're pretty sure it's coming from over here, can you add tracing to your service so we can find out where in your service it is, or whether it's actually in a service you're calling? That approach is a nice way to tie some value to the goal of adding tracing.

And how does it go wrong? So many ways, but the number one thing we see is this: because you can't really get started with tracing in a corner, you need some contiguous amount of the system instrumented for it to be useful, the way it goes sad is that someone, perhaps one of you on this workshop, is really interested in tracing and wants to champion it, but doesn't have a lot of leverage to make it happen. It starts happening in an ad hoc manner, you go around asking different teams to please add this, some might do it, and it
comes out inconsistent, and it never reaches critical mass. That's really the thing to avoid: adding tracing ad hoc, here and there, without some kind of plan, unless you have a very tiny system. You want to make sure that when people first try out tracing and get to know these tools, they're working with good data, that the tools are set up the way they're supposed to be and reporting the kind of data you'd want to see. That's easier to do if you, as the person championing the tracing effort, can make sure those initial traces are decent. So that, I'd say, is the number one way things go wrong: an unfocused effort. And that's basically what this slide says: if you don't have a project plan or project management, things can go wrong, because again you're dealing with multiple teams in a larger organization; this isn't a little library you can just start using in a corner. As much as possible, I recommend having some centralized resources in your organization about how to do this: documentation, but also helpers. A lot of this OpenTelemetry machinery is meant to apply to many different situations, but you may have a few very straightforward situations that you repeat over and over, so helper libraries, or anything that makes it easy for developers who are not observability nuts to just copy and paste code, really help a lot (see the sketch after this section). Those are the pitfalls and best practices of getting started with this stuff.

I'm going to leave you with a little rollout cheat sheet, since we're coming up to the end. Just to go over the top of it: I do think there are four languages that are getting to the point where you can use them in production: Go, Python, Java, and JavaScript, both the browser and Node.js versions. If you're trying to get buy-in from your org, starting from a known pain point, or some kind of win you could get from tracing, is more helpful than trying to boil the ocean. You can find a lot of docs both on opentelemetry.io and on otel.lightstep.com; I'm only just getting started there, but I plan on adding a lot of resources, so if you're interested and want to see more, please let me know. I'll keep posting updates about the content I'm creating on Twitter, so you can follow me there if you want to keep track of it. And last but not least, I didn't do a shout-out about this before, but I should do it here: we're starting a Discord at Lightstep to answer these kinds of questions. So if you get done with this workshop and have more questions, or you're getting started with this stuff and run into trouble, then besides the OpenTelemetry Gitter rooms you can come talk directly to us; I'll be in there, Austin Parker will be in there, and other Lightsteppers too. I'm adding the link to our chat real quick. That's another great resource if you're trying to get quick answers as you get rolling.

And that's what I've got. Thank you all for coming. I'm going to turn back to questions at this point, but that's the end of the official lecture.
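As an example of the kind of internal helper worth centralizing, here is a small sketch that wraps tracer.Start, error recording, status setting, and End() so application code can't forget any of them; the package name, tracer name, and helper name are all made up for illustration.

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/codes"
)

// WithSpan runs fn inside a child span named name, recording any error,
// setting the error status, and always ending the span.
func WithSpan(ctx context.Context, name string, fn func(context.Context) error) error {
	ctx, span := otel.Tracer("example.com/internal/tracing").Start(ctx, name)
	defer span.End()

	if err := fn(ctx); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	return nil
}
```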
Let me see if I can answer the questions already in here, but if you have more, please put them in. Otherwise, it's been super great; thank you so much for attending.

Okay, Kevin asks: how does OpenTelemetry account for clock skew among different servers? It doesn't. That is the answer. There's no good way to account for it in the sense of trying to manipulate the data or somehow get a handle on the cause of the clock skew and correct it. I've seen attempts at clock correction in distributed systems, and it just doesn't work well, because at the end of the day it's a heuristic. The clock problem is a real, fundamental problem, and there isn't a way for OpenTelemetry to get around it with some tricky clock correction. That said, what OpenTelemetry does is simply show you the timestamps. If you do end up with clock skew between two processes, you can really see it in the trace; it leaps out at you, because you'll have spans that are obviously connected to each other yet ridiculously separated in time, or inverted, so that the thing that happens before looks like it happens after. By just honestly showing you the data, I think that's the best thing OpenTelemetry and tracing systems in general can do about clock skew: not try to correct for it, but honestly show you the times so you can figure out for yourself that something is going on. But clock skew won't affect things like parent-child relationships or the actual graph you're forming, because that isn't based on timestamps, it's based on IDs, so the clock skew issue isn't critical in that respect.

"This is going great." Oh, thank you so much. "Please do share the recording." Yes, we're totally going to share this, and I plan on turning content like this into some shorter YouTube videos and written content as well, so I definitely want to get it out there. Thank you.

And the last question we have at this time: are launchers Lightstep-specific? They're not. The features they offer right now, the configuration options, are tilted toward the stuff Lightstep needs; that's why you have a Lightstep access token in there, because we have to set it on a gRPC header and it's really obnoxious to do that by hand. So the options we care about are the ones baked into the launchers at this time, but the launchers are 100% OpenTelemetry compatible. Just to be clear, I've been promoting this idea of OpenTelemetry distros, and this is our version of what we think a distro should look like. One super important rule I have about distros is that they can't fork OpenTelemetry or block you off from using its other features. So you can use the launchers, and if you need configuration they don't give you, you can just use the vanilla OpenTelemetry version of those configuration options; they will all still work, and they don't compete with the launcher. Basically, run the launcher first, and then do any additional bespoke OpenTelemetry configuration you want after that. That's my recommendation.
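To illustrate that ordering, here is a small sketch, under the same launcher assumptions as above: the distro is configured first, and then any plain OpenTelemetry configuration it doesn't expose is applied afterwards. The propagator tweak is just an arbitrary example of "bespoke" config, not something from the talk.

```go
package main

import (
	"github.com/lightstep/otel-launcher-go/launcher"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func main() {
	// 1. Let the distro set up exporters, resources, and sane defaults.
	ls := launcher.ConfigureOpentelemetry(
		launcher.WithServiceName("my-service"),
	)
	defer ls.Shutdown()

	// 2. Anything the launcher does not expose can still be set through the
	// vanilla OpenTelemetry APIs afterwards; the two do not conflict.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, propagation.Baggage{},
	))
}
```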
I'm imagining other systems out there that also have tricky things you might have to set up, access tokens, plug-ins, or sampling, and I'm hoping those systems will come out with some kind of OpenTelemetry distro too, so you don't have end users installing the exporter plug-in but forgetting to install the sampling plug-in and getting crazy results. So that's where I hope it goes.
Info
Channel: Lightstep
Views: 2,956
Keywords: opentelemetry, APM, OTel, #OpenTelemetry, distributed tracing, #Golang, #Go
Id: yQpyIrdxmQc
Length: 128min 22sec (7702 seconds)
Published: Mon Oct 19 2020